Bug #1131

puppetmaster memory leak

Added by Redmine Admin about 4 years ago. Updated almost 4 years ago.

Status:Closed Start date:
Priority:Normal Due date:
Assignee:Redmine Admin % Done:

0%

Category:-
Target version:0.24.4
Affected Puppet version: Branch:
Keywords:
Votes: 0

Description

afert upgrading from 0.23.2 to 0.24.2 puppetmaster eats up server memory.

it logs

Puppet (err): Could not call: Cannot allocate memory - fork(2)

until restarted or crash.

puppetmaster-memory-week.png (62.1 kB) Redmine Admin, 03/12/2008 05:57 pm

memory-month-puppet.png (62.2 kB) Redmine Admin, 03/14/2008 01:42 am

diff (327 Bytes) Redmine Admin, 03/14/2008 07:53 pm

memory_profiler.rb (2.8 kB) Luke Kanies, 03/17/2008 12:06 am


Related issues

related to Puppet - Bug #1010: puppet/puppetmaster randomly corrupts file{} resources, s... Closed

History

Updated by Luke Kanies about 4 years ago

Is there any more information you can give about your environment? Lots of file serving? None? How often does the memory spike happen? How quickly?

Updated by Redmine Admin about 4 years ago

  • server manages 40 nodes
  • nodes located in other networks, behind NAT
  • apache2 in front of mongrel+puppet
  • serving only some configuration files. not lots of file serving.
  • brings Fedora 5 on 2GB RAM CPU x4 Xeon to its knees after 4 / 5 days. we are now restarting it daily
  • rubyversion 1.8.6, puppetversion 0.24.2
  • use of external_nodes
  • external_nodes script calls wget which retrives node info from rails app

May this be related to #1130 ?

Updated by AJ Christensen about 4 years ago

Possibly related to http://reductivelabs.com/trac/puppet/ticket/1010 – only one of the ‘side effects’ of memory leaking was fixed, not the cause;

Updated by Redmine Admin about 4 years ago

yes. it also replaced some files with empty directories in one of its crisis.

Updated by Redmine Admin about 4 years ago

Replying to [comment:4 descala]:

yes. it also replaced some files with empty directories in one of its crisis.

Did it do that under 0.24.2? If, so, we have a larger problem on our hands, since that means the work-around that caused #1010 to be closed doesn’t work.

Updated by Redmine Admin about 4 years ago

Replying to [comment:5 adamhjk]:

Replying to [comment:4 descala]:

yes. it also replaced some files with empty directories in one of its crisis.

Did it do that under 0.24.2? If, so, we have a larger problem on our hands, since that means the work-around that caused #1010 to be closed doesn’t work.

yes. 0.24.2 on server and clients.

after puppetmaster restart got this kind of report in 4 nodes out of 40. puppet trying to set files right:

Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Recursively backing up to filebucket
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Not removing directory; use 'force' to override
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Recursively backing up to filebucket
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Not removing directory; use 'force' to override
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (err): Could not rename tmp /usr/local/lib/ruby/site_ruby/1.8/facter/rubygems_version.rb for replacing: Is a directory - /usr/local/lib/ruby/site_ruby/1.8/facter/rubygems_version.rb.puppettmp or /usr/local/lib/ruby/site_ruby/1.8/facter/rubygems_version.rb

no files were corrupted. only changed to an empty directory.

Updated by Redmine Admin about 4 years ago

I tried to re-open #1010 in reference to this bug, but I can’t seem to figure out where in the Trac UI one would go about doing that. If #1010 is not fixed, people should be advised to use 0.23.2 until we can validate that corrupted file resources will no longer be utilized on the client.

Are we checksum-ing the corrupted resources?

Either this bug or a re-opened #1010 should have its priority raised to highest, and severity bumped to critical. If we can’t re-open #1010, it should be this one.

Updated by Luke Kanies about 4 years ago

Anyone want to volunteer to help debug the problem, since no one suffering it has attempted to study it, from what I can see, and I can’t reproduce it.

Updated by Blake Barnett about 4 years ago

Here’s a curious log snippet from a machine that got corrupted files (client and server both running 0.23.2):

Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[os]/Node[os::redhat]/File[/etc/filter-syslog/redhat]/source (err): Could not describe /os/Redhat/filter-syslog: can't modify frozen object
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[newsyslog]/File[/etc/filter-syslog.conf]/source (err): Could not describe /newsyslog/filter-syslog.conf: wrong status line: ""
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Nodecommit:postfix]/File[/etc/filter-syslog/postfix/source (err): Could not describe /postfix/filter-syslog: wrong status line: ""
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[afs]/File[/usr/vice/etc/CellServDB]/source (err): Could not describe /afs/CellServDB: wrong status line: "400 Bad Request"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[afs]/File[/etc/filter-syslog/afs]/source (err): Could not describe /afs/filter-syslog: wrong status line: ""
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[ntp]/File[/etc/ntp/ntpservers]/source (err): Could not describe /ntp/ntpservers: wrong status line: "

Bad Request

" Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[os]/Node[os::redhat]/File[/etc/profile.d/usrlocal.sh]/source (err): Could not describe /os/Redhat/profile.d/usrlocal.sh: wrong status line: "

Your browser sent a request that this server could not understand.
" Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[ntp]/File[/etc/filter-syslog/ntp]/source (err): Could not describe /ntp/filter-syslog: wrong status line: "Reason: You're speaking plain HTTP to an SSL-enabled server port.
" Tue Mar 11 14:29:52 -0700 2008 /Main[top]/Node[basenode]/Node[tmpclean]/File[/etc/cron.daily/tmpwatch]/source (err): Could not describe /tmpclean/tmpwatch: wrong status line: "Instead use the HTTPS scheme to access this URL, please.
" Tue Mar 11 14:29:52 -0700 2008 /Main[top]/Node[basenode]/Node[virt_all_users]/Node[client_gsbweb]/Remotefile[/usr/share/ssl/openssl.cnf]/File[/usr/share/ssl/openssl.cnf]/source (err): Could not describe /dist/clients/gsbweb/usr/share/ssl/openssl.cnf: wrong status line: "

Hint: https://henson.stanford.edu:8140/

" Tue Mar 11 14:29:53 -0700 2008 /Main[top]/Node[basenode]/Node[gsb-vatuqa]/Node[client_gsbweb_gsb-vatuqa]/Node[client_gsbweb_gsb-vatu-common]/Remotefile[/etc/newsyslog.monthly/tomcat]/File[/etc/newsyslog.monthly/tomcat]/source (err): Could not describe /dist/clients/gsbweb/etc/newsyslog.monthly/tomcat.gsb-vatuqa: wrong status line: "
" Tue Mar 11 14:29:53 -0700 2008 /Main[top]/Node[basenode]/Node[postfix]/Postmap[/etc/postfix/transport]/Remotefile[/etc/postfix/transport]/File[/etc/postfix/transport]/source (err): Could not describe /postfix/transport: wrong status line: "
Apache/2.2.3 (Debian) proxy_html/2.5 mod_ssl/2.2.3 [[OpenSSL]]/0.9.8c [[WebAuth]]/3.5.3 Server at henson.stanford.edu Port 8140
" Tue Mar 11 14:29:53 -0700 2008 /Main[top]/Node[basenode]/Node[user_root]/Remotefile[/root/.bashrc]/File[/root/.bashrc]/source (err): Could not describe /dist/users/rootuser/bashrc: wrong status line: "" Tue Mar 11 14:29:55 -0700 2008 /Main[top]/Node[basenode]/Node[xinetd]/Service[xinetd] (notice): Triggering 'refresh' from 2 dependencies Tue Mar 11 14:30:03 -0700 2008 /Main[top]/Node[basenode]/Node[virt_all_users]/Node[client_gsbweb]/Node[apache]/Exec[apache-reload] (notice): Triggering 'refresh' from 1 dependencies Tue Mar 11 14:30:05 -0700 2008 /Main[top]/Node[basenode]/Node[iptables]/Exec[rebuild_iptables] (notice): Triggering 'refresh' from 4 dependencies

Specifically it looks like things start going south with “can’t modify frozen object”, maybe a red herring?

Updated by Luke Kanies about 4 years ago

I spent all day Friday and Saturday trying to reproduce this and was not able to.

If you would like to do stand-alone testing, it’s very easy using tools I’ve already written (

ext/puppet-test
, in the repository. You can specify as many repetitions as you want with
—repeat N
, and as many forks as you want with
—fork
(although I’ve had more luck using separate processes).

It accepts any normal Puppet argument (e.g., server, certname).

To test configuration compiling, use:

sudo puppet-test

To test file retrieving, use:

# Note that you should leave off the "puppet://" here.
sudo puppet-test --retrieve "/path/to/file"

To test file describing, use:

sudo puppet-test --describe "/path/to/file"

You can peruse the file and see what other suites are available. I’ll be making a commit today that adds the ability to repeat the server-side tasks, like compiling and parsing.

Updated by Luke Kanies about 4 years ago

I’ve added a library that I was using to try to track memory growth. To use it, stick it in your RUBYLIB somewhere, then add this to your puppetmasterd:

require 'memory_profiler'
[[MemoryProfiler]].start(:deltas => false)

Add it around line 281 — right before start/join stuff done on webrick/mongrel.

This file will create /tmp/memory_log and update a file in there with info on what instances it thinks are being retained across GC runs. Basically, you get a ratio of current instances to the number of those that were also present 10 seconds ago. It’s sorted by total instances, so String always shows up first, but you should be able to see anything that grows quickly based on memory leakage. (Assuming the memory leakage is of Ruby objects, not leakage in Ruby C libraries.

Updated by AJ Christensen about 4 years ago

Thanks for that Luke – I’ve got your memory profiler loaded on a mongrelized puppetmaster runing inside a screen session with —trace now as per our IRC conversation. Version is 0.24.1. Memory is climbing slowly. When It gets upwards of 250MB~ is when I start seeing issues, so I’ll paste up the memory profiler stuff then.

Updated by AJ Christensen about 4 years ago

14:56 < fujin> lak: 180MB ram usage 14:56 < fujin> Top 20 retained 14:56 < fujin> Array: 362/31875 (0.01) 14:56 < fujin> String: 2/159653 (0.00) 14:56 < lak> any puppet classes showing up in the list? 14:57 < lak> if you want better data, it’s probably more useful to sort by the ratio 14:57 < fujin> That’s it 14:57 < fujin> (with no active connections) 14:57 < lak> i thought about adding it, but it was like 11pm last night (sat night for me) and i

         just couldn't convince myself to

14:57 < fujin> Just those two :0 14:57 < fujin> when a node connects, more things happen 14:57 < lak> really? 14:57 < lak> ah, ok 14:57 < lak> but then it’s all reclaimed 14:57 < lak> ok 14:57 < fujin> all of your Puppet magical classes are instantiated, then GC’d correctly 14:57 < lak> so that pretty clearly indicates it’s a ruby leak, not a puppet leak

Valgrinding it now, It’s obviously an Array of some-sort somewhere in Puppet which is causing this leak – None of the puppet classes are the leaky ones. I’m hoping the valgrind information will be helpful.

Updated by AJ Christensen about 4 years ago

What a fail.

14:56 < fujin> lak: 180MB ram usage
14:56 < fujin> Top 20 retained
14:56 < fujin> Array: 362/31875 (0.01)
14:56 < fujin> String: 2/159653 (0.00)
14:56 < lak> any puppet classes showing up in the list?
14:57 < lak> if you want better data, it's probably more useful to sort by the ratio
14:57 < fujin> That's it
14:57 < fujin> (with no active connections)
14:57 < lak> i thought about adding it, but it was like 11pm last night (sat night for me) and i
             just couldn't convince myself to
14:57 < fujin> Just those two :0
14:57 < fujin> when a node connects, more things happen
14:57 < lak> really?
14:57 < lak> ah, ok
14:57 < lak> but then it's all reclaimed
14:57 < lak> ok
14:57 < fujin> all of your Puppet magical classes are instantiated, then GC'd correctly
14:57 < lak> so that pretty clearly indicates it's a ruby leak, not a puppet leak

Updated by AJ Christensen about 4 years ago

Hmph; valgrind makes it so slow that it cannot even compile configurations in time. Is there some way to adjust the timeout for configuration compilation?

Updated by AJ Christensen about 4 years ago

Blagh, genconfig answered my question. Going to see if I can get it to run with a configtimeout of 300 now.

Updated by Redmine Admin about 4 years ago

Okay. If you have an actively flakey puppet master, here are the steps you can follow to attach to it with GDB and do some introspection (in Ruby, no less!)

h1. Attaching to puppetmasterd with GDB

h2. Install GDB

Follow whatever steps make sense for your platform. Something like:

$ yum install gdb

Or

$ apt-get install gdb

Would work just fine.

h2. Populate your .gdbinit and .gdb directories

In ~/.gbinit:

define session-ruby
  source ~/.gdb/ruby
end

Next, you need to download http://eigenclass.org/hiki.rb?c=plugin;plugin=attach_download;p=ruby+live+process+introspection;file_name=ruby

And place it in ~/.gdb/ruby.

$ mkdir -p ~/.gdb
$ wget 'http://eigenclass.org/hiki.rb?c=plugin;plugin=attach_download;p=ruby+live+process+introspection;file_name=ruby'
$ mv ruby ~/.gdb

h2. Attach to the puppetmaster process, load the ruby helpers, and redirect stdout:

$ ps ax | grep puppemasterd
21170 ttyp5    S+     0:02 /usr/bin/ruby /srv/ruby/1.8.6/bin/puppetmasterd --no-daemonize --debug --trace

The important part here is the very first segment of the line (the PID).

$ gdb /usr/bin/ruby 21170

The first argument to gdb should be the full path to the ruby executable that is running your puppetmasterd.

h2. Start your debug session

From above, you should see a bunch of output like this:

...
Loaded symbols for /srv/ruby/1.8.6/lib/ruby/1.8/i686-linux/nkf.so
0xbfffe402 in +kernel_vsyscall ()

And then you’ll wind up at…

(gdb)

First things first, load up the ruby helpers.

(gdb) session-ruby

Then, redirect stdout to a file, so we can capture the output for later.

(gdb) redirect_stdout

Open up a second terminal, and you can watch the output from your debug session:

$ tail -f /tmp/ruby-debug.PID

Where PID above is equal to the PID you attached to with gdb.

Now that you are watching, flip back to your gdb session and..

h2. Get some data

(gdb) eval "caller"
(gdb) rb_object_counts
(gdb) eval "total = [[ObjectSpace]].each_object(Array){|x| puts '---'; puts x.inspect }; puts \"---\nTotal Arrays: #{total}\""

You should see a stack trace, a long count of the number of objects in the system. The last line is important, was Luke and Fujin think they are leaking Array’s specifically. That will print the value of every array the puppetmasterd has, separated by —–.

It should be a huge amount of data, but it ought to get us pointed in a more specific direction regarding the source of the Array’s.

Updated by James Turnbull about 4 years ago

Adam

Can you put this in the wiki – perhaps in the FAQ?

Thanks

Updated by Luke Kanies about 4 years ago

Note this thread on leaks, too:

http://groups.google.com/group/god-rb/browse_thread/thread/1cca2b7c4a581c2

I ran through the whole codebase and fixed all instances of this bug, and pushed it to the memory_leak_split_fix branch in my repository. In my testing, memory grows some, but stops growing at around 50MB and actually drops periodically, which never happened before.

Updated by Redmine Admin about 4 years ago

Okay, some more updates.

We discovered a very large (9K+ entries) array that seems to be related to the tagging of Puppet::Parser::Resources. You can see the array by doing something like:

(gdb) eval "by_size = Hash.new; total = [[ObjectSpace]].each_object(Array){|x| by_size[x.object_id] = x }; by_size.sort { |a,b| ar1.length <=> br1.length }.each { |elem| puts \"OBJ #{elemr1}: #{elemr0.inspect}\" }; nil"

That will print the object ID: size for every array. I’ve been doing something like:

(gdb) eval "ObjectSpace.each_object(Array){|x| puts x.inspect if x.length == LENGTH}; nil"

Where LENGTH is the output after the colon from above. That’ll show you the actual array contents that have grown so large.

Then:

(gdb) eval "total = [[ObjectSpace]].each_object(Puppet::Parser::Resource){|x| puts '---'; puts x.inspect }; puts \"---\nTotal Parser::Resource: #{total}\""

Will print out all the Puppet::Parser::Resource objects, which I believe is where that array is hiding. Track that guy down, and I think you’ll have all you need to figure this out.

Unless your Luke, at which point you might have enough to figure that leak out by poking around the tag stuff got implemented in the parser. :)

More to come.

Updated by Redmine Admin about 4 years ago

If you are affected by 1031, and also seeing issues with #1010 (files being replaced with bad contents, or turned magically into directories) please head over to #1010 and read my instructions. We need some more sample data, so we can narrow down whether #1010 and #1031 are actually related, or just friendly.

I promise to make it easy. Your gdb output could be the thing that helps fix 1010! Won’t that feel great?

Love, Adam

Updated by Luke Kanies about 4 years ago

The array is growing because of line 239 in lib/puppet/parser_support.rb.

I reproduced the problem with puppet-test, the ‘local_catalog’ suite. I just watched the biggest array and waited for one that got ever-bigger to show up, then froze it and watched for traces. It pointed me right to this line.

Well, technically, I just froze one that got above a certain size; that was pretty easy.

Now to see if I can fix it.

Updated by Luke Kanies about 4 years ago

The following diff fixes it, but apparently breaks something else:

diff --git a/lib/puppet/parser/parser_support.rb b/lib/puppet/parser/parser_support.rb
index ccfc4d4..4180806 100644
--- a/lib/puppet/parser/parser_support.rb
+++ b/lib/puppet/parser/parser_support.rb
@@ -235,8 +235,10 @@ class Puppet::Parser::Parser
             end
         end

-        unless filename == mod and ! @loaded.include?(mod)
-            @loaded << mod
+        return loaded if loaded or filename == mod
+
+        unless @loaded.include?(filename)
+            @loaded << filename
             # Then the individual file
             begin
                 import(filename)

Updated by Luke Kanies about 4 years ago

Okay, I fixed the leaking array in [f6325dc].

My changelog says I’m going to assume that this fixes the actual memory leak, although I’m personally a bit skeptical, since this leaking array should get reclaimed every time the files are reparsed, and people don’t seem to be seeing memory growing and then dropping.

Should I close this ticket and assume it’s fixed until I get other feedback, or what?

Is anyone willing to test it for me?

Updated by Luke Kanies about 4 years ago

  • Status changed from 1 to Closed
  • 7 set to fixed

I’ve also applied, in commit [18320b8], my code that fixes the split problem discussed in the ‘god’ memory leak at http://snurl.com/21zf8.

Given that this is two problems fixed, I’m going to close the ticket. If there are further memory growth issues, please let me know.

Updated by AJ Christensen about 4 years ago

I can’t see those commits in the 0.24.x branch – are they in another branch? Will cherry-pick and do testing to see if it is resolved once I work out where they are ;>

Updated by Luke Kanies about 4 years ago

They’re in commit:18320b8e3271f7d1d1702907be1ff420acfc8d2b and commit:f6325dceb3b10c300f421f540281bbd64bdc091e. I might have forgotten to push them last night.

Updated by AJ Christensen almost 4 years ago

  • Status changed from Closed to Unreviewed

Updated by AJ Christensen almost 4 years ago

Woops :>

Updated by James Turnbull almost 4 years ago

  • Status changed from Unreviewed to Closed

Also available in: Atom PDF