Bug #1973

Performance problems

Added by Hari Sekhon almost 3 years ago. Updated over 1 year ago.

Status:Closed Start date:02/16/2009
Priority:High Due date:
Assignee:- % Done:

0%

Category:-
Target version:-
Affected Puppet version:0.24.7 Branch:
Keywords:
Votes: 1

Description

Puppetd is chewing up a ridiculous amount of CPU and the boss at work has unfortunately noticed and described it as “unacceptable”.

I would really like to see puppet run more efficiently. We are running fast multi-cpu servers with multi-cores have servers running oracle databases and computational services so when someone says that puppet is making a noticeable dent in cpu time, I don’t have much of a come back for that, other than “that’s life!”. I understand that the multi-cpus and core probably makes no difference as I believe this is pretty much single threaded but the point is they are new and fast servers.

My fear is that my boss will turn around and say “Get rid of that bloatware, I want my servers to go fast…”

Unfortunately my boss is not the only person who has called puppet a resource hog… help!

History

Updated by Andrew Shafer almost 3 years ago

  • Status changed from Unreviewed to Needs More Information

What version of Puppet are you running?

What is the system setup? OS version? 64 bit? CPUs? RAM?

What is ridiculous amounts of CPU? What is the cpu when the agent is applying a catalog? What is the cpu when the agent is idle?

There are performance optimizations coming, but any data you can provide might help.

Updated by Hari Sekhon almost 3 years ago

What version of Puppet are you running?

Mainly 0.24.6, a couple are 0.24.4

What is the system setup? OS version? 64 bit? CPUs? RAM?

CentOS 5 combination of x86 and x86_64 systems. Dual Xeons 2 – 2.8GHz, dual or quad cores. Gentoo x86 and x86_64 Dual Opeterons 2GHz

What is ridiculous amounts of CPU?

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23061 root 25 0 815m 812m 2536 R 100 20.1 403:16.41 puppetd
13108 userX 25 0 98596 94m 496 R 100 2.3 118:46.72 ProgramX

ProgramX is a proprietary company program that does computation and runs nearly continuously throughout the day at 100% CPU usage.

So as you can see from cpu time, puppetd is consuming huge amounts of cpu relatively. Perhaps ProgramX hasn’t been living as long as puppet to rack up the cpu time but when puppet runs it is chewing up 100% CPU and contending with ProgramX, which is doing serious work.

400+ hours of cpu time to copy over a few configuration files and restart one or two services!! Not to mention the 800MB of Ram it is consuming…

Here is a bad case on another server from yesterday that I got a chewing out for:

Cpu1 : 99.7% us, 0.1% sy, 0.0% ni, 0.2% id, 0.0% wa, 0.0% hi, 0.0% si Cpu2 : 0.8% us, 6.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.3% si Cpu3 : 0.2% us, 1.3% sy, 0.0% ni, 98.4% id, 0.0% wa, 0.0% hi, 0.1% si Mem: 4147412k total, 4006216k used, 141196k free, 4792k buffers Swap: 0k total, 0k used, 0k free, 2140744k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 26281 root 25 0 1761m 1.7g 488 R 99.8 43.4 70:43.48 puppetd

So puppetd is comsuming approaching 2GB of Ram and chewing up 100% cpu!!! It’s single threaded, not using the other cpus/cores by the look of it.

Yesterday puppet also caused a database machine to die twice from Out-of-memory killer (not surprising since it took an extra 2GB out of the system!). This box is running a 1.5TB oracle database with 8GB of ram.

So puppet causing havoc on these quite powerful machines that were previously very solid is a seriously achievement!!

What is the cpu when the agent is applying a catalog?

100%

What is the cpu when the agent is idle?

0% when idle as you can see above

Updated by James Turnbull almost 3 years ago

  • Affected Puppet version changed from 0.24.7 to 0.24.6

Updated by Luke Kanies almost 3 years ago

If Puppet is taking 100% of your cpu for more than the brief periods of time that the catalog is running, you’ve got something else going on. That’s certainly not normal.

Might you be running a Debian variant with -w 0 and not yet have a certificate? That’s a problem in the Debian package and is the only case I know of where puppetd might take significant cpu time beyond its catalog run.

Or have you tuned the runinterval so that puppetd is running constantly?

Updated by Brice Figureau almost 3 years ago

harisekhon wrote:

CentOS 5 combination of x86 and x86_64 systems. Dual Xeons 2 – 2.8GHz, dual or quad cores. Gentoo x86 and x86_64 Dual Opeterons 2GHz

What version of ruby are you using? Some people reported strange things with ruby 1.8.5.

Also are you using recursive file distribution (ie file { “/path/to”: source => “…”, recursive => true }). If yes, then that might be the cause of your memory consumption issue (see bug #1469), in a word, when using recursive distribution, puppetd has to scan all the path below the copy point. If this subtree is deep and dense, it can take quite some time and use lots of memory to build all the md5 checksum to see if files need to get synchronized or not. My advice if you use such recursive copy is to limit the recursion to 2 or 3 (ie recurse => 3).

Updated by Hari Sekhon almost 3 years ago

ruby-1.8.5 and ruby-1.8.6

I have used one or two recursive file copies, but their depth is usually no more than ½ dirs.

  • Shouldn’t puppet only do md5 sums for files for which remote copies are found?
  • Shouldn’t the remote copies already have an MD5 from a puppetmaster where puppet:// is used instead of having puppetd repeat this MD5 across the whole infrastructure needlessly?

If I was to write a program in python to md5 10-20 files, I would expect that to finish in 1-2 secs, not be taking minutes of 100% cpu and causing OOM on loaded machines.

(I do not want to disable md5 checking in favour of ctime/mtime)

  • Does puppet still use md5s if you are only setting perms and not sourcing files? (ie no source param is given)
  • I want recurse true and not 2/3 when ensuring perms on a directory tree consisting of 50-100 files, it should only be statting them and be really fast, right?

I’ve been through every recurse in every manifest (and I have a default of false in site)

There are only a few recurses set for dirs in which there are no source files set, only owner/group/mode settings, so if these are MD5ing without intelligence, then this would explain a lot.

Updated by Brice Figureau almost 3 years ago

Preamble:

You didn’t provide any information about how long puppetd is using 100% of CPU, is it always or only when it checks out its configuration from the master.

Can you also try to run one of your client in debug mode, then look at the log when puppetd is using 100% of cpu to see what it is doing. That’ll help diagnose the issue.

harisekhon wrote:

ruby-1.8.5 and ruby-1.8.6

I have used one or two recursive file copies, but their depth is usually no more than ½ dirs.

How many files are there (on the client side, not on the source side)?

  • Shouldn’t puppet only do md5 sums for files for which remote copies are found?

That’s the whole point of bug #1469.

  • Shouldn’t the remote copies already have an MD5 from a puppetmaster where puppet:// is used instead of having puppetd repeat this MD5 across the whole infrastructure needlessly?

The source MD5 is computed by the master I guess, but each client has to compute the local version of the checksum file. Someone could have changed the file behind puppet’s back.

If I was to write a program in python to md5 10-20 files, I would expect that to finish in 1-2 secs, not be taking minutes of 100% cpu and causing OOM on loaded machines.

I never said it was the cause of your issues. I just wanted to point that if you have on your clients tons of files that are behing a recursive file{} with source/checksum on, it can take some time and memory for puppet to stat(1) all this files, builds the checksums (and cache it). If you have only a handful of file, then this is certainly not the issue.

(I do not want to disable md5 checking in favour of ctime/mtime)

  • Does puppet still use md5s if you are only setting perms and not sourcing files? (ie no source param is given)

My understanding is that any file {} resource with checksum => XXX will compute the checksum, independently of using a source or not.

  • I want recurse true and not 2/3 when ensuring perms on a directory tree consisting of 50-100 files, it should only be statting them and be really fast, right?

Yes, provided it doesn’t compute checksum on those. Maybe try “checksum => undef” on those recursive file{} can help?

I tried the following little example:

File { checksum => md5 }
file {
    "/tmp/test":
        recurse => true, mode => 0644
}

Under /tmp/test I have several dirs and files. let’s look at the result

[brice@arsenic]% RUBYLIB=`pwd`/lib bin/puppet --debug --trace /tmp/test.pp
debug: File[/tmp/test]: Adding default for checksum
debug: Creating default schedules
debug: Failed to load library 'ldap' for feature 'ldap'
debug: Failed to load library 'shadow' for feature 'libshadow'
debug: Finishing transaction 69992060055820 with 0 changes
debug: //File[/tmp/test]/checksum: Initializing checksum hash
debug: //File[/tmp/test]/checksum: computed checksum 
debug: //File[/tmp/test]: Creating checksum {mtime}Thu Feb 19 11:11:49 +0100 2009
debug: //File[/tmp/test/a]/checksum: Initializing checksum hash
debug: //File[/tmp/test/a]/checksum: computed checksum 
debug: //File[/tmp/test/a]: Creating checksum {mtime}Thu Feb 19 11:11:50 +0100 2009
debug: //File[/tmp/test/a/b]/checksum: Initializing checksum hash
debug: //File[/tmp/test/a/b]/checksum: computed checksum 
debug: //File[/tmp/test/a/b]: Creating checksum {mtime}Thu Feb 19 11:11:51 +0100 2009
debug: //File[/tmp/test/a/b/c]/checksum: Initializing checksum hash
debug: //File[/tmp/test/a/b/c]/checksum: computed checksum 
debug: //File[/tmp/test/a/b/c]: Creating checksum {mtime}Thu Feb 19 11:11:58 +0100 2009
debug: //File[/tmp/test/a/b/c/e]/checksum: Initializing checksum hash
debug: //File[/tmp/test/a/b/c/e]/checksum: computed checksum 
debug: //File[/tmp/test/a/b/c/e]: Creating checksum {md5}d41d8cd98f00b204e9800998ecf8427e
debug: Finishing transaction 69992062707440 with 0 changes

Note how puppet created checksums, even though you don’t need them.

Using this manifest:

File { checksum => md5 }
file {
    "/tmp/test":
        recurse => true, mode => 0644n checksum => undef
}

I get the correct result:

[brice@arsenic]% RUBYLIB=`pwd`/lib bin/puppet --debug --trace /tmp/test.pp
debug: Creating default schedules
debug: Failed to load library 'ldap' for feature 'ldap'
debug: Failed to load library 'shadow' for feature 'libshadow'
debug: Finishing transaction 69843146739140 with 0 changes
debug: Finishing transaction 69843149395900 with 0 changes

Note: this time it didn’t compute unneeded checksums.

I’ve been through every recurse in every manifest (and I have a default of false in site)

There are only a few recurses set for dirs in which there are no source files set, only owner/group/mode settings, so if these are MD5ing without intelligence, then this would explain a lot.

So my point: if you have checksum => md5 (or mtime) as default, and you don’t want checksum computations on every file for those recursive file {}, selectively checksum => undef on them and see if that’s better.

If you want to really debug the issue, disable puppetd on one of your server, then launch puppetd manually with —test —debug —trace and see in the console output if there are checksum computations for lots of files.

Updated by Hari Sekhon almost 3 years ago

  • Affected Puppet version changed from 0.24.6 to 0.24.7

The documentation says you can use checksum => nosum but when I try this I got

Failed to retrieve current state of resource: Invalid checksum type nosum

Puppet 0.24.7

Surely nosum should be valid?

I went through the changelog but there was no reference to “nosum”, so I can’t see if this was done in some developmental or newer version.

The performance implication of this is that I have to set checksum => mtime as the least expensive thing after nothing, but on servers with tonnes of files, this means statting thousands or millions of files for mtime!!!

Updated by Brice Figureau almost 3 years ago

harisekhon wrote:

The documentation says you can use checksum => nosum but when I try this I got

Failed to retrieve current state of resource: Invalid checksum type nosum

Puppet 0.24.7

Surely nosum should be valid?

I went through the changelog but there was no reference to “nosum”, so I can’t see if this was done in some developmental or newer version.

Use checksum => undef for the same effect as nosum.

Updated by Luke Kanies almost 3 years ago

harisekhon wrote:

The documentation says you can use checksum => nosum but when I try this I got

Failed to retrieve current state of resource: Invalid checksum type nosum

Puppet 0.24.7

Surely nosum should be valid?

I went through the changelog but there was no reference to “nosum”, so I can’t see if this was done in some developmental or newer version.

The performance implication of this is that I have to set checksum => mtime as the least expensive thing after nothing, but on servers with tonnes of files, this means statting thousands or millions of files for mtime!!!

The ‘nosum’ thing is an artifact of internals that don’t exist any more and I just haven’t removed it yet.

I’ll remove it, but it’s unrelated to your problem.

Updated by Trevor Vaughan almost 3 years ago

Have you tried ‘nice'ing puppet to 19?

This will at least keep it from interfering as much with other CPU intensive processes though it will cause it to take much longer to run.

On a different note, I believe the default checksum is ‘md5’; should this be changed to mtime, or something less heavy? Many people seem to be running into inadvertent problems with md5 due to large recursive directories with potentially large files.

Updated by Hari Sekhon almost 3 years ago

Nicing puppet doesn’t stop it from ultimately consuming the same resources which is really too much.

I already have changed the default summing, and only reactivated md5 on non-recursive config files as a performance workaround.

However, md5 is necessary for config as a lot of my configs are regenerated frequently on my puppetmaster so they would be reapplied constantly without a checksum as the timestamps always change.

Updated by James Turnbull over 2 years ago

Have you tried 0.25.1 as an alternative to one of the 0.24.x releases?

Updated by Hari Sekhon over 2 years ago

I did tweak things at the time as well as re-writing thousands of lines of my puppet code to replace the slower bits and also had puppet auto update itself to the latest version but since this was many months ago I’ve in-between changed company to one I much prefer.. so it’s time to build a new puppet infrastructure again and then I’ll see if performance is better… although it won’t be a fair comparison anyway… but I can hope.

Updated by James Turnbull over 1 year ago

  • Status changed from Needs More Information to Closed

Also available in: Atom PDF