The Puppet Labs Issue Tracker has Moved: https://tickets.puppetlabs.com

Bug #2211

Facter timeouts reap all subprocesses thus confusing Puppet

Added by John Florian over 5 years ago. Updated 11 months ago.

Status:AcceptedStart date:04/29/2009
Priority:HighDue date:
Assignee:-% Done:

0%

Category:library
Target version:-
Keywords: Affected Facter version:
Branch:

We've Moved!

Ticket tracking is now hosted in JIRA: https://tickets.puppetlabs.com

This ticket is now tracked at: https://tickets.puppetlabs.com/browse/FACT-150


Description

It is no longer possible to have puppet install packages via yum/rpm if the network interface is not bound to an IP address. Our use case requires using puppet in the non-daemon mode and this is possible for us because the system will have all necessary manifests and other necessary files locally. This worked just fine with 0.24.6 on Fedora 10, but began failing upon the upgrade to 0.24.8.

See the attachments for failure messages and a code diff that seems to have introduced the regression. If I revert this one change, things work nicely once again. Looks like a very simple fix if it weren’t for the ominous looking comment in the code. :–)

package-installation-failure-log - shows error messages for failed rpm package installation (1.56 KB) John Florian, 04/29/2009 01:18 pm

autoload-regression-diff - shows diff that introduced regression (524 Bytes) John Florian, 04/29/2009 01:18 pm

error-with-backtrace - shows back trace when waitpid2 fails after executing '/bin/rpm --version' (32.8 KB) John Florian, 05/01/2009 12:50 pm

sigchld.patch Magnifier (356 Bytes) John Florian, 05/01/2009 03:47 pm

0001-Patch-Facter-is-collecting-zombies-by-PID-after-time.patch Magnifier (4.99 KB) Daniel Grafe, 04/16/2012 01:44 am


Related issues

Related to Puppet - Bug #8239: zombie getconf processes, possibly from facter Accepted 07/06/2011

History

#1 Updated by James Turnbull over 5 years ago

  • Status changed from Unreviewed to Needs Decision
  • Assignee set to Luke Kanies
  • Target version set to 0.25.0

#2 Updated by John Florian over 5 years ago

Just to clarify: Our use case requires the non-daemon mode and for package installation to succeed despite a working network connection because puppet is running on hosts that can be isolated from the network for periods of time. These systems run a read-only, Live OS image from CompactFlash and thus loose all the work puppet did done previously, if they get rebooted. The idea is that if the host has everything local, puppet can reconfigure the host to the last known setup.

This used to work great, but does not now.

#3 Updated by Luke Kanies over 5 years ago

  • Status changed from Needs Decision to Needs More Information

Can you provide a bit more information? What failure do you get without a network connection?

I’m pretty stumped as to how this line could affect your ability to install packages, but if it is, yeah, we should fix it.

#4 Updated by John Florian over 5 years ago

Luke, Please see the new attachment (error-with-backtrace). I revised the exception handler at this key point to have it print a backtrace. I have been able to trace this down to where puppet is trying to determine what package managers are available and one of the tests it runs is ‘/bin/rpm —version’, which makes sense as a test. I followed this problem down further to where it calls Process.waitpid2 to see if that command succeeded or not and it appears like it begins waiting too soon. In other words, it’s as if the process hasn’t been spawned yet, so waitpid2 raises an error essentially saying there is no such process. I’m not sure how that would be possible though or why having an IP address bound to the interface makes any difference. We started strace'ing the whole mess and oddly enough things work then, but we suspect that is because strace is slowing things down enough that the PID does exist by the time waitpid2 is called.

I’ll keep digging and post any findings.

#5 Updated by Luke Kanies over 5 years ago

  • Status changed from Needs More Information to Accepted
  • Target version changed from 0.25.0 to 4

No way this is making it into 0.25, I think. When code shows up, we’ll schedule it.

#6 Updated by John Florian over 5 years ago

Great news! My boss lent a hand on this and I think we have a fix. While I’m sure the attached patch we made is not necessarily the right place to add this trap, it does resolve the problem. This note from “man 2 waitpid” explains his reasoning for the patch.

POSIX.1-2001  specifies that if the disposition of SIGCHLD is set to SIG_IGN or
the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that
terminate  do  not  become zombies and a call to wait() or waitpid() will block
until all children have terminated, and then fail with  errno  set  to  ECHILD.
(The  original  POSIX  standard left the behavior of setting SIGCHLD to SIG_IGN
unspecified.  Note that even though  the  default  disposition  of  SIGCHLD  is
"ignore",  explicitly  setting  the disposition to SIG_IGN results in different
treatment of zombie process children.)  Linux 2.6 conforms to  this  specifica-
tion.  However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call
is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD
were  not  being  ignored, that is, the call blocks until the next child termi-
nates and then returns the process ID and status of that child.

#7 Updated by John Florian over 5 years ago

  • File sigchld.patch added

Doh! Here’s the patch.

#8 Updated by Luke Kanies over 5 years ago

  • Target version changed from 4 to 0.25.0

#9 Updated by John Florian over 5 years ago

Great news! I now know how you can reproduce this and exactly what the problem is. Please disregard the previous patch.

The fault actually lies with Facter::Util::Resolution in the “value” function when there is a timeout for a sub-process. The code tries to reap the zombie sub-process that timed-out, but doesn’t necessarily reap just its own (see facter/util/resolution.rb:128 and the Process.waitall). What I see is effectively a race condition between the puppet thread that is looking for a package provider (spawns a thread for ‘rpm —version’) and this zombie reaper in facter (spawns a thread to Process.waitall). When facter launches a sub-process for ‘host #{hostname}’ and the network interface is down, the host command takes about 10 seconds to timeout. Facter sees this is taking too long and goes to reap it as a zombie. Unfortunately Process.waitall also happens to reap the sub-process for ‘rpm —version’ and thus steals that exit code away from puppet and BOOM the whole thing goes ugly quickly after that.

Please let me know if you need more info on how to reproduce this, but I suspect you should have no difficulty at this point. It looks like the proper solution might involve having Facter::Util::Resolution.exec() also return the child PID so that the zombie reaper can wait for that PID specifically rather than all PIDs.

#10 Updated by Luke Kanies over 5 years ago

This is, um, amazing. Very well done in tracking that down.

What’s a reasonable fix? Do we have to switch Facter.exec to use a fork/exec explicitly so that can always get the child process IDs?

#11 Updated by John Florian over 5 years ago

Luke Kanies wrote:

This is, um, amazing. Very well done in tracking that down.

You can blame my gifted boss for that one. :–) I wish I could take credit for it though. It was really fun when my light bulb came on after he explained to me what was happening.

What’s a reasonable fix? Do we have to switch Facter.exec to use a fork/exec explicitly so that can always get the child process IDs?

I think that’s exactly what it needs. I’ve been trying to come up with a patch since this is a blocker for my own project which will most likely be released before 0.25 gets out (and Fedora catches up). As an aside, any idea when 0.25 will be released?

Since this is my first swim in Ruby waters, I’ve had to do lots of reading. After finding this nice table: http://whynotwiki.com/Ruby/Process_management#How_do_I_execute_an_external_program.3F I concluded that popen might work best here so that we can get the stdout and the PID and then return both. Do we need to capture stderr too?

#12 Updated by John Florian over 5 years ago

Just wanted to mention that I spent too much time yesterday trying to come up with a patch for this. Ruby has just too many subtle “features” for me (a Ruby newb) to get a working patch here with a reasonable effort. So instead, I thought I’d share some findings.

Facter.exec cannot just return the output and PID together since if the read on the popen pipe blocks and the call to exec never returns and consequently Facter.value won’t know what PID to reap. This is exactly the case we’re trying to fix.

You could make the PID an instance variable so that Facter.exec can set it before reading from the pipe. That way Facter.value could access the correct PID value. (I tried this, but stumbled on Ruby’s variable scoping and what may be pubic/private/static methods — I bet it would be easy for someone fluent in Ruby though.)

You could also just inline the Facter.exec code within Facter.value.

#13 Updated by Luke Kanies over 5 years ago

  • Subject changed from puppet won't install packages if network interface does not have an IP address bound to Facter timeouts reap all subprocesses thus confusing Puppet

This is really a Facter bug, right? There’s nothing we can do in Puppet to fix it.

How do we switch the project?

#14 Updated by Luke Kanies over 5 years ago

  • Project changed from Puppet to Facter
  • Target version deleted (0.25.0)

#15 Updated by James Turnbull over 5 years ago

  • Category set to library
  • Target version set to 1.6.0

#16 Updated by Paul Nasrat over 5 years ago

I’m happy to look into this.

#17 Updated by Marcin Owsiany about 5 years ago

I first noticed the problem when working with puppet on a train, without any internet connection. In such situation, two problems appear:

1) Mildly annoying, frequent 2-second pauses followed by:

      Timed out seeking value for ipaddress
I think this is because when I have no network connection, there is
no nameserver in /etc/resolv.conf, and despite my hostname being in
/etc/hosts, some resolver library does stupid things (I think it
tries to contact a purely imaginary nameserver running on localhost,
but I haven't thorougly investigated that).  This is a more general
system issue, outside of puppet, as "host `hostname`" also hangs for
5 seconds before returning:
      ;; connection timed out; no servers could be reached

2) More seriously, most – if not all – resoruces fail with error such as:

      err: //beczulka/common/common_tweaks/common_bash_rc/File[/root/.bashrc]: Failed
      to retrieve current state of resource: No child processes
Adding --trace on the command line shows for example:
        /usr/lib/ruby/1.8/puppet/util.rb:290:in `waitpid2'
        /usr/lib/ruby/1.8/puppet/util.rb:290:in `execute'
        /usr/lib/ruby/1.8/puppet/util/diff.rb:12:in `diff'
        /usr/lib/ruby/1.8/puppet/util/diff.rb:67:in `string_file_diff'
        /usr/lib/ruby/1.8/puppet/type/file/source.rb:163:in `insync?'
        /usr/lib/ruby/1.8/puppet/type.rb:965:in `propertychanges'
        /usr/lib/ruby/1.8/puppet/util/posix.rb:156:in `find_all'
        /usr/lib/ruby/1.8/puppet/type.rb:963:in `each'
        /usr/lib/ruby/1.8/puppet/type.rb:963:in `find_all'
        /usr/lib/ruby/1.8/puppet/type.rb:963:in `propertychanges'
        /usr/lib/ruby/1.8/puppet/type/file.rb:1123:in `propertychanges'
        /usr/lib/ruby/1.8/puppet/type.rb:833:in `evaluate'
        /usr/lib/ruby/1.8/puppet/transaction.rb:60:in `apply'
        /usr/lib/ruby/1.8/puppet/transaction.rb:239:in `eval_resource'
        /usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
        /usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
        /usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
        /usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
        /usr/lib/ruby/1.8/puppet/transaction.rb:238:in `eval_resource'
        /usr/lib/ruby/1.8/puppet/transaction.rb:310:in `evaluate'
        /usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
        /usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
        /usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
        /usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
        /usr/lib/ruby/1.8/puppet/transaction.rb:309:in `evaluate'
        /usr/lib/ruby/1.8/puppet/transaction.rb:303:in `collect'
        /usr/lib/ruby/1.8/puppet/transaction.rb:303:in `evaluate'
        /usr/lib/ruby/1.8/puppet/node/catalog.rb:124:in `apply'
        /usr/bin/puppet:233

Now, when I do have internet connection, things are working just fine, so there must be some connection between (1) and (2). However ignoring it for a while, I tried to strace puppet to see what happens around before the wait() call. This was a bit hard. Stracing all syscalls takes a huge amount of time and produces massive output. So in the end, after ignoring some syscalls with “-e ‘!nanosleep,gettimeofday,sigprocmask’” I managed to catch puppet hanging forever after executing the following (line numbers prepended for easy commenting later):

     1  10764 pipe([5, 7])                      = 0
     2  10764 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cf0b18) = 10833
     3  10833 close(5 
     4  10764 close(7 
     5  10833 <... close resumed> )             = 0
     6  10764 <... close resumed> )             = 0
     7  10833 dup2(7, 1 
     8  10764 fcntl64(5, F_GETFL 
     9  10833 <... dup2 resumed> )              = 1
    10  10764 <... fcntl64 resumed> )           = 0 (flags O_RDONLY)
    11  10833 close(7 
    12  10764 fstat64(5,  
    13  10833 <... close resumed> )             = 0
    14  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    15  10833 close(3 
    16  10764 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 
    17  10833 <... close resumed> )             = 0
    18  10764 <... mmap2 resumed> )             = 0xb7843000
    19  10833 close(4 
    20  10764 _llseek(5, 0,  
    21  10833 <... close resumed> )             = 0
    22  10764 <... _llseek resumed> 0xbf7f5270, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    23  10833 close(5 
    24  10764 fstat64(5,  
    25  10833 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    26  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    27  10833 close(6)                          = 0
    28  10833 close(7 
    29  10764 select(6, [5], [], [], {0, 16463} 
    30  10833 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    31  10833 close(8)                          = -1 EBADF (Bad file descriptor)
    32  10833 close(9)                          = -1 EBADF (Bad file descriptor)
    33  10833 close(10)                         = -1 EBADF (Bad file descriptor)
    34  10833 close(11)                         = -1 EBADF (Bad file descriptor)
    35  [...]
    36  10833 close(253)                        = -1 EBADF (Bad file descriptor)
    37  10833 close(254)                        = -1 EBADF (Bad file descriptor)
    38  10833 close(255)                        = -1 EBADF (Bad file descriptor)
    39  10833 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
    40  10833 brk(0)                            = 0x8058000
    41  10833 uname({sys="Linux", node="beczulka", ...}) = 0
    42  [...]
    43  10833 write(1, "eth0      Link encap:Ethernet  H"..., 378 
    44  10764 <... select resumed> )            = 1 (in [5], left {0, 40000})
    45  10833 <... write resumed> )             = 378
    46  10833 exit_group(0)                     = ?
    47  10764 --- SIGCHLD (Child exited) @ 0 (0) ---
    48  10764 select(6, [5], [], [], {0, 44385}) = 1 (in [5], left {0, 44385})
    49  10764 read(5, "eth0      Link encap:Ethernet  H"..., 1024) = 378
    50  10764 select(6, [5], [], [], {0, 44025}) = 1 (in [5], left {0, 44025})
    51  10764 read(5, "", 1024)                 = 0
    52  10764 close(5)                          = 0
    53  10764 munmap(0xb7843000, 4096)          = 0
    54  10764 rt_sigaction(SIGHUP, {SIG_IGN}, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
    55  10764 rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb7f438f0, [QUIT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
    56  10764 rt_sigaction(SIGINT, {SIG_IGN}, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
    57  10764 waitpid(10833, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 10833
    58  10764 rt_sigaction(SIGHUP, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
    59  10764 rt_sigaction(SIGQUIT, {0xb7f438f0, [QUIT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
    60  10764 rt_sigaction(SIGINT, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
    61  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
    62  10764 pipe([5, 7])                      = 0
    63  10764 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cf0b18) = 10834
    64  10834 close(5 
    65  10764 close(7 
    66  10834 <... close resumed> )             = 0
    67  10764 <... close resumed> )             = 0
    68  10834 dup2(7, 1 
    69  10764 fcntl64(5, F_GETFL 
    70  10834 <... dup2 resumed> )              = 1
    71  10764 <... fcntl64 resumed> )           = 0 (flags O_RDONLY)
    72  10834 close(7 
    73  10764 fstat64(5,  
    74  10834 <... close resumed> )             = 0
    75  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    76  10834 close(3 
    77  10764 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 
    78  10834 <... close resumed> )             = 0
    79  10764 <... mmap2 resumed> )             = 0xb7843000
    80  10834 close(4 
    81  10764 _llseek(5, 0,  
    82  10834 <... close resumed> )             = 0
    83  10764 <... _llseek resumed> 0xbf7f6510, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    84  10834 close(5 
    85  10764 fstat64(5,  
    86  10834 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    87  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    88  10834 close(6)                          = 0
    89  10834 close(7 
    90  10764 select(6, [5], [], [], {0, 32148} 
    91  10834 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    92  10834 close(8)                          = -1 EBADF (Bad file descriptor)
    93  10834 close(9)                          = -1 EBADF (Bad file descriptor)
    94  10834 close(10)                         = -1 EBADF (Bad file descriptor)
    95  [...]
    96  10834 close(253)                        = -1 EBADF (Bad file descriptor)
    97  10834 close(254)                        = -1 EBADF (Bad file descriptor)
    98  10834 close(255)                        = -1 EBADF (Bad file descriptor)
    99  10834 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
   100  10834 brk(0)                            = 0x8058000
   101  10834 uname({sys="Linux", node="beczulka", ...}) = 0
   102  10834 write(1, "lo        Link encap:Local Loopb"..., 781 
   103  10764 <... select resumed> )            = 1 (in [5], left {0, 0})
   104  10834 <... write resumed> )             = 781
   105  10834 exit_group(0)                     = ?
   106  10764 --- SIGCHLD (Child exited) @ 0 (0) ---
   107  10764 select(6, [5], [], [], {0, 0})    = 1 (in [5], left {0, 0})
   108  10764 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 10834
   109  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   110  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   111  10764 read(5, "lo        Link encap:Local Loopb"..., 1024) = 781
   112  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   113  10764 select(6, [5], [], [], {0, 59201}) = 1 (in [5], left {0, 59201})
   114  10764 read(5, "", 1024)                 = 0
   115  10764 close(5)                          = 0
   116  10764 munmap(0xb7843000, 4096)          = 0
   117  10764 rt_sigaction(SIGHUP, {SIG_IGN}, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
   118  10764 rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb7f438f0, [QUIT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
   119  10764 rt_sigaction(SIGINT, {SIG_IGN}, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
   120  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   121  10764 select(0, [], [], [], {0, 57695}) = 0 (Timeout)
   122  10764 select(0, [], [], [], {0, 4040})  = 0 (Timeout)
   123  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   124  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   125  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   126  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   127  10764 select(0, [], [], [], {0, 59357}) = 0 (Timeout)
   128  10764 select(0, [], [], [], {0, 920})   = 0 (Timeout)
   129  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   130  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   131  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   132  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   133  10764 select(0, [], [], [], {0, 59303}) = 0 (Timeout)
   134  10764 select(0, [], [], [], {0, 901})   = 0 (Timeout)
   135  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   136  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   137  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   138  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   139  10764 select(0, [], [], [], {0, 59654}) = 0 (Timeout)
        [...] goes on forever

My comments: * lines 1-38 is basic pipe setup/fork/close all FDs * 39-46: “ifconfig wlan0” does its thing * 47-60: puppet reads ifconfig’s output first, and then reaps it

So far, so good – all is well. But now notice what happens on the following lines: the command this time is just “ifconfig”, and all looks similar until line 106. However then: * line 107: there the timeout in select() is zero, * line 108: the child is reaped straight away, * line 111: only now the output is read, * line 120: puppet makes an attempt to reap the same child again!

The circumstances are slightly different, because in the first case ECHILD is returned and reported, but in the second case, no error is returned, and puppet waits forever for a child with PID 10834 to exit, which will never happen.

However I think that the reason for a race might be the same – some generic SIGCHLD handler that comes in and reaps the child before some other code can call wait()?

FYI grepping through the log reveals that the following execve()s were called before it hung:

10766 execve("/usr/bin/which", ["which", "which"], [/* 21 vars */]) = 0
10768 execve("/usr/bin/which", ["which", "uname"], [/* 21 vars */]) = 0
10769 execve("/bin/uname", ["uname", "-s"], [/* 18 vars */]) = 0
10771 execve("/usr/bin/which", ["which", "uname"], [/* 21 vars */]) = 0
10772 execve("/bin/uname", ["uname", "-r"], [/* 18 vars */]) = 0
10774 execve("/usr/bin/which", commit:"which", "hostname"], [/* 21 vars */) = 0
10775 execve("/bin/hostname", ["hostname"], [/* 18 vars */]) = 0
10777 execve("/usr/bin/which", ["which", "dnsdomainname"], [/* 21 vars */]) = 0
10778 execve("/bin/dnsdomainname", ["dnsdomainname"], [/* 18 vars */]) = 0
10780 execve("/usr/bin/which", ["which", "domainname"], [/* 21 vars */]) = 0
10782 execve("/usr/bin/which", commit:"which", "sendmail"], [/* 21 vars */) = 0
10784 execve("/usr/bin/which", ["which", "dnsdomainname"], [/* 21 vars */]) = 0
10785 execve("/bin/dnsdomainname", ["dnsdomainname"], [/* 18 vars */]) = 0
10787 execve("/usr/bin/which", ["which", "domainname"], [/* 21 vars */]) = 0
10789 execve("/usr/bin/which", ["which", "lsb_release"], [/* 21 vars */]) = 0
10791 execve("/usr/bin/lsb_release", commit:"lsb_release", "-a"], [/* 21 vars */) = 0
10792 <... execve resumed> )            = 0
10793 execve("/usr/bin/apt-cache", ["apt-cache", "policy"], [/* 21 vars */]) = 0
10796 <... execve resumed> )            = 0
10797 execve("/usr/bin/dpkg-query", ["dpkg-query", "-f", "${Version} ${Provides}\n", "-W", "lsb-core", "lsb-cxx", "lsb-graphics", "lsb-desktop", "lsb-qt4", "lsb-languages", "lsb-multimedia",
+"lsb-printing"], [/* 21 vars */]) = 0
10802 execve("/sbin/ifconfig", ["/sbin/ifconfig", "-a"], [/* 18 vars */]) = 0
10804 execve("/usr/sbin/dmidecode", commit:"/usr/sbin/dmidecode"], [/* 21 vars */) = 0
10805 execve("/sbin/ifconfig", ["/sbin/ifconfig", "-a"], [/* 18 vars */]) = 0
10806 execve("/sbin/ip", ["/sbin/ip", "link", "show", "wmaster0"], [/* 18 vars */]) = 0
10807 execve("/sbin/ifconfig", ["/sbin/ifconfig", "wmaster0"], [/* 18 vars */]) = 0
10808 execve("/sbin/ip", ["/sbin/ip", "link", "show", "eth0"], [/* 18 vars */]) = 0
10809 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
10811 execve("/usr/bin/which", ["which", "uptime"], [/* 21 vars */]) = 0
10812 execve("/usr/bin/uptime", ["uptime"], [/* 18 vars */]) = 0
10813 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
10816 execve("/usr/bin/which", ["which", "host"], [/* 21 vars */]) = 0
10817 execve("/usr/bin/host", ["host", "beczulka"], [/* 18 vars */]) = 0
10821 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
10823 execve("/usr/bin/which", ["which", "grep"], [/* 21 vars */]) = 0
10826 execve("/bin/grep", ["grep", "physical id", "/proc/cpuinfo"], [/* 21 vars */] 
10826 <... execve resumed> )            = 0
10827 execve("/usr/bin/cut", ["cut", "-d:", "-f", "2"], [/* 21 vars */] 
10827 <... execve resumed> )            = 0
10828 execve("/usr/bin/sort", ["sort", "-u"], [/* 21 vars */] 
10828 <... execve resumed> )            = 0
10829 execve("/usr/bin/wc", ["wc", "-l"], [/* 21 vars */] 
10829 <... execve resumed> )            = 0
10830 execve("/sbin/ip", ["/sbin/ip", "link", "show", "wlan0"], [/* 18 vars */]) = 0
10831 execve("/sbin/ifconfig", ["/sbin/ifconfig", "wlan0"], [/* 18 vars */]) = 0
10832 execve("/sbin/ip", ["/sbin/ip", "link", "show", "eth0"], [/* 18 vars */]) = 0
10833 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
10834 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0

This is puppet 0.24.8-1gg4 (the differences between this and Debian 0.24.8-1 only seem to touch packaging).

#18 Updated by Marcin Owsiany about 5 years ago

Here’s a minimal test case I could come up with, 100% repeatable for me:

$ touch /tmp/a
$ echo 'file { "/tmp/a": content => "a" }' > a.pp
$ sudo puppet --noop --debug --trace a.pp
debug: Creating default schedules
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: File does not exist
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: Changing ensure
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: 1 change(s)
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]/ensure: is absent, should be directory (noop)
debug: Finishing transaction -607699778 with 1 changes
debug: //File[/tmp/a]/checksum: Initializing checksum hash
debug: //File[/tmp/a]: Creating checksum {md5}d41d8cd98f00b204e9800998ecf8427e
debug: //File[/tmp/a]/content: Executing 'diff /tmp/a /tmp/puppet-diffing.21138.0'
0a1
> a
\ No newline at end of file
debug: //File[/tmp/a]: Changing content
debug: //File[/tmp/a]: 1 change(s)
notice: //File[/tmp/a]/content: is {md5}d41d8cd98f00b204e9800998ecf8427e, should be {md5}0cc175b9c0f1b6a831c399e269772661 (noop)
debug: Finishing transaction -606708568 with 1 changes
  • so far OK.
  • now shut down the network
$ sudo puppet --noop --debug --trace a.pp
Timed out seeking value for ipaddress
Timed out seeking value for ipaddress
Timed out seeking value for ipaddress
Timed out seeking value for ipaddress
warning: Could not retrieve fact ipaddress
warning: Could not retrieve fact fqdn
debug: Creating default schedules
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: File does not exist
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: Changing ensure
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: 1 change(s)
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]/ensure: is absent, should be directory (noop)
debug: Finishing transaction -606479048 with 1 changes
debug: //File[/tmp/a]/checksum: Initializing checksum hash
debug: //File[/tmp/a]: Creating checksum {md5}d41d8cd98f00b204e9800998ecf8427e
debug: //File[/tmp/a]/content: Executing 'diff /tmp/a /tmp/puppet-diffing.21345.0'
/usr/lib/ruby/1.8/puppet/util.rb:290:in `waitpid2'
/usr/lib/ruby/1.8/puppet/util.rb:290:in `execute'
/usr/lib/ruby/1.8/puppet/util/diff.rb:12:in `diff'
/usr/lib/ruby/1.8/puppet/util/diff.rb:67:in `string_file_diff'
/usr/lib/ruby/1.8/puppet/type/file/content.rb:48:in `insync?'
/usr/lib/ruby/1.8/puppet/type.rb:965:in `propertychanges'
/usr/lib/ruby/1.8/puppet/util/checksums.rb:17:in `find_all'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `each'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `find_all'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `propertychanges'
/usr/lib/ruby/1.8/puppet/type/file.rb:1123:in `propertychanges'
/usr/lib/ruby/1.8/puppet/type.rb:833:in `evaluate'
/usr/lib/ruby/1.8/puppet/transaction.rb:60:in `apply'
/usr/lib/ruby/1.8/puppet/transaction.rb:239:in `eval_resource'
/usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
/usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
/usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
/usr/lib/ruby/1.8/puppet/transaction.rb:238:in `eval_resource'
/usr/lib/ruby/1.8/puppet/transaction.rb:310:in `evaluate'
/usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
/usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
/usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
/usr/lib/ruby/1.8/puppet/transaction.rb:309:in `evaluate'
/usr/lib/ruby/1.8/puppet/transaction.rb:303:in `collect'
/usr/lib/ruby/1.8/puppet/transaction.rb:303:in `evaluate'
/usr/lib/ruby/1.8/puppet/node/catalog.rb:124:in `apply'
/usr/bin/puppet:233
err: //File[/tmp/a]: Failed to retrieve current state of resource: No child processes
debug: Finishing transaction -613159068 with 0 changes

#19 Updated by Luke Kanies over 4 years ago

  • Assignee changed from Luke Kanies to Markus Roberts

#20 Updated by Matt Robinson almost 4 years ago

Pairing with William just now this doesn’t seem to be an issue anymore… Not sure why.

#21 Updated by Paul Nasrat almost 4 years ago

Matt/William – could you be a bit more specific on what you tested, on what OS/versions. The behaviour is still clearly wrong in facter however it may be fiddly to reproduce.

#22 Updated by Michael Scherer over 3 years ago

I can replicate this when puppet is run for some time ( restarting usually make the problem disappear, but it come back later ). I run it on 2 mandriva 2010.2 server, one in a vm, another one on a quite loaded server ( iso creation ). On one of the server that I tried to inspect, there was 2 puppet process, one of them being waiting on exec(‘hostname’) to return.

# rpm -q puppet facter
puppet-2.6.2-1mdv2010.1
facter-1.5.7-1mdv2010.0

I can update to latest stable release if needed.

#23 Updated by Michael Scherer over 3 years ago

I started to notice this on a 3rd server, and the issue occurs ( thanks to nagios to monitor it ) around 4h in the morning, just when the logrotate job is running. So I suspect the issue is linked to the load and I/O. While this will not be a proper fix for the core issue, I will see if the patch I sent for 7138 ( for not using exec for hostname ) lessen the problem for me.

#24 Updated by Michael Stahnke over 3 years ago

  • Target version changed from 1.6.0 to 144

#25 Updated by Adrien Thebo about 3 years ago

  • Assignee changed from Markus Roberts to Adrien Thebo

#26 Updated by Adrien Thebo almost 3 years ago

  • File deleted (sigchld.patch)

#27 Updated by Adrien Thebo almost 3 years ago

  • File deleted (sigchld.patch)

#28 Updated by John Florian over 2 years ago

Matt Robinson wrote:

Pairing with William just now this doesn’t seem to be an issue anymore… Not sure why.

I have to disagree. I am still seeing this as of:

facter-1.6.6-1.fc16.i686 puppet-2.6.14-1.fc16.noarch

I’m afraid I’m losing confidence in puppet/facter as a technology if such a critical issue hasn’t been tackled in 3 years now, especially given the near solution I provided early on. I don’t mean to disparage anyone who’s worked on this issue — I truly appreciate the efforts — but I would really expected to have a solution shortly after I made comment 12. I really don’t want to learn ruby too, I’ve enough on my plate already. Nor do I really relish the idea of abandoning puppet after I’ve given it this much commitment, but at some point I must start considering other options here.

#29 Updated by Daniel Grafe over 2 years ago

We were also facing this issue. Since we are initially setting up our machines locally with puppet agent we don’t have a working network infrastructure (DNS resolving, interfaces, static host file…). This is causing the host tool to run for a longer time than the fact timeout allows and the greedy “block on waitall” thread is created. This thread should collect only Zombies but it is also collecting the child processes from the following puppet run. Puppet then fails on waitpid2 and terminates with an exception.

Please find attached our patch providing a workaround. It is basically realizing the exec after forking so we have a pid. The pid is saved in a class variable (big ugly hack!) so we can use this for waitpid in the case of a timeout.

This workaround is probably not the final solution for this issue and the exec for windows is still untested since we’re running on linux only.

#30 Updated by Adrien Thebo over 2 years ago

  • Assignee deleted (Adrien Thebo)

#31 Updated by Jeff Weiss over 2 years ago

  • Target version deleted (144)

#32 Updated by John Florian over 2 years ago

This dangerous bug is still occurring with:

puppet-2.7.13-1.fc17 facter-1.6.6-1.fc17 ruby-1.9.3.194-11.fc17

This will commonly look something like this now:

Jun 14 14:45:08 puppet puppet-master[9646]: (//myhost.example.com/Puppet) Could not retrieve catalog from remote server: Could not intern from pson: Could not autoload package: Could not autoload /usr/share/ruby/vendor_ruby/puppet/provider/package/aptrpm.rb: Could not autoload rpm: No child processes
Jun 14 14:45:08 puppet puppet-master[9646]: (//myhost.example.com/Puppet) Using cached catalog
Jun 14 14:45:08 puppet puppet-master[9646]: (//myhost.example.com/Puppet) Failed to apply catalog: Parameter ensure failed: Validate method failed for class ensure: undefined method `satisfies?' for nil:NilClass

At least that’s the case when it kills one of its own children. I have to wonder how often a critical server PID gets killed instead, which would is just the opposite of what you’d want puppet to do in an enterprise.

#33 Updated by Luke Kanies 11 months ago

Redmine Issue #2211 has been migrated to JIRA:

https://tickets.puppetlabs.com/browse/FACT-150

Also available in: Atom PDF