Bug #2211
Facter timeouts reap all subprocesses thus confusing Puppet
| Status: | Accepted | Start date: | 04/29/2009 | |
|---|---|---|---|---|
| Priority: | High | Due date: | ||
| Assignee: | - | % Done: | 0% |
|
| Category: | library | |||
| Target version: | 1.6.x | |||
| Keywords: | Affected Facter version: | |||
| Branch: | ||||
| Votes: | 5 |
Description
It is no longer possible to have puppet install packages via yum/rpm if the network interface is not bound to an IP address. Our use case requires using puppet in the non-daemon mode and this is possible for us because the system will have all necessary manifests and other necessary files locally. This worked just fine with 0.24.6 on Fedora 10, but began failing upon the upgrade to 0.24.8.
See the attachments for failure messages and a code diff that seems to have introduced the regression. If I revert this one change, things work nicely once again. Looks like a very simple fix if it weren’t for the ominous looking comment in the code. :–)
Related issues
History
Updated by James Turnbull about 3 years ago
- Status changed from Unreviewed to Needs Decision
- Assignee set to Luke Kanies
- Target version set to 0.25.0
Updated by John Florian about 3 years ago
Just to clarify: Our use case requires the non-daemon mode and for package installation to succeed despite a working network connection because puppet is running on hosts that can be isolated from the network for periods of time. These systems run a read-only, Live OS image from CompactFlash and thus loose all the work puppet did done previously, if they get rebooted. The idea is that if the host has everything local, puppet can reconfigure the host to the last known setup.
This used to work great, but does not now.
Updated by Luke Kanies about 3 years ago
- Status changed from Needs Decision to Needs More Information
Can you provide a bit more information? What failure do you get without a network connection?
I’m pretty stumped as to how this line could affect your ability to install packages, but if it is, yeah, we should fix it.
Updated by John Florian about 3 years ago
- File error-with-backtrace added
Luke, Please see the new attachment (error-with-backtrace). I revised the exception handler at this key point to have it print a backtrace. I have been able to trace this down to where puppet is trying to determine what package managers are available and one of the tests it runs is ‘/bin/rpm —version’, which makes sense as a test. I followed this problem down further to where it calls Process.waitpid2 to see if that command succeeded or not and it appears like it begins waiting too soon. In other words, it’s as if the process hasn’t been spawned yet, so waitpid2 raises an error essentially saying there is no such process. I’m not sure how that would be possible though or why having an IP address bound to the interface makes any difference. We started strace'ing the whole mess and oddly enough things work then, but we suspect that is because strace is slowing things down enough that the PID does exist by the time waitpid2 is called.
I’ll keep digging and post any findings.
Updated by Luke Kanies about 3 years ago
- Status changed from Needs More Information to Accepted
- Target version changed from 0.25.0 to 4
No way this is making it into 0.25, I think. When code shows up, we’ll schedule it.
Updated by John Florian about 3 years ago
Great news! My boss lent a hand on this and I think we have a fix. While I’m sure the attached patch we made is not necessarily the right place to add this trap, it does resolve the problem. This note from “man 2 waitpid” explains his reasoning for the patch.
POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is "ignore", explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.) Linux 2.6 conforms to this specifica- tion. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child termi- nates and then returns the process ID and status of that child.
Updated by Luke Kanies about 3 years ago
- Target version changed from 4 to 0.25.0
Updated by John Florian about 3 years ago
Great news! I now know how you can reproduce this and exactly what the problem is. Please disregard the previous patch.
The fault actually lies with Facter::Util::Resolution in the “value” function when there is a timeout for a sub-process. The code tries to reap the zombie sub-process that timed-out, but doesn’t necessarily reap just its own (see facter/util/resolution.rb:128 and the Process.waitall). What I see is effectively a race condition between the puppet thread that is looking for a package provider (spawns a thread for ‘rpm —version’) and this zombie reaper in facter (spawns a thread to Process.waitall). When facter launches a sub-process for ‘host #{hostname}’ and the network interface is down, the host command takes about 10 seconds to timeout. Facter sees this is taking too long and goes to reap it as a zombie. Unfortunately Process.waitall also happens to reap the sub-process for ‘rpm —version’ and thus steals that exit code away from puppet and BOOM the whole thing goes ugly quickly after that.
Please let me know if you need more info on how to reproduce this, but I suspect you should have no difficulty at this point. It looks like the proper solution might involve having Facter::Util::Resolution.exec() also return the child PID so that the zombie reaper can wait for that PID specifically rather than all PIDs.
Updated by Luke Kanies about 3 years ago
This is, um, amazing. Very well done in tracking that down.
What’s a reasonable fix? Do we have to switch Facter.exec to use a fork/exec explicitly so that can always get the child process IDs?
Updated by John Florian about 3 years ago
Luke Kanies wrote:
This is, um, amazing. Very well done in tracking that down.
You can blame my gifted boss for that one. :–) I wish I could take credit for it though. It was really fun when my light bulb came on after he explained to me what was happening.
What’s a reasonable fix? Do we have to switch Facter.exec to use a fork/exec explicitly so that can always get the child process IDs?
I think that’s exactly what it needs. I’ve been trying to come up with a patch since this is a blocker for my own project which will most likely be released before 0.25 gets out (and Fedora catches up). As an aside, any idea when 0.25 will be released?
Since this is my first swim in Ruby waters, I’ve had to do lots of reading. After finding this nice table: http://whynotwiki.com/Ruby/Process_management#How_do_I_execute_an_external_program.3F I concluded that popen might work best here so that we can get the stdout and the PID and then return both. Do we need to capture stderr too?
Updated by John Florian about 3 years ago
Just wanted to mention that I spent too much time yesterday trying to come up with a patch for this. Ruby has just too many subtle “features” for me (a Ruby newb) to get a working patch here with a reasonable effort. So instead, I thought I’d share some findings.
Facter.exec cannot just return the output and PID together since if the read on the popen pipe blocks and the call to exec never returns and consequently Facter.value won’t know what PID to reap. This is exactly the case we’re trying to fix.¶
You could make the PID an instance variable so that Facter.exec can set it before reading from the pipe. That way Facter.value could access the correct PID value. (I tried this, but stumbled on Ruby’s variable scoping and what may be pubic/private/static methods — I bet it would be easy for someone fluent in Ruby though.)¶
You could also just inline the Facter.exec code within Facter.value.¶
Updated by Luke Kanies almost 3 years ago
- Subject changed from puppet won't install packages if network interface does not have an IP address bound to Facter timeouts reap all subprocesses thus confusing Puppet
This is really a Facter bug, right? There’s nothing we can do in Puppet to fix it.
How do we switch the project?
Updated by Luke Kanies almost 3 years ago
- Project changed from Puppet to Facter
- Target version deleted (
0.25.0)
Updated by James Turnbull almost 3 years ago
- Category set to library
- Target version set to 1.6.0
Updated by Paul Nasrat almost 3 years ago
I’m happy to look into this.
Updated by Marcin Owsiany almost 3 years ago
I first noticed the problem when working with puppet on a train, without any internet connection. In such situation, two problems appear:
1) Mildly annoying, frequent 2-second pauses followed by:
Timed out seeking value for ipaddress
I think this is because when I have no network connection, there is
no nameserver in /etc/resolv.conf, and despite my hostname being in
/etc/hosts, some resolver library does stupid things (I think it
tries to contact a purely imaginary nameserver running on localhost,
but I haven't thorougly investigated that). This is a more general
system issue, outside of puppet, as "host `hostname`" also hangs for
5 seconds before returning:
;; connection timed out; no servers could be reached
2) More seriously, most – if not all – resoruces fail with error such as:
err: //beczulka/common/common_tweaks/common_bash_rc/File[/root/.bashrc]: Failed
to retrieve current state of resource: No child processes
Adding --trace on the command line shows for example:
/usr/lib/ruby/1.8/puppet/util.rb:290:in `waitpid2'
/usr/lib/ruby/1.8/puppet/util.rb:290:in `execute'
/usr/lib/ruby/1.8/puppet/util/diff.rb:12:in `diff'
/usr/lib/ruby/1.8/puppet/util/diff.rb:67:in `string_file_diff'
/usr/lib/ruby/1.8/puppet/type/file/source.rb:163:in `insync?'
/usr/lib/ruby/1.8/puppet/type.rb:965:in `propertychanges'
/usr/lib/ruby/1.8/puppet/util/posix.rb:156:in `find_all'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `each'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `find_all'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `propertychanges'
/usr/lib/ruby/1.8/puppet/type/file.rb:1123:in `propertychanges'
/usr/lib/ruby/1.8/puppet/type.rb:833:in `evaluate'
/usr/lib/ruby/1.8/puppet/transaction.rb:60:in `apply'
/usr/lib/ruby/1.8/puppet/transaction.rb:239:in `eval_resource'
/usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
/usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
/usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
/usr/lib/ruby/1.8/puppet/transaction.rb:238:in `eval_resource'
/usr/lib/ruby/1.8/puppet/transaction.rb:310:in `evaluate'
/usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
/usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
/usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
/usr/lib/ruby/1.8/puppet/transaction.rb:309:in `evaluate'
/usr/lib/ruby/1.8/puppet/transaction.rb:303:in `collect'
/usr/lib/ruby/1.8/puppet/transaction.rb:303:in `evaluate'
/usr/lib/ruby/1.8/puppet/node/catalog.rb:124:in `apply'
/usr/bin/puppet:233
Now, when I do have internet connection, things are working just fine, so there must be some connection between (1) and (2). However ignoring it for a while, I tried to strace puppet to see what happens around before the wait() call. This was a bit hard. Stracing all syscalls takes a huge amount of time and produces massive output. So in the end, after ignoring some syscalls with “-e ‘!nanosleep,gettimeofday,sigprocmask’” I managed to catch puppet hanging forever after executing the following (line numbers prepended for easy commenting later):
1 10764 pipe([5, 7]) = 0
2 10764 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cf0b18) = 10833
3 10833 close(5
4 10764 close(7
5 10833 <... close resumed> ) = 0
6 10764 <... close resumed> ) = 0
7 10833 dup2(7, 1
8 10764 fcntl64(5, F_GETFL
9 10833 <... dup2 resumed> ) = 1
10 10764 <... fcntl64 resumed> ) = 0 (flags O_RDONLY)
11 10833 close(7
12 10764 fstat64(5,
13 10833 <... close resumed> ) = 0
14 10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
15 10833 close(3
16 10764 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0
17 10833 <... close resumed> ) = 0
18 10764 <... mmap2 resumed> ) = 0xb7843000
19 10833 close(4
20 10764 _llseek(5, 0,
21 10833 <... close resumed> ) = 0
22 10764 <... _llseek resumed> 0xbf7f5270, SEEK_CUR) = -1 ESPIPE (Illegal seek)
23 10833 close(5
24 10764 fstat64(5,
25 10833 <... close resumed> ) = -1 EBADF (Bad file descriptor)
26 10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
27 10833 close(6) = 0
28 10833 close(7
29 10764 select(6, [5], [], [], {0, 16463}
30 10833 <... close resumed> ) = -1 EBADF (Bad file descriptor)
31 10833 close(8) = -1 EBADF (Bad file descriptor)
32 10833 close(9) = -1 EBADF (Bad file descriptor)
33 10833 close(10) = -1 EBADF (Bad file descriptor)
34 10833 close(11) = -1 EBADF (Bad file descriptor)
35 [...]
36 10833 close(253) = -1 EBADF (Bad file descriptor)
37 10833 close(254) = -1 EBADF (Bad file descriptor)
38 10833 close(255) = -1 EBADF (Bad file descriptor)
39 10833 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
40 10833 brk(0) = 0x8058000
41 10833 uname({sys="Linux", node="beczulka", ...}) = 0
42 [...]
43 10833 write(1, "eth0 Link encap:Ethernet H"..., 378
44 10764 <... select resumed> ) = 1 (in [5], left {0, 40000})
45 10833 <... write resumed> ) = 378
46 10833 exit_group(0) = ?
47 10764 --- SIGCHLD (Child exited) @ 0 (0) ---
48 10764 select(6, [5], [], [], {0, 44385}) = 1 (in [5], left {0, 44385})
49 10764 read(5, "eth0 Link encap:Ethernet H"..., 1024) = 378
50 10764 select(6, [5], [], [], {0, 44025}) = 1 (in [5], left {0, 44025})
51 10764 read(5, "", 1024) = 0
52 10764 close(5) = 0
53 10764 munmap(0xb7843000, 4096) = 0
54 10764 rt_sigaction(SIGHUP, {SIG_IGN}, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
55 10764 rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb7f438f0, [QUIT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
56 10764 rt_sigaction(SIGINT, {SIG_IGN}, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
57 10764 waitpid(10833, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 10833
58 10764 rt_sigaction(SIGHUP, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
59 10764 rt_sigaction(SIGQUIT, {0xb7f438f0, [QUIT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
60 10764 rt_sigaction(SIGINT, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
61 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
62 10764 pipe([5, 7]) = 0
63 10764 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cf0b18) = 10834
64 10834 close(5
65 10764 close(7
66 10834 <... close resumed> ) = 0
67 10764 <... close resumed> ) = 0
68 10834 dup2(7, 1
69 10764 fcntl64(5, F_GETFL
70 10834 <... dup2 resumed> ) = 1
71 10764 <... fcntl64 resumed> ) = 0 (flags O_RDONLY)
72 10834 close(7
73 10764 fstat64(5,
74 10834 <... close resumed> ) = 0
75 10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
76 10834 close(3
77 10764 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0
78 10834 <... close resumed> ) = 0
79 10764 <... mmap2 resumed> ) = 0xb7843000
80 10834 close(4
81 10764 _llseek(5, 0,
82 10834 <... close resumed> ) = 0
83 10764 <... _llseek resumed> 0xbf7f6510, SEEK_CUR) = -1 ESPIPE (Illegal seek)
84 10834 close(5
85 10764 fstat64(5,
86 10834 <... close resumed> ) = -1 EBADF (Bad file descriptor)
87 10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
88 10834 close(6) = 0
89 10834 close(7
90 10764 select(6, [5], [], [], {0, 32148}
91 10834 <... close resumed> ) = -1 EBADF (Bad file descriptor)
92 10834 close(8) = -1 EBADF (Bad file descriptor)
93 10834 close(9) = -1 EBADF (Bad file descriptor)
94 10834 close(10) = -1 EBADF (Bad file descriptor)
95 [...]
96 10834 close(253) = -1 EBADF (Bad file descriptor)
97 10834 close(254) = -1 EBADF (Bad file descriptor)
98 10834 close(255) = -1 EBADF (Bad file descriptor)
99 10834 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
100 10834 brk(0) = 0x8058000
101 10834 uname({sys="Linux", node="beczulka", ...}) = 0
102 10834 write(1, "lo Link encap:Local Loopb"..., 781
103 10764 <... select resumed> ) = 1 (in [5], left {0, 0})
104 10834 <... write resumed> ) = 781
105 10834 exit_group(0) = ?
106 10764 --- SIGCHLD (Child exited) @ 0 (0) ---
107 10764 select(6, [5], [], [], {0, 0}) = 1 (in [5], left {0, 0})
108 10764 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 10834
109 10764 waitpid(-1, 0xbf7ee868, WNOHANG) = 0
110 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
111 10764 read(5, "lo Link encap:Local Loopb"..., 1024) = 781
112 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
113 10764 select(6, [5], [], [], {0, 59201}) = 1 (in [5], left {0, 59201})
114 10764 read(5, "", 1024) = 0
115 10764 close(5) = 0
116 10764 munmap(0xb7843000, 4096) = 0
117 10764 rt_sigaction(SIGHUP, {SIG_IGN}, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
118 10764 rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb7f438f0, [QUIT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
119 10764 rt_sigaction(SIGINT, {SIG_IGN}, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
120 10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
121 10764 select(0, [], [], [], {0, 57695}) = 0 (Timeout)
122 10764 select(0, [], [], [], {0, 4040}) = 0 (Timeout)
123 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
124 10764 waitpid(-1, 0xbf7ee868, WNOHANG) = 0
125 10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
126 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
127 10764 select(0, [], [], [], {0, 59357}) = 0 (Timeout)
128 10764 select(0, [], [], [], {0, 920}) = 0 (Timeout)
129 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
130 10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
131 10764 waitpid(-1, 0xbf7ee868, WNOHANG) = 0
132 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
133 10764 select(0, [], [], [], {0, 59303}) = 0 (Timeout)
134 10764 select(0, [], [], [], {0, 901}) = 0 (Timeout)
135 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
136 10764 waitpid(-1, 0xbf7ee868, WNOHANG) = 0
137 10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
138 10764 select(0, [], [], [], {0, 0}) = 0 (Timeout)
139 10764 select(0, [], [], [], {0, 59654}) = 0 (Timeout)
[...] goes on forever
My comments: * lines 1-38 is basic pipe setup/fork/close all FDs * 39-46: “ifconfig wlan0” does its thing * 47-60: puppet reads ifconfig’s output first, and then reaps it
So far, so good – all is well. But now notice what happens on the following lines: the command this time is just “ifconfig”, and all looks similar until line 106. However then: * line 107: there the timeout in select() is zero, * line 108: the child is reaped straight away, * line 111: only now the output is read, * line 120: puppet makes an attempt to reap the same child again!
The circumstances are slightly different, because in the first case ECHILD is returned and reported, but in the second case, no error is returned, and puppet waits forever for a child with PID 10834 to exit, which will never happen.
However I think that the reason for a race might be the same – some generic SIGCHLD handler that comes in and reaps the child before some other code can call wait()?
FYI grepping through the log reveals that the following execve()s were called before it hung:
10766 execve("/usr/bin/which", ["which", "which"], [/* 21 vars */]) = 0
10768 execve("/usr/bin/which", ["which", "uname"], [/* 21 vars */]) = 0
10769 execve("/bin/uname", ["uname", "-s"], [/* 18 vars */]) = 0
10771 execve("/usr/bin/which", ["which", "uname"], [/* 21 vars */]) = 0
10772 execve("/bin/uname", ["uname", "-r"], [/* 18 vars */]) = 0
10774 execve("/usr/bin/which", commit:"which", "hostname"], [/* 21 vars */) = 0
10775 execve("/bin/hostname", ["hostname"], [/* 18 vars */]) = 0
10777 execve("/usr/bin/which", ["which", "dnsdomainname"], [/* 21 vars */]) = 0
10778 execve("/bin/dnsdomainname", ["dnsdomainname"], [/* 18 vars */]) = 0
10780 execve("/usr/bin/which", ["which", "domainname"], [/* 21 vars */]) = 0
10782 execve("/usr/bin/which", commit:"which", "sendmail"], [/* 21 vars */) = 0
10784 execve("/usr/bin/which", ["which", "dnsdomainname"], [/* 21 vars */]) = 0
10785 execve("/bin/dnsdomainname", ["dnsdomainname"], [/* 18 vars */]) = 0
10787 execve("/usr/bin/which", ["which", "domainname"], [/* 21 vars */]) = 0
10789 execve("/usr/bin/which", ["which", "lsb_release"], [/* 21 vars */]) = 0
10791 execve("/usr/bin/lsb_release", commit:"lsb_release", "-a"], [/* 21 vars */) = 0
10792 <... execve resumed> ) = 0
10793 execve("/usr/bin/apt-cache", ["apt-cache", "policy"], [/* 21 vars */]) = 0
10796 <... execve resumed> ) = 0
10797 execve("/usr/bin/dpkg-query", ["dpkg-query", "-f", "${Version} ${Provides}\n", "-W", "lsb-core", "lsb-cxx", "lsb-graphics", "lsb-desktop", "lsb-qt4", "lsb-languages", "lsb-multimedia",
+"lsb-printing"], [/* 21 vars */]) = 0
10802 execve("/sbin/ifconfig", ["/sbin/ifconfig", "-a"], [/* 18 vars */]) = 0
10804 execve("/usr/sbin/dmidecode", commit:"/usr/sbin/dmidecode"], [/* 21 vars */) = 0
10805 execve("/sbin/ifconfig", ["/sbin/ifconfig", "-a"], [/* 18 vars */]) = 0
10806 execve("/sbin/ip", ["/sbin/ip", "link", "show", "wmaster0"], [/* 18 vars */]) = 0
10807 execve("/sbin/ifconfig", ["/sbin/ifconfig", "wmaster0"], [/* 18 vars */]) = 0
10808 execve("/sbin/ip", ["/sbin/ip", "link", "show", "eth0"], [/* 18 vars */]) = 0
10809 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
10811 execve("/usr/bin/which", ["which", "uptime"], [/* 21 vars */]) = 0
10812 execve("/usr/bin/uptime", ["uptime"], [/* 18 vars */]) = 0
10813 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
10816 execve("/usr/bin/which", ["which", "host"], [/* 21 vars */]) = 0
10817 execve("/usr/bin/host", ["host", "beczulka"], [/* 18 vars */]) = 0
10821 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
10823 execve("/usr/bin/which", ["which", "grep"], [/* 21 vars */]) = 0
10826 execve("/bin/grep", ["grep", "physical id", "/proc/cpuinfo"], [/* 21 vars */]
10826 <... execve resumed> ) = 0
10827 execve("/usr/bin/cut", ["cut", "-d:", "-f", "2"], [/* 21 vars */]
10827 <... execve resumed> ) = 0
10828 execve("/usr/bin/sort", ["sort", "-u"], [/* 21 vars */]
10828 <... execve resumed> ) = 0
10829 execve("/usr/bin/wc", ["wc", "-l"], [/* 21 vars */]
10829 <... execve resumed> ) = 0
10830 execve("/sbin/ip", ["/sbin/ip", "link", "show", "wlan0"], [/* 18 vars */]) = 0
10831 execve("/sbin/ifconfig", ["/sbin/ifconfig", "wlan0"], [/* 18 vars */]) = 0
10832 execve("/sbin/ip", ["/sbin/ip", "link", "show", "eth0"], [/* 18 vars */]) = 0
10833 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
10834 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
This is puppet 0.24.8-1gg4 (the differences between this and Debian 0.24.8-1 only seem to touch packaging).
Updated by Marcin Owsiany almost 3 years ago
Here’s a minimal test case I could come up with, 100% repeatable for me:
$ touch /tmp/a
$ echo 'file { "/tmp/a": content => "a" }' > a.pp
$ sudo puppet --noop --debug --trace a.pp
debug: Creating default schedules
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: File does not exist
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: Changing ensure
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: 1 change(s)
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]/ensure: is absent, should be directory (noop)
debug: Finishing transaction -607699778 with 1 changes
debug: //File[/tmp/a]/checksum: Initializing checksum hash
debug: //File[/tmp/a]: Creating checksum {md5}d41d8cd98f00b204e9800998ecf8427e
debug: //File[/tmp/a]/content: Executing 'diff /tmp/a /tmp/puppet-diffing.21138.0'
0a1
> a
\ No newline at end of file
debug: //File[/tmp/a]: Changing content
debug: //File[/tmp/a]: 1 change(s)
notice: //File[/tmp/a]/content: is {md5}d41d8cd98f00b204e9800998ecf8427e, should be {md5}0cc175b9c0f1b6a831c399e269772661 (noop)
debug: Finishing transaction -606708568 with 1 changes
- so far OK.
- now shut down the network
$ sudo puppet --noop --debug --trace a.pp
Timed out seeking value for ipaddress
Timed out seeking value for ipaddress
Timed out seeking value for ipaddress
Timed out seeking value for ipaddress
warning: Could not retrieve fact ipaddress
warning: Could not retrieve fact fqdn
debug: Creating default schedules
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: File does not exist
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: Changing ensure
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]: 1 change(s)
debug: /Settings[/etc/puppet/puppet.conf]/Settings[filebucket]/File[/var/puppet/clientbucket]/ensure: is absent, should be directory (noop)
debug: Finishing transaction -606479048 with 1 changes
debug: //File[/tmp/a]/checksum: Initializing checksum hash
debug: //File[/tmp/a]: Creating checksum {md5}d41d8cd98f00b204e9800998ecf8427e
debug: //File[/tmp/a]/content: Executing 'diff /tmp/a /tmp/puppet-diffing.21345.0'
/usr/lib/ruby/1.8/puppet/util.rb:290:in `waitpid2'
/usr/lib/ruby/1.8/puppet/util.rb:290:in `execute'
/usr/lib/ruby/1.8/puppet/util/diff.rb:12:in `diff'
/usr/lib/ruby/1.8/puppet/util/diff.rb:67:in `string_file_diff'
/usr/lib/ruby/1.8/puppet/type/file/content.rb:48:in `insync?'
/usr/lib/ruby/1.8/puppet/type.rb:965:in `propertychanges'
/usr/lib/ruby/1.8/puppet/util/checksums.rb:17:in `find_all'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `each'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `find_all'
/usr/lib/ruby/1.8/puppet/type.rb:963:in `propertychanges'
/usr/lib/ruby/1.8/puppet/type/file.rb:1123:in `propertychanges'
/usr/lib/ruby/1.8/puppet/type.rb:833:in `evaluate'
/usr/lib/ruby/1.8/puppet/transaction.rb:60:in `apply'
/usr/lib/ruby/1.8/puppet/transaction.rb:239:in `eval_resource'
/usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
/usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
/usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
/usr/lib/ruby/1.8/puppet/transaction.rb:238:in `eval_resource'
/usr/lib/ruby/1.8/puppet/transaction.rb:310:in `evaluate'
/usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
/usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
/usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
/usr/lib/ruby/1.8/puppet/transaction.rb:309:in `evaluate'
/usr/lib/ruby/1.8/puppet/transaction.rb:303:in `collect'
/usr/lib/ruby/1.8/puppet/transaction.rb:303:in `evaluate'
/usr/lib/ruby/1.8/puppet/node/catalog.rb:124:in `apply'
/usr/bin/puppet:233
err: //File[/tmp/a]: Failed to retrieve current state of resource: No child processes
debug: Finishing transaction -613159068 with 0 changes
Updated by Luke Kanies about 2 years ago
- Assignee changed from Luke Kanies to Markus Roberts
Updated by Matt Robinson over 1 year ago
Pairing with William just now this doesn’t seem to be an issue anymore… Not sure why.
Updated by Paul Nasrat over 1 year ago
Matt/William – could you be a bit more specific on what you tested, on what OS/versions. The behaviour is still clearly wrong in facter however it may be fiddly to reproduce.
Updated by Michael Scherer about 1 year ago
I can replicate this when puppet is run for some time ( restarting usually make the problem disappear, but it come back later ). I run it on 2 mandriva 2010.2 server, one in a vm, another one on a quite loaded server ( iso creation ). On one of the server that I tried to inspect, there was 2 puppet process, one of them being waiting on exec(‘hostname’) to return.
# rpm -q puppet facter
puppet-2.6.2-1mdv2010.1
facter-1.5.7-1mdv2010.0
I can update to latest stable release if needed.
Updated by Michael Scherer about 1 year ago
I started to notice this on a 3rd server, and the issue occurs ( thanks to nagios to monitor it ) around 4h in the morning, just when the logrotate job is running. So I suspect the issue is linked to the load and I/O. While this will not be a proper fix for the core issue, I will see if the patch I sent for 7138 ( for not using exec for hostname ) lessen the problem for me.
Updated by Michael Stahnke 11 months ago
- Target version changed from 1.6.0 to 1.6.x
Updated by Adrien Thebo 7 months ago
- Assignee changed from Markus Roberts to Adrien Thebo
Updated by Adrien Thebo 6 months ago
- File deleted (
sigchld.patch)
Updated by Adrien Thebo 6 months ago
- File deleted (
sigchld.patch)
Updated by John Florian about 1 month ago
Matt Robinson wrote:
Pairing with William just now this doesn’t seem to be an issue anymore… Not sure why.
I have to disagree. I am still seeing this as of:
facter-1.6.6-1.fc16.i686 puppet-2.6.14-1.fc16.noarch
I’m afraid I’m losing confidence in puppet/facter as a technology if such a critical issue hasn’t been tackled in 3 years now, especially given the near solution I provided early on. I don’t mean to disparage anyone who’s worked on this issue — I truly appreciate the efforts — but I would really expected to have a solution shortly after I made comment 12. I really don’t want to learn ruby too, I’ve enough on my plate already. Nor do I really relish the idea of abandoning puppet after I’ve given it this much commitment, but at some point I must start considering other options here.
Updated by Daniel Grafe about 1 month ago
We were also facing this issue. Since we are initially setting up our machines locally with puppet agent we don’t have a working network infrastructure (DNS resolving, interfaces, static host file…). This is causing the host tool to run for a longer time than the fact timeout allows and the greedy “block on waitall” thread is created. This thread should collect only Zombies but it is also collecting the child processes from the following puppet run. Puppet then fails on waitpid2 and terminates with an exception.
Please find attached our patch providing a workaround. It is basically realizing the exec after forking so we have a pid. The pid is saved in a class variable (big ugly hack!) so we can use this for waitpid in the case of a timeout.
This workaround is probably not the final solution for this issue and the exec for windows is still untested since we’re running on linux only.
Updated by Adrien Thebo 11 days ago
- Assignee deleted (
Adrien Thebo)