Bug #3013

util.rb:execute broken on Ruby <1.8.3

Added by Ricky Zhou about 2 years ago. Updated almost 2 years ago.

Status:Closed Start date:01/07/2010
Priority:Urgent Due date:
Assignee:Markus Roberts % Done:

0%

Category:exec
Target version:0.25.3
Affected Puppet version:0.25.2 Branch:http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3013
Keywords:
Votes: 0

Description

Apparently the patch in ticket #2731 introduced one more issue by using readpartial, which isn’t available until ruby 1.8.3 (RHEL4 at least is affected). I’m not sure how this is normally handled in ruby, but if the readpartial function is not available, the code should fall back to sysread (along with some code for handling EINTR).

Anybody with better ruby knowledge know how this should be done?

3013_test_2.patch (5.3 kB) Markus Roberts, 01/08/2010 09:00 pm


Related issues

related to Puppet - Bug #2731: problem communicating with processes in SELinux Closed 10/16/2009
related to Puppet - Bug #2997: Puppet hangs on executable under HP-UX Closed 12/30/2009
related to Puppet - Bug #1563: [PATCH] Change Util::Execute to use pipes instead of temp... Duplicate 09/07/2008
related to Puppet - Bug #3025: apt and aptitude providers dont work on Debian Lenny / pu... Closed 01/10/2010
related to Puppet - Bug #3023: Tests needed to show that "exec"ed scripts don't get pare... Closed 01/08/2010
related to Puppet - Bug #3033: Using temporary files in /tmp to comunicate with child pr... Investigating 01/12/2010

History

Updated by James Turnbull about 2 years ago

  • Category set to exec
  • Status changed from Unreviewed to Accepted
  • Assignee set to Markus Roberts
  • Priority changed from Normal to Urgent
  • Target version set to 0.25.3

Updated by Ricky Zhou about 2 years ago

Here is an initial patch that just switches to sysread, although it would probably be better to only fall back to sysread when readpartial is not available. Unfortunately, my development setup is unavailable right now, so this is completely untested.

http://ricky.fedorapeople.org/0001-Do-not-use-readpartial-for-reading-from-pipe.patch

Updated by Markus Roberts about 2 years ago

I’m presently trying to narrow down which version are affected and establish a test setup. I explored a number of potential mitigation strategy idea last night, all of them pretty hair brained, none of them particularly effective.

I’m thinking this means we’ll need to be pushing a 0.25.3 (or a 0.25.2.1) in the next day or so. :(

Updated by Markus Roberts about 2 years ago

  • Status changed from Accepted to Needs More Information
  • Assignee changed from Markus Roberts to Ricky Zhou

Ricky —

I’m trying to validate that this is how we want to handle it (as well as looking for any other related problems) and I’m curious about the EINTR rescue:

  •                  rescue Errno::EINTR
    
  •                      # Interrupted syscall, try again
    

Specifically, what circumstances are you envisioning where EINTR would be raised by the sysread and we would still want to retry? I’m not seeing how to get there, and my attempts to cause it have failed.

Updated by Ricky Zhou about 2 years ago

When writing the sysread version, I looked at the differences between sysread and readpartial outlined at http://ruby-doc.org/core/classes/IO.html#M002269. One of those differences is that readpartial takes care of restarting on EINTR (I did not need to do the same for EAGAIN because that only happens with O_NONBLOCK enabled, which is not the case with IO.pipe).

EINTR would happen if the read syscall were interrupted by a signal. I wouldn’t know how to get the timing right to actually produce this on purpose though, but it would be unnecessary to just fail when it’s possible to retry the syscall.

Updated by Markus Roberts about 2 years ago

  • Status changed from Needs More Information to Accepted
  • Assignee changed from Ricky Zhou to Markus Roberts

On IRC Ricky came up with this example of sysread returning an EINTR

#!/usr/bin/ruby
trap(:USR1) do
    puts "Caught SIGUSR1"
end 

r,w = IO.pipe
puts Process.pid
r.sysread(1)

and also noted that the same problem, at least potentially, exists in the SELinux code (where it reads /proc/mounts).

Updated by Markus Roberts about 2 years ago

  • Status changed from Accepted to In Topic Branch Pending Review
  • Branch set to http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3013

I have a test fix for this up on http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3013, which incorporates Ricky’s change and points that came up in discussion on IRC. I’d like to get it tested as thoroughly as possible so we can get a 0.25.3 out in the next day or two.

Things I’d like confirmed:

  • Execs that capture output (e.g., using yum, rpm, etc.) work under ruby 1.8.1 with this patch.
  • Ditto on other version of ruby (I’m especially interested in 1.8.2, which is an edge-case for this problem; if you can test under 1.8.2 please report the patch level)
  • Execs that capture more than 4k worth of output, ditto
  • Testing this under HPUX to make sure we haven’t reverted #2997
  • General pounding on it to make sure we haven’t just moved the problem

If you can test any of these, it would be greatly appreciated; please post your results to the ticket.

In the meantime, I’m going to be trying to add so spec-tests to go with the patch, to reduce the chance we ever fall in this particular pit again.

Updated by James Turnbull about 2 years ago

  • File _3013.patch added

Updated by Mark Plaksin about 2 years ago

[Edited by Markus]

Mark tested on RHEL 4.7 and reported a set of errors that appear unrelated to this patch and the bug(s) behind it. I’ve moved his report to a new ticket (#3019) to begin investigating those issues in parallel.

Updated by Markus Roberts about 2 years ago

Revised patch & branch with a start at tests and tweaks for issues discovered during testing.

Updated by Markus Roberts about 2 years ago

  • File deleted (_3013.patch)

Updated by Markus Roberts about 2 years ago

Here’s the patch, if you prefer using that to the git branch; note that this patch only makes sense applied to 0.25.2, as it’s attempting to fix a set of closely related bugs in that release.

Updated by Ricky Zhou about 2 years ago

A few comments on the patch:

Why was exit replaced with exit!? Previously, the use of exit! caused a bug where the output from the “puts detail” was lost because stdout was not flushed with exit!. I mentioned a case where this previously showed up in comment 11 on ticket #2731. As a simpler example, the below program gives no output on my machine.

#!/usr/bin/ruby
print 'Flushed.'
#exit
exit!

Also, the second exit after the begin block should probably be exit(1), or maybe even something akin to assert(false) in ruby, since things are definitely not right if execution gets past that block (in fact, is it even possible at all)?

One last thing – the += in the /proc/mounts code should probably be changed to <<, since that doesn’t create a new string for each iteration of the loop.

Updated by Markus Roberts about 2 years ago

Why was exit replaced with exit!?

Because calling exit calls any (parental) at_exit routines in the child process when the child exits; this can cause all manner of confusion. This is one of the ones that I caught in testing. Forked children should always use exit! in ruby.

Previously, the use of exit! caused a bug where the output from the “puts detail” was lost because stdout was not flushed with exit!. I mentioned a case where this previously showed up in comment 11 on ticket #2731. As a simpler example, the below program gives no output on my machine.

!/usr/bin/ruby

print ‘Flushed.’

exit

exit!

Ah. That explains why the change was made. Looking into that was on my things-to-do list for this afternoon.

The correct solution is to do a

$stdout.flush

before the exit, as that’s all we’re really wanting out of the at_exit code.

Also, the second exit after the begin block should probably be exit(1), or maybe even something akin to assert(false) in ruby, since things are definitely not right if execution gets past that block (in fact, is it even possible at all)?

That was an artifact of my trying to untangle what was going on with the multiple execution of the at_exit code from a problem with the trapping of Errno:XXXX errors. It should be removed and the rescue clause above replaced with:

rescue Object => detail

One last thing – the += in the /proc/mounts code should probably be changed to <<, since that doesn’t create a new string for each iteration of the loop.

I suppose, yes, although the memory allocation patterns aren’t significantly different IIRC—you save the object header but most of the turnover is in the buffer, and that’s still going to go through roughly the same allocation/GC process.

I’ll make sure these changes are in the patch; what systems did you manage to test on, and did you find anything else?

Updated by Ricky Zhou about 2 years ago

Markus Roberts wrote:

The correct solution is to do a

$stdout.flush

before the exit, as that’s all we’re really wanting out of the at_exit code.

Ah, sounds good to me.

I’ll make sure these changes are in the patch; what systems did you manage to test on, and did you find anything else?

I’ve got nothing else. I was only able to test on a RHEL 5.4 machine (ruby-1.8.5-5.el5_3.7) for now, and it worked fine there.

Updated by Markus Roberts about 2 years ago

I’ve updated the branch with the changes above and improved tests.

On the basis of those tests, I’m reluctantly leaning towards saying we should remove the whole pipe based communication patch to produce 0.25.3.

The problem is that, with or without readpartial it still hangs when the child process is misbehaved (see ticket #1563 comment #7 for the real world case):

Puppet::Util execute when readpartial is available – should be resilient to asynchronous signals – should not mind if the other end closes the pipe before completing – should not hang if the child process is misbehaved (FAILED – 1)

Puppet::Util execute when readpartial is unavailable – should be resilient to asynchronous signals – should not mind if the other end closes the pipe before completing – should not hang if the child process is misbehaved (FAILED – 2)

1) Timeout::Error in ‘Puppet::Util execute when readpartial is available should not hang if the child process is misbehaved’ execution expired spec/unit/util.rb:91: spec/unit/util.rb:90: /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:22:in run' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:ineach' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:in `run' spec/unit/util.rb:5:

2) Timeout::Error in ‘Puppet::Util execute when readpartial is unavailable should not hang if the child process is misbehaved’ execution expired spec/unit/util.rb:91: spec/unit/util.rb:90: /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:22:in run' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:ineach' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:in `run' spec/unit/util.rb:5:

The file version passes these tests (but does drop some signals—haven’t worked out why in detail yet, but I suspect it’s a granularity issue).

I’m not seeing any good way around this. I’m not giving up, but I’m not optimistic either.

— Markus

Updated by Markus Roberts about 2 years ago

Belay that pessimism. I think if we combine the child termination trap with the eof trap there may be a sane (if somewhat recondite) way to thread this needle. I’ll try to code it up and report back.

Updated by Ricky Zhou about 2 years ago

Markus Roberts wrote:

Belay that pessimism. I think if we combine the child termination trap with the eof trap there may be a sane (if somewhat recondite) way to thread this needle. I’ll try to code it up and report back.

Ugh. I spent the last hour trying to think up solutions with SIGCHLD handlers, select, and other fun. Unfortunately, one problem I don’t think we can avoid is that even if we manage to prevent the indefinite blocking read, as soon as we close the read end of the pipe and the misbehaved daemon tries to write anything to stdout (the write end), it will get a SIGPIPE (and probably get terminated as a result).

Updated by Markus Roberts about 2 years ago

Yeah. There’s a reason they put up the huge “Abandon all hope” sign over this stuff.

As for the SIGPIPE problem, I’m thinking that daemons writing to stdout is frowned upon, but that isn’t a solution it’s just shifting the blame.

My SIGCHLD idea and a half dozen variation have all gotten nowhere, with the complications you’d expect and one I hadn’t thought of, at least explicitly: we aren’t dealing with the OS directly and some of the cleverer tricks also run afoul of ruby’s assumptions.

I haven’t exhausted all the permutations, but one way or another I’m going to run out of options before too much longer.

Updated by Markus Roberts about 2 years ago

Status update: I’ve got a version that passes everything but the new SIGPIPE test, which I think may be doomed.

Updated by Markus Roberts about 2 years ago

I have again updated the branch; this version includes a test for the SIGPIPE issue (which it fails) and the other edge cases I’ve been able to glean from these tickets (23 examples in total) which it passes. In some of these cases it involves transforming aberrant condition into other which are more manageable but no more “correct” (e.g. turning a hang into a failure); the exact nature of these mappings (which implicitly occur in the old code as well) should probably be addressed at some point.

I have a potential solution to the SIGPIPE problem (not included in the posted code) which is not without cost. Specifically, if we detect the case and do not close the read-pipe (our end of the child’s output) we can avoid SIGPIPEing the grandchild (at least until we exit), but in doing so we incur the risk of collecting (many?) orphaned pipes (there’s also the complexity of GC protection).

The posted version, however, does not cause problems unless the grandchild writes to the pipe (there is a test to verify this). So I keep coming back to my earlier point that daemons, as a general rule, are so far as I know expected to refrain from writing to STDOUT, and deserve what they get if they try. So the question is, how should this concern be prioritized? Is it something that we should go to extraordinary lengths to protect against, or is it something we can safely ignore?

Updated by Ricky Zhou about 2 years ago

Here is an attempt which should pass the SIGPIPE test by ignoring SIGPIPE in the child process. This causes attempts to write to stdout/stderr to fail with EPIPE, but the program should be terminated as a result. I think that is acceptable behavior, as something very similar happens when these badly behaved daemons are started on a terminal that is then closed (any attempts to write to the terminal will fail with EIO).

However, I’m still not totally happy with the solution – what if the daemon happens to have its own specialized SIGPIPE handler or resets the signal handler itself? Basically, if we’re accounting for absolutely any sort of software, including stuff which doesn’t daemonize properly, still writes to stdout/stderr, and messes with signals, I can’t think of any way to make them all work.

If you’re rushing to get 0.25.3 out and working with RHEL4, I’m fine with reverting to the tempfile method and then looking deeper into other approaches later.

Anyway, here’s the diff against d4e92861a5dca9ed7829b094eb13eaaedecebf46 on your ticket branch. This method uses select and waitpid with WNOHANG to stop trying to read once the child exits.

diff --git a/lib/puppet/util.rb b/lib/puppet/util.rb
index 71d6a1f..4a2e260 100644
--- a/lib/puppet/util.rb
+++ b/lib/puppet/util.rb
@@ -266,6 +266,7 @@ module Util
         child_pid = Kernel.fork
         $VERBOSE = oldverb
         if child_pid
+            # Parent process executes this
             output_write.close
 
             # Read output in if required
@@ -274,7 +275,17 @@ module Util
                 begin
                     method = output_read.respond_to?(:readpartial) ? :readpartial : :sysread
                     begin
-                        output << output_read.send(method,4096)
+                        if select([output_read], nil, nil, 0.1)
+                            # Make sure that there is data to be read to avoid
+                            # indefinitely blocking reads.
+                            output << output_read.send(method, 4096)
+                        elsif Process.waitpid(child_pid, Process::WNOHANG)
+                            # Once the child has exited, immediately stop
+                            # attempting to read, even if EOF has not been
+                            # reached.
+                            child_status = $?.exitstatus
+                            break
+                        end
                     rescue Errno::EINTR
                         # Just popping out to handle an unrelated signal, but we're not done
                     end while true
@@ -285,9 +296,10 @@ module Util
                 end
             end
 
-            # Parent process executes this
-            Process.waitpid(child_pid)
-            child_status = $?.exitstatus
+            if ! child_status
+                Process.waitpid(child_pid)
+                child_status = $?.exitstatus
+            end
         else
             # Child process executes this
             Process.setsid
@@ -299,6 +311,13 @@ module Util
                     $stdout.reopen('/dev/null', 'w')
                     $stderr.reopen('/dev/null', 'w')
                 else
+                    # Ignore SIGPIPE.  This allows daemons which do not
+                    # properly close stdout/stderr to receive EPIPE on
+                    # attempting to write to the pipe instead of being
+                    # terminated with a SIGPIPE.  See ticket #1563,
+                    # comment 7 and ticket #3013 for details.
+                    trap('SIGPIPE', 'SIG_IGN')
+
                     $stdout.reopen(output_write)
                     if arguments[:combine]
                         $stderr.reopen(output_write)

Updated by Ricky Zhou about 2 years ago

Ricky Zhou wrote:

Here is an attempt which should pass the SIGPIPE test by ignoring SIGPIPE in the child process.

Oops, I just read the test, and this version will actually not pass the test, because the test uses its own SIGPIPE handler, which will override the SIG_IGN. It will prevent bad daemons without their own SIGPIPE handlers from being terminated though.

Updated by Markus Roberts about 2 years ago

I haven’t entirely digested your patch but there are a few points I should probably clarify to make sure we’re on the same page:

  • In the test, there are actually two SIGPIPEs going on: one, which we’d like to prevent, which the grandchild (represented by the fork block in the exec_test_helper) receives when it tries to write to the closed pipe the second, which we don’t, which is used in this testing rig for the program acting as a mock or surrogate for the misbehaved daemon to indicate to the main test program that it received a SIGPIPE
  • I’m assuming that doing anything to affect the behavior of the client is out of scope; if we’re calling out to a ill-behaved process we need to cope with it
  • In my experimentation (also noted on one of the earlier tickets) select is ineffective with pipes; they always return as ready. I have not investigated the cause of this, but have confirmed it.
  • I would not expect that trapping SIGPIPE (or anything else) in the code before a Kernel.exec() would have the desired effect; if I am not mistaken all this goes away when the process is replaced by the exec.

Updated by Markus Roberts about 2 years ago

I like the WNOHANG idea, and have incorporated it into my next round; it greatly simplifies the abandon pipes of ill-behaved children idea.

So we now have a version that passes all the tests (translation: “it’s time to write more tests”), but what we really need is to decide how far we want to go to deal with exec tasks that launch psychotic daemons. We may be dealing with a real issue, or we may be fading off into hypothetical situations that will never arise.

If they’re chimeral (specifically if they wouldn’t ever write to the pipe, but just hold on to it in error) we should probably just close our end and move on.

If it is a real issue the best approach might be to fight fire with fire and pass our end of the pipe off to a new (detached) process that waits for the prodigal daemons to either close the pipe or die, and then closes our end and exits cleanly.

Any idea what pragmatic situation with regards to such daemons is, our how we might ascertain it?

P.S. I’ve updated the branch, and am now going to write more tests.

Updated by Markus Roberts about 2 years ago

New branch up, with my best approximation of a solution. After researching the osirid problem (our worst known case) and such similar bug reports as google provides an inquirer, it appears that the SIGPIPE problem is not likely to be an issue. Most daemons do not retain the caller’s STDOUT, and those do generally do not write to it (they retain it as an oversight, not with the intention to use it). Daemon that retain the caller’s STDOUT and attempt o write to it after the caller has ceased to exist without checking for the possible failure (if any exist) would generally be considered buggy to the point that we may be forgiven for not accommodating them.

Get the branch here & try it and/or see the dev-list for the commit and any ensuing discussion.

— Markus

P.S. If anyone has access to a Debian Etch box and would be willing to try and reproduce the osirid problem, I would be very interested in the results.

Updated by Ricky Zhou about 2 years ago

Markus Roberts wrote:

  • In my experimentation (also noted on one of the earlier tickets) select is ineffective with pipes; they always return as ready. I have not investigated the cause of this, but have confirmed it.

Are you sure? Here is a test I tried on a Fedora 12 machine with ruby-1.8.6.383-4.fc12.x86_64 and RHEL5 with ruby-1.8.5-5.el5_3.7 which shows select behaving as expected:

[ricky@gamma ~]$ cat test.rb 
#!/usr/bin/ruby
r,w = IO.pipe
puts IO.select([r], nil, nil, 0.1)
w.puts('test')
puts IO.select([r], nil, nil, 0.1)
r.readpartial(4096)
puts IO.select([r], nil, nil, 0.1)
[ricky@gamma ~]$ ruby test.rb
nil
#
nil
  • I would not expect that trapping SIGPIPE (or anything else) in the code before a Kernel.exec() would have the desired effect; if I am not mistaken all this goes away when the process is replaced by the exec.

From the execve manpage:

* POSIX.1-2001 specifies that the dispositions of any signals that are ignored
  or set to the default are left unchanged.  POSIX.1-2001 specifies one
  exception:  if  SIGCHLD  is being ignored, then an implementation may leave
  the disposition unchanged or reset it to the default; Linux does the former.

A test that I ran seemed to show that it did make a difference with the type of daemon that we discussed. I’m also perfectly fine with letting those daemons die a well-deserved death though :–)

I’m reading through your latest patch now. Glad that we’re getting close to a solution for this :–)

Updated by Markus Roberts about 2 years ago

Re: select. That’s interesting. I’ll compare with the test I did and see if I can figure out why the difference.

As for setting SIGPIPE to IGNORE I suppose that’s a reasonable option; it would handle the case of daemons that ignored the issue completely in a reasonable way, without affecting those that wanted to deal with it themselves. No idea at all (or at least none I like in the least) for how to test it though.

I too am glad we seem to be seeing light at the end, etc., and even more glad that it’s a “we”—these things are always a lot easier when you’ve got multiple eyes & brains on the problem.

Updated by Ricky Zhou about 2 years ago

Markus Roberts wrote:

Re: select. That’s interesting. I’ll compare with the test I did and see if I can figure out why the difference.

One thing that might bite people is that once the write end is closed, select will keep returning true, since a read can technically happen without blocking.

As for setting SIGPIPE to IGNORE I suppose that’s a reasonable option; it would handle the case of daemons that ignored the issue completely in a reasonable way, without affecting those that wanted to deal with it themselves. No idea at all (or at least none I like in the least) for how to test it though.

I tested using the following C program:

#include 
#include 

int main(int argc, char **argv) {
    int rc;
    pid_t pid;
    pid = fork();
    if (pid == 0) {
        sleep(5);
        printf("Writing to stderr...\n");
        rc = fprintf(stderr, "Evil text!\n");
        printf("Return code: %d\n", rc);
        for (;;);
    }
    printf("Daemonized (badly), pid %d\n", pid);
    return 0;
}

and this test script:

#!/usr/bin/ruby
require 'puppet'
require 'puppet/util'

puts Puppet::Util.execute(
    ['/path/to/daemontest'],
    :failonfail => true,
    :combine => true,
    :squelch => false
)

When I removed the trap SIGPIPE line, the daemon would die after 5 seconds. With the trap line, the test program survived the broken pipe (and kept running indefinitely). Writing a nice, reproducible test for this would take a bit more thinking though :–)

Updated by Markus Roberts about 2 years ago

Agreed on all counts. I’ll incorporate it and try to come up with a test. One other thing I was (needlessly, as it turns out) concerned about was the effect on other traps (such as the one in the test you mentioned above). Traps in the child fork are independent, so we’re good there as well.

The test’ll be the bitch of it.

Updated by Markus Roberts about 2 years ago

Branch updated to include the defaults IGNORE SIGPIPE trap.

I was able to test it to my satisfaction (independent of but analogous to Ricky’s test) but making something that was portable and actually tested the behavior defeated me. There is a skeleton of the test in the patch, with a pending note explaining why it isn’t a live test.

Updated by Ricky Zhou about 2 years ago

Any luck reproducing that select weirdness? That seems to be the only significant difference between our patches, and I think select would be a little cleaner and simpler than using timeout. Note that if the timeout occurs right after the sysread/readpartial call and before the timer thread finishes, it looks like this will drop data:

output << timeout(1) {
    output_read.send(method,1)
    # timeout happens right here
} while true

I’m also not crazy about reading the output in one byte at a time. That’s a lot of syscall overhead, especially for large outputs.

Updated by Markus Roberts about 2 years ago

I’m re-trying it now.

One point though—unless I’m mistaken, even if the select works we have no assurance that there will be 4096 bytes available, so we could easily hang there waiting for more data (or an EOF) that will never come.

I’m not thrilled with the byte at a time solution either, but at least in the case where we resort to sysread I see no good alternative.

— Markus

P.S. I’ve also been chasing down two other issues, one of which (#3013) appears to be closely related to / identical with the hypothetical reversion of #1563; the branch presently posted on this ticket does not fix #3025.

Updated by Markus Roberts about 2 years ago

I can not reproduce the select problem I was having initially; I though briefly that it might have been associated with my experiments with setting Fcntl::O_NONBLOCK on output_read via will_block = false (from the event-loop extensions) but that didn’t pan out.

Running the tests with the select in place of the timeout I get sporadic errors such as:

1) ‘Puppet::Util execute when readpartial is available should not miss output if the other end doles it out bit at a time’ FAILED expected: “3937\n3937\n3937\n3937\n3937\n”,

 got: "3937\n3937\n3937\n3937\n" (using ==)

spec/unit/util.rb:115: /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:22:in run' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:ineach' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:in `run' spec/unit/util.rb:5:

2) ‘Puppet::Util execute when readpartial is unavailable should be resilient to asynchronous signals’ FAILED expected: >= 150,

 got:    5

spec/unit/util.rb:107: /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:22:in run' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:ineach' /Users/markus/projects/puppet/spec/monkey_patches/add_confine_and_runnable_to_rspec_dsl.rb:17:in `run' spec/unit/util.rb:5:

Setting the read size to 1-byte does not affect them or appear make them more or less likely. Nor does setting output_read.will_block=false.

My bigger concern at this point is #3025; no matter how sure we are that it ought to work, if it reliably locks up in the real world, we’ll have to revert.

If I can find a way to resolve the dropped data problem with the select version (I suspect it’s a race condition with the waitpid) I’ll post it here and ask the user on #3025 to test that as well.

— Markus

Updated by Markus Roberts about 2 years ago

Branch updated to use Ricky’s select, with will_block-false and one retry to get around the race condition mentioned in the previous comment.

Updated by Markus Roberts about 2 years ago

  • Status changed from In Topic Branch Pending Review to Code Insufficient

So all of this appears to be for naught. With any of these versions we have either obvious testable failures (e.g. it loses output), a regression of a broader form of #1563 (not just Etch, and not just osirid—any package on Lenny sems to hang; see #3025 for details), or both. This none of them are usable in the real world.

I’d still like to resolve this but for 0.25.3 we’re going to have to revert the pipe stuff and reanalyze the problem after 0.25.3 is out. Fixing the mechanism described on #1563 (which the branches here demonstrably do) was not sufficient. Therefore, the analysis is inadequate and there’s some other factor we’re missing.

Updated by Ricky Zhou about 2 years ago

Markus Roberts wrote:

One point though—unless I’m mistaken, even if the select works we have no assurance that there will be 4096 bytes available, so we could easily hang there waiting for more data (or an EOF) that will never come.

I’m not thrilled with the byte at a time solution either, but at least in the case where we resort to sysread I see no good alternative.

With a pipe, if there are less than 4096 bytes available, read will return after reading just those bytes, so I don’t think you need to worry about blocking.

Leaving the tempfile for 0.25.3 sounds fine. I’ll try to look into race conditions with the select solution (perhaps we need to do a select + read after the waitpid to empty the buffer as well).

Updated by Markus Roberts about 2 years ago

The version presently posted on this branch (which was just tested) does a final select + read:

                    if select([output_read], nil, nil,0.5)
                        # Make sure that there is data to be read to avoid
                        # indefinitely blocking reads.
                        output << output_read.send(method, 4096)
                    elsif reaped_pid
                        # Once the child has exited, and select has had one more
                        # chance then stop attempting to read, even if EOF has not
                        # been reached.  (see ticket #1563 comment #7)
                        break
                    else
                        reaped_pid = Process.waitpid(child_pid, Process::WNOHANG)
                   end

That’s how I got it to stop dropping data (see comment 35, above).

Updated by Markus Roberts about 2 years ago

  • Status changed from Code Insufficient to Closed

This approach failed in field testing (see #3025).

The issue has been resolved in 0.25.3 by reverting the original patch for #2731 that introduced the pipe based code.

Exploration of the issue behind it 2731, sussing out why the code on this ticket failed, and looking for alternatives has been moved to #3033.

Also available in: Atom PDF