Bug #1947

Puppet still leaks FDs on exec

Added by Nick Phillips about 3 years ago. Updated over 1 year ago.

Status:Closed Start date:02/08/2009
Priority:Normal Due date:
Assignee:Nigel Kersten % Done:

0%

Category:exec
Target version:-
Affected Puppet version:0.24.7 Branch:
Keywords:
Votes: 0

Description

Puppet still leaks FDs on exec.

This can be when restarting a “service” resource or running an “exec” resource; anything that uses util.rb’s functions to actually do the exec.

The line “3.upto(256){|fd| IO::new(fd).close rescue nil}” (introduced in 0.24.6 I believe) is not adequate; FDs numbered above 256 remain open. I now have hosts failing in the middle of puppet catalog runs due to running out of FDs. Looking at one of the services that gets regular puppet-controlled restarts, it has FDs 1-13, and 257->985 open. Restarting it manually makes the problem go away.

Actually knowing which FDs are in use (for communication with the puppetmaster in particular, but also anything else that I haven’t yet noticed) and closing them would probably fix the problem completely; iterating over FDs in the hope that the system you are on doesn’t happen to allow higher-numbered ones and hasn’t been using them if it does, is not a good solution.


Related issues

related to Puppet - Bug #961: puppetd creating too many/not closing TCP connections Closed

History

Updated by James Turnbull about 3 years ago

  • Category set to exec
  • Status changed from Unreviewed to Accepted

See commit:923fd89a2a5d52a6ec9abeb02f6edc01c721fd91 in branch 0.24.x for previous fix.

Updated by Daniel Pittman about 3 years ago

Is there ever a case that puppet wants a file descriptor it opens inherited by a child process?

The right solution is to use O_CLOEXEC or FD_CLOEXEC on the file descriptor. This probably involves overriding the core File / IO methods, since we otherwise rely on all module implementors (A) knowing about the issue, and (B) getting it right.

http://www.ruby-doc.org/stdlib/libdoc/fcntl/rdoc/index.html

@

File.new("/tmp/example")
File.new("/tmp/test").fcntl(Fcntl::F_SETFL, Fcntl::FD_CLOEXEC)

@

Season to taste for other methods of obtaining a file descriptor, of course.

(Windows, naturally, doesn’t support FD_CLOEXEC, but does support O_INHERITABLE which can be managed through the platform API only, god love ‘em.)

Updated by Nick Phillips about 3 years ago

Thinking about it a little, it’s not clear to me why the connections would still be open at the point where e.g. a service is being restarted anyway – if they are left over from downloading file resources from the puppetmaster (seems likely, as this problem has got much worse since I added a couple of deeply-recursive file resources to my config) then surely the connections should be closed by the time we get to restarting any services…

Will dig a bit; this problem is a killer for me at the moment.

Updated by Nick Phillips almost 3 years ago

OK, I was wondering why there’d be quite so many connections open to be passed on to the child processes in the first place. It seems to me that each connection used during a catalog run is kept open until the http_pool.clear_http_instances is called at the end of the run. With keep_alive, they might get re-used, but without, a new one is created every time http_pool.http_instance is called, and they stack up in huge numbers.

This is, quite clearly, nuts.

I don’t have a clear enough grasp of the structure of this thing to know where they should be being dumped yet, but it should be relatively simple for someone with a better grasp of the intended function of it to find somewhere to close the connections.

It wouldn’t obviate the need for close-on-exec, but it would make the consequences of not having it (it does look like the HTTP and XMLRPC classes might make it awkward to get it set) far less painful.

I’d be very happy (understatement, see ;–) ) to try out patches for this issue if there were any… otherwise I’ll just keep on fumbling around until I can get it to work.

Updated by Luke Kanies almost 3 years ago

The extra connections were fixed in #961, hopefully, so that should at least mitigate the problem.

Updated by Nick Phillips almost 3 years ago

Seems resolved with #961 – so long as there will never be connections open when an exec is performed. Which probably won’t be the case if keep_alive ever returns.

Cheers,

Nick

Updated by Nigel Kersten over 1 year ago

  • Status changed from Accepted to Closed
  • Assignee set to Nigel Kersten

Also available in: Atom PDF