Bug #961
puppetd creating too many/not closing TCP connections
| Status: | Closed | Start: | ||
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | % Done: | 0% |
||
| Category: | network | |||
| Target version: | 0.24.8 | |||
| Affected version: | 0.24.1 | Branch: | ||
| Keywords: | ||||
| Votes: | 0 |
Description
With a single client puppetd run, established concurrent TCP connections between the client and the puppetmaster often exceeds 100.
If two or three clients are run in parallel, and sometimes even with a single client, the number of connections exceeds 256. On my Solaris 10 puppetmaster, the maximum number of open files defaults to 256, when this limit is exceeded for the puppetmasterd process, no further connections can be opened and we receive the following error message on the client(s):
warning: Other end went away; restarting connection and retrying
Before starting a client run, the number of connections is zero:
$ netstat -an | grep 8140.*ESTABLISH | wc -l
0
Soon after I start puppetd on the client:
puppetmaster $ netstat -an | grep 8140.*ESTABLISH | wc -l
101
Prior to 0.24.0 I don’t believe puppetd opened many connections to the puppetmaster concurrently.
I have needed to increase the limit on maximum open files on the puppetmaster, but this is not an ideal workaround as having all those sockets open consumes a lot of system resources and i’m still very limited in the number of clients we can run concurrently. So I think that this is a major issue.
Associated revisions
Revision 933b1df6d84ec34a6ff347240c0151434ecc80a9
Fixing #961 — closing existing, open connections when a new connection is requested, and closing all connections at the end of each run.
Revision 5e35166f1b7219d470916d1ee7a4e6852afaf1f9
Fixing #961 – closing the http connection after every xmlrpc call
There were apparently some circumstances that resulted in the connection not being closed; this just closes it every time if it’s still open after the rpc call is complete.
Signed-off-by: Luke Kanies luke@madstop.com
History
Updated by Luke Kanies over 2 years ago
- Status changed from 1 to Closed
- 7 set to fixed
This should be fixed in [933b1df6d84ec34a6ff347240c0151434ecc80a9], and I’ll get a release out with this fix soon.
I think there’s still a logic bug somewhere, though, so I want to look around a bit more.
Updated by martin krafft over 1 year ago
- Category set to network
- Status changed from Closed to Re-opened
- Assignee changed from Puppet Community to Luke Kanies
- Affected version set to 0.24.1
The 933b1df6d84ec34a6ff347240c0151434ecc80a9 commit might have fixed the problem, but according to James, it’s been refactored before 0.24.1 came out, and as a result, neither 0.24.1 or any later version includes the fix in lib/puppet/network/xmlrpc/client.rb, but in lib/puppet/network/http_pool.rb.
It seems, however, that the fix is buggy or a new bug has been introduced since then, because my 0.24.5 and 0.24.7 clients both leave plenty of connections around. I think that this is the root cause of Debian bug #502193 (http://bugs.debian.org/502193).
Updated by martin krafft over 1 year ago
With the help of Luke (hi Luke!), we’ve found the following:
The problem seems to be in lib/puppet/network/xmlrpc/client.rb at line 40:
begin
-> call("%s.%s" % [namespace, method.to_s],*args)
rescue [...]
It seems that the connection after the call() method is never closed. The following patch verifies and fixes that, but it only closes the connection if there isn’t an exception (it’s only a proof-of-concept):
--- a/lib/puppet/network/xmlrpc/client.rb
+++ b/lib/puppet/network/xmlrpc/client.rb
@@ -39,6 +39,10 @@ module Puppet::Network
Puppet.debug "Calling %s.%s" % [namespace, method]
begin
result = call("%s.%s" % [namespace, method.to_s],*args)
+ if http.started?
+ Puppet.err "HTTP instance still open after call to %s.%s" % [namespace, method]
+ http.finish
+ end
result
rescue OpenSSL::SSL::SSLError => detail
if detail.message =~ /bad write retry/
Puppet.warning "Transient SSL write error; restarting connection and retrying"
This outputs a whole lot of errors for me, so Luke has agreed to rewrite/refactor/fix this, and I will test.
Whether it fixes the Debian bug is currently unknown.
Updated by Luke Kanies over 1 year ago
- Status changed from Re-opened to Ready for Testing
- Assignee changed from Luke Kanies to James Turnbull
- Target version set to 0.24.8
Fixed in tickets/0.24.x/961 in my repo.
Updated by James Turnbull over 1 year ago
- Status changed from Ready for Testing to Closed
Pushed in commit:“5e35166f1b7219d470916d1ee7a4e6852afaf1f9” in branch 0.24.x