Bug #2661
puppetd exits if the master is unreachable.
| Status: | Closed | Start date: | 09/20/2009 | |
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | % Done: | 0% |
||
| Category: | plumbing | |||
| Target version: | 0.25.2 | |||
| Affected Puppet version: | 0.25.0 | Branch: | http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/2661 | |
| Keywords: | ||||
| Votes: | 2 |
Description
In the event that the master becomes unreachable the puppetd will exit instead of use the cache, retry later etc like older puppetd did.
# iptables -I INPUT -s puppet -j DROP # puppetd --debug --trace --no-daemonize/usr/lib/ruby/1.8/timeout.rb:54:in `open': execution expired (Timeout::Error) from /usr/lib/ruby/1.8/net/http.rb:560:in `connect' from /usr/lib/ruby/1.8/timeout.rb:56:in `timeout' from /usr/lib/ruby/1.8/timeout.rb:76:in `timeout' from /usr/lib/ruby/1.8/net/http.rb:560:in `connect' from /usr/lib/ruby/1.8/net/http.rb:553:in `do_start' from /usr/lib/ruby/1.8/net/http.rb:542:in `start' from /usr/lib/ruby/1.8/net/http.rb:1035:in `request' from /usr/lib/ruby/1.8/net/http.rb:772:in `get' ... 28 levels... from /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:217:in `run' from /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:306:in `exit_on_fail' from /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:217:in `run' from /usr/sbin/puppetd:159
Related issues
History
Updated by James Turnbull over 2 years ago
- Category set to plumbing
- Status changed from Unreviewed to Accepted
Updated by James Turnbull over 2 years ago
- Target version set to 0.25.2
Updated by James Turnbull over 2 years ago
This is apparently a regression from the 0.24.x branch.
Updated by James Turnbull over 2 years ago
- Target version changed from 0.25.2 to 0.25.1
Updated by Markus Roberts over 2 years ago
When I try this I get essentially the same stack trace (I see the 28 omitted lines) and then:
err: Could not retrieve catalog from remote server: Operation timed out - connect(2) debug: Using cached catalog for test.example.com notice: Using cached catalog
And then it does it and loops. Is there perhaps some other aspect of the situation?
Updated by Markus Roberts over 2 years ago
- Status changed from Accepted to Needs More Information
- Assignee set to R.I. Pienaar
Looking at the underlying code (on master) I’m not seeing how this could happen, and can’t find any recent changes that look like they would affect the behaviour.
Can anyone else reproduce this or offer suggestions on what the missing factor might be?
Updated by R.I. Pienaar over 2 years ago
Markus Roberts wrote:
When I try this I get essentially the same stack trace (I see the 28 omitted lines) and then:
[…]
And then it does it and loops. Is there perhaps some other aspect of the situation?
Are you using actual 0.25.0 code or 0.25.x? Asked someone on IRC who confirms 0.25.1rc1 behaves properly. Can’t yet find anyone else who is on 0.25.0
I am not sure how to get ruby to show me the 28 removed lines of stacktrace, that would have been helpful too I guess.
Updated by R.I. Pienaar over 2 years ago
I updated a machine to 0.25.1rc1 and same effect.
I don’t know. I can give you access to the machine if you wish other than that I am out of ideas.
Updated by Markus Roberts over 2 years ago
R.I. Pienaar aka Volcane wrote:
I updated a machine to 0.25.1rc1 and same effect.
I don’t know. I can give you access to the machine if you wish other than that I am out of ideas.
Letting me poke at the machine may be the fastest way. Also, if you post the full transcript without
Updated by R.I. Pienaar over 2 years ago
Markus Roberts wrote:
R.I. Pienaar aka Volcane wrote:
I updated a machine to 0.25.1rc1 and same effect.
I don’t know. I can give you access to the machine if you wish other than that I am out of ideas.
Letting me poke at the machine may be the fastest way. Also, if you post the full transcript without
ing there may be some buried clew.
I don’t cut out the 28 lines of backtrace, I guess its ruby, that’s exactly how it in my terminal :(
I’ll sort out access to a machine and let u know.
Updated by Markus Roberts over 2 years ago
Markus Roberts wrote:
Also, if you post the full transcript without
ing there may be some buried clew. I don’t cut out the 28 lines of backtrace, I guess its ruby, that’s exactly how it in my terminal :(
I got that part (the 28 stack levels) ok in my reproduction. I was referring to the part above that, where you wrote “
The reason I’m interested, we may have a case where two coincident errors knock it out even though one (failing to reach the server) alone would not. This can happen when, for example, an error occurs in the process of reporting another error. There may be something in the surrounding material that points to how/where this might be happening.
Updated by R.I. Pienaar over 2 years ago
Here’s the whole thing
http://nephilim.ml.org/~rip/puppet/2661.txt
Updated by Markus Roberts over 2 years ago
- Status changed from Needs More Information to Accepted
I’ve now been able to reproduce this, but not with any great confidence in my understanding. There must be a path through the rescues; I’ll try to track it down.
Updated by Markus Roberts over 2 years ago
- Assignee changed from R.I. Pienaar to Markus Roberts
There are at least two ways this can happen.
Updated by Markus Roberts over 2 years ago
- Status changed from Accepted to In Topic Branch Pending Review
- Branch set to http://github.com/MarkusQ/puppet/tree/ticket/master/2661
The general problem is that setup routines may depend on network connectivity but aren’t always protected by rescue clauses.
The attached branch (http://github.com/MarkusQ/puppet/tree/ticket/master/2661) is primarily intended for testing. It addresses the two places where we can conclusively demonstrate the problem, but leaves the larger issue potentially unresolved.
I’d eventually favor a broader refactoring to deal with the issue (e.g. as part of the proposed error recovery / reporting cleanup) but the present targeted fix is more appropriate for a bug-fix release.
Updated by Markus Roberts over 2 years ago
- Status changed from In Topic Branch Pending Review to Ready For Checkin
Updated by James Turnbull over 2 years ago
- Status changed from Ready For Checkin to Closed
Pushed in commit:594c774dd466c85eae829eaba3898d02ae9538ca in branch 0.25.x
Updated by Peter Meier over 2 years ago
i frequently keep getting timeout errors from clients, I think mainly the following one:
/usr/lib/ruby/1.8/timeout.rb:54:in `open': execution expired (Timeout::Error)
from /usr/lib/ruby/1.8/net/http.rb:560:in `connect'
from /usr/lib/ruby/1.8/timeout.rb:56:in `timeout'
from /usr/lib/ruby/1.8/timeout.rb:76:in `timeout'
from /usr/lib/ruby/1.8/net/http.rb:560:in `connect'
from /usr/lib/ruby/1.8/net/http.rb:553:in `do_start'
from /usr/lib/ruby/1.8/net/http.rb:542:in `start'
from /usr/lib/ruby/1.8/net/http.rb:1035:in `request'
from /usr/lib/ruby/1.8/net/http.rb:772:in `get'
... 45 levels...
from /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:217:in `run'
from /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:306:in `exit_on_fail'
from /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:217:in `run'
from /usr/sbin/puppetd:159
with 0.25.1
should I open a new bug report about that or reopen this one?
Updated by James Turnbull over 2 years ago
Peter – can you do a new bug report please.
Updated by R.I. Pienaar over 2 years ago
James Turnbull wrote:
Peter – can you do a new bug report please.
I think we should keep it open, my original reported problem still exist too.
I’ve moved to cron so didnt test this after the ticket got closed, but it should be re-opened.
Updated by James Turnbull over 2 years ago
- Status changed from Closed to Investigating
- Target version changed from 0.25.1 to 0.25.2
Updated by Markus Roberts over 2 years ago
- Status changed from Investigating to Accepted
Updated by Peter Meier over 2 years ago
this introduces also another problem: it might happen that the lockfile isn’t removed. Hence puppet doesn’t run anymore at all. :/ Maybe I should open a separate bug about that?
Updated by Markus Roberts over 2 years ago
After some initial frustration I’ve found a way to reproduce these cases semi-systematically, by adding the following monkey patch to lib/puppet/util/monkey_patches.rb:
require "timeout"
module Timeout
def timeout(sec, exception=Error)
raise exception, "fake timeout" if rand(100) == 0
return yield if sec == nil or sec.zero?
raise ThreadError, "timeout within critical session" if Thread.critical
begin
x = Thread.current
y = Thread.start {
sleep sec
x.raise exception, "execution expired" if x.alive?
}
yield sec
ensure
y.kill if y and y.alive?
end
end
module_function :timeout
end
It basically forces a timeout on 1% of the operations (low enough that it doesn’t always die in the same place, but high enough that we see lots of failures).
I’m letting it run semi-autonomously for a while to collect code paths, then will revise the patch.
I don’t think we need a separate bug report for the lockfile symptom since the solution would be the identical—find the code paths that are escaping exception handling and rectify them.
Updated by Markus Roberts over 2 years ago
- Status changed from Accepted to In Topic Branch Pending Review
I’ve updated the branch to cover the additional cases found by my automated testing. In the last round (~2500 catalog runs with 1% chance of forced timeout on each operation) no additional cases turned up.
Updated by James Turnbull about 2 years ago
- Status changed from In Topic Branch Pending Review to Code Insufficient
Markus – can you rebase your branch on 0.25.x please.
Updated by Markus Roberts about 2 years ago
- Branch changed from http://github.com/MarkusQ/puppet/tree/ticket/master/2661 to http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/2661
Bit rot. Sorry ‘bout that.
Updated by Markus Roberts about 2 years ago
- Status changed from Code Insufficient to Ready For Checkin
Note new branch name: http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/2661
Updated by Peter Meier about 2 years ago
Markus Roberts wrote:
Note new branch name: http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/2661
hmm it conflicts to merge this branch into the latest 0.25.x. I even get json files etc. which I think that shouldn’t be there anymore? Merging the master gives no errors, but I’m unsure wether this one is the right one.
also if I look at the history: http://github.com/MarkusQ/puppet/commits/ticket/0.25.x/2661/ it looks like it’s far away from the upstream 0.25.x branch, isn’t it?
Updated by James Turnbull about 2 years ago
- Status changed from Ready For Checkin to Code Insufficient
This one doesn’t merge either. Not based on 0.25.x I think.
Updated by Markus Roberts about 2 years ago
- Status changed from Code Insufficient to Ready For Checkin
This should be (but then I thought that last time). If it’s not I may need your help resolving the problem.
Updated by James Turnbull about 2 years ago
- Status changed from Ready For Checkin to Closed
Pushed in commit:adc0a4ed939a717e8735485d493bde28ceab5ac0 in branch 0.25.x
Updated by Peter Meier about 2 years ago
James Turnbull wrote:
Pushed in commit:adc0a4ed939a717e8735485d493bde28ceab5ac0 in branch 0.25.x
ok I had this now running for some weeks and I have nearly no problems anymore. However from time to time (maybe once a week) a random client doesn’t remove it’s lockfile, hence future runs fail. I assume this might still happen due to a uncatched exception, however the problem is a) hard or nearly impossible to reproduce and b) it occurs really by random. The only thing I can see in the logs:
Nov 30 19:27:41 foobar puppetd[26228]: Finished catalog run in 98.79 seconds Nov 30 20:00:02 foobar puppetd[3000]: Could not retrieve catalog from remote server: Error 502 on SERVER: ^M502 Bad Gateway ^M ^M^M 502 Bad Gateway
nginx/0.6.39 ^M ^M ^M Nov 30 20:00:03 foobar puppetd[3000]: Using cached catalog Nov 30 20:00:03 foobar puppetd[3000]: Could not retrieve catalog; skipping run Nov 30 20:00:04 foobar puppetd[12169]: Run of Puppet configuration client already in progress; skipping Nov 30 20:30:04 foobar puppetd[21230]: Run of Puppet configuration client already in progress; skipping
as I run puppetd by cron twice an hour with —splay I assume that the run between 19:30 and 20:00 got delayed till 20:00. At this time (20:00) a puppetmaster restart happens and due to that the 502 occured. This was the run of pid 3000, the next run (pid 12169) failed, this could either be as pid 3000 was still running or because there was already no puppetd anymore running and the lock file haven’t been removed. However every future run failed as well as the lockfile wasn’t removed.
Should I file this rather as a new bug report?
Updated by Peter Meier about 2 years ago
I opened a new bugreport as #2888