Bug #3362
splay drift occurs when passenger/mongrel get too much load.
| Status: | Needs More Information | Start date: | 03/11/2010 |
|---|---|---|---|
| Priority: | Low | Due date: | |
| Assignee: | | % Done: | 0% |
| Category: | plumbing | | |
| Target version: | 2.7.x | | |
| Affected Puppet version: | 0.25.4 | Branch: | http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3362 |
| Keywords: | passenger load splay mongrel connection timeouts | | |
| Votes: | 1 | | |
Description
Not sure if this counts as a bug…
I could not concretely prove the assumptions below. I did some investigation, and this is my best guess as to the cause.
Splay was drifting for hundreds of machines, so that over time most were checking in at the same time, while at other times none were checking in. Here is my theory as to why.
Splay only runs the first time after puppet starts.
Assumption: runinterval starts counting only after the client finishes its last run?
Here is the chain of events that I think causes this:
- Passenger or Mongrel is under heavy load.
- All processes get used up, so incoming hosts start to queue.
- Once a machine falls into the queue, it gets stuck with the group of machines that caused the queue to fill up: it now relies on runinterval alone, so it checks in at the same time as the other machines that were running at that moment.
- Over time, splay drifts so that most machines are checking in at the same time.
Basically, once performance starts to degrade, the splaying falls apart, which makes performance much worse.
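The chain of events above can be illustrated with a toy queue model. This is only a sketch of the theory, not Puppet code: it takes the unproven assumption that runinterval counts from the end of the previous run as given, and the function name `simulate_checkins`, the fixed service time, and the FIFO queue are all hypothetical simplifications.

```python
import heapq

def simulate_checkins(offsets, service, runinterval, horizon, workers=1):
    """Toy FIFO model of agents checking in to a server with few workers.

    offsets      initial check-in time of each agent (its one-time splay)
    service      time the server needs for one run (assumed constant)
    runinterval  assumed to count from the END of the previous run
    horizon      stop once the next arrival is at or past this time
    Returns a list of (agent, arrival_time, wait) in arrival order.
    """
    pending = [(t, agent) for agent, t in enumerate(offsets)]
    heapq.heapify(pending)
    free_at = [0] * workers          # time at which each worker is next idle
    heapq.heapify(free_at)
    log = []
    while pending and pending[0][0] < horizon:
        arrival, agent = heapq.heappop(pending)
        worker_free = heapq.heappop(free_at)
        start = max(arrival, worker_free)    # queue if every worker is busy
        finish = start + service
        heapq.heappush(free_at, finish)
        log.append((agent, arrival, start - arrival))
        # The queue wait is baked into the next check-in time for good.
        heapq.heappush(pending, (finish + runinterval, agent))
    return log

if __name__ == "__main__":
    for agent, arrival, wait in simulate_checkins([0, 5, 50], 10, 100, 250):
        print(f"agent {agent} arrives at {arrival}, waits {wait}")
```

With one worker, a 10-second run, and splay offsets of 0, 5, and 50, agent 1 queues behind agent 0 once and from then on arrives exactly one service time after it (at 120 and 230, against 110 and 220 for agent 0). Its original splay offset is lost, which is the "stuck with the group" effect described above.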
History
Updated by James Turnbull almost 2 years ago
- Category set to plumbing
- Status changed from Unreviewed to Investigating
- Target version set to 2.7.x
Updated by Markus Roberts almost 2 years ago
- Branch set to http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3362
Branch up for testing; minimal modification to cause ¼ splaylimit resplaying with each run to prevent clumping.
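One possible reading of "¼ splaylimit resplaying with each run" is sketched below. It is not taken from the branch: the function name `next_run_delay` and the choice of a uniform random delay added on top of runinterval are assumptions made purely for illustration.

```python
import random

def next_run_delay(runinterval, splaylimit, rng=random):
    """Delay until the next run, re-splayed on every run.

    Assumption: each run adds a fresh random delay of up to a quarter
    of splaylimit, so agents that fell into step drift apart again.
    """
    return runinterval + rng.uniform(0, splaylimit / 4)

if __name__ == "__main__":
    rng = random.Random(3362)        # seeded only to make the demo repeatable
    for _ in range(3):
        print(round(next_run_delay(1800, 1800, rng), 1))
```

Under this reading, two agents that finish a run at the same moment would usually schedule their next runs at different times, rather than staying locked together.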
Updated by Markus Roberts almost 2 years ago
Backport of the change on http://github.com/MarkusQ/puppet/tree/ticket/0.24.8/3362
Updated by Mark Plaksin almost 2 years ago
Can we have an option to disable this change? We have adapted to how splay currently works and our client checkins are almost perfectly spread out. We have 400+ clients and have 15-20 checkins each minute. Last time splay changed we spent a lot of time re-adjusting to spread out the load.
This change might be awesome and work perfectly for us (and everybody else) but it might have unanticipated consequences so a switch to leave things as they were before the change would be great.
Updated by Markus Roberts almost 2 years ago
Fear not —
1) this is not presently a proposed change in behaviour, merely a probe to try to sort out the problem and test the main theory as to its cause
2) if it turns out that such functionality was needed, it would be off by default and require explicitly setting an option to enable it
— Markus
Updated by Dan Bode over 1 year ago
The customer we worked with on this has applied the patch and confirmed that it resolved their splay issue.
Updated by James Turnbull over 1 year ago
- Assignee set to Markus Roberts
Updated by Markus Roberts over 1 year ago
- Status changed from Investigating to Accepted
Updated by James Turnbull 12 months ago
- Status changed from Accepted to In Topic Branch Pending Review
Updated by James Turnbull 6 months ago
- Status changed from In Topic Branch Pending Review to Needs Decision
- Assignee changed from Markus Roberts to Nigel Kersten
This probably needs review and comment from you, Nigel.
Updated by Nigel Kersten 5 months ago
- Status changed from Needs Decision to Needs More Information
- Priority changed from Normal to Low
Have we been able to confirm Dan’s original assumptions?
I’m not seeing widespread clamor over this issue other than Mark’s request to not change existing behavior :)