Bug #3362

Splay drift occurs when Passenger/Mongrel is under too much load.

Added by Dan Bode almost 2 years ago. Updated 5 months ago.

Status: Needs More Information
Start date: 03/11/2010
Priority: Low
Due date:
Assignee: Nigel Kersten
% Done: 0%
Category: plumbing
Target version: 2.7.x
Affected Puppet version: 0.25.4
Branch: http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3362
Keywords: passenger load splay mongrel connection timeouts
Votes: 1

Description

Not sure if this counts as a bug…

I could not concretely prove the assumptions below. I did some investigation and this is my best guess as to the cause.

Splay was drifting across hundreds of machines: over time, most were checking in at the same time, while at other times none were checking in. Here is my theory as to why.

Splay is applied only once, on the first run after puppet starts.

Assumption (unverified): runinterval starts counting only after the client finishes its previous run.

Here is the chain of events that I think causes this:

  1. Passenger or Mongrel comes under heavy load.
  2. All worker processes are in use, so incoming hosts start queuing.
  3. Once a machine falls into the queue, it stays grouped with the machines that filled the queue: from then on its schedule is driven only by runinterval, so it checks in at the same time as the other machines that were running at that moment.
  4. Over time, the splay drifts until most machines are checking in at the same time.

Basically, once performance starts to degrade, the splaying breaks down, which makes the load problem much worse.
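The chain of events above can be explored with a toy model. The Python sketch below is not Puppet code. It assumes, as the ticket does without proof, that splay applies only to the first run and that runinterval is counted from the end of the previous run. The function name simulate, the fixed service_time, and the FIFO worker pool are hypothetical simplifications. It illustrates only the mechanism in step 3: a queueing delay shifts an agent's schedule permanently, because nothing re-splays it afterwards.

```python
import heapq

def simulate(splays, runinterval, service_time, workers, horizon):
    """Return {agent_index: [request times]} for a FIFO pool of `workers`.

    Assumptions (from the ticket, unverified): splay is used only for the
    first request, and the next request is sent runinterval seconds after
    the previous run finishes.
    """
    # Pending requests as (request_time, agent_index), earliest first.
    arrivals = [(splay, agent) for agent, splay in enumerate(splays)]
    heapq.heapify(arrivals)
    # Times at which each worker process becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    history = {agent: [] for agent in range(len(splays))}

    while arrivals:
        t, agent = heapq.heappop(arrivals)
        if t > horizon:
            break
        history[agent].append(t)
        # The request waits in the queue until some worker is free.
        start = max(t, heapq.heappop(free_at))
        finish = start + service_time
        heapq.heappush(free_at, finish)
        # No re-splay: the next request depends only on when this run ended.
        heapq.heappush(arrivals, (finish + runinterval, agent))
    return history

history = simulate([0, 1, 2, 3, 50], runinterval=100, service_time=10,
                   workers=2, horizon=250)
for agent, times in history.items():
    print(agent, times)
```

With two workers and agents that start at 0, 1, 2 and 3 seconds, the third and fourth agents have to wait for a free worker on their first run. From then on they check in at 120 and 121 seconds rather than 112 and 113, and the model never corrects that shift.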

History

Updated by James Turnbull almost 2 years ago

  • Category set to plumbing
  • Status changed from Unreviewed to Investigating
  • Target version set to 2.7.x

Updated by Markus Roberts almost 2 years ago

  • Branch set to http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3362

Branch up for testing: a minimal modification that re-splays each run by ¼ of splaylimit to prevent clumping.
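As a rough illustration of what re-splaying on every run could look like, here is a short Python sketch. It is not the code in the branch: the function names, the uniform distribution, and the choice to add the offset on top of runinterval are all assumptions made for the example. Only the ¼ splaylimit bound comes from the comment above. Run duration is ignored to keep the sketch small.

```python
import random

def next_run_delay(runinterval, splaylimit, rng=random):
    """Seconds to wait before the next run.

    Assumed form of the re-splay: runinterval plus a fresh random offset
    of at most a quarter of splaylimit, drawn again for every run.
    """
    return runinterval + rng.uniform(0, splaylimit / 4)

def schedule(first_splay, runinterval, splaylimit, runs, rng=random):
    """Check-in times for `runs` runs, starting at the initial splay offset."""
    times = [first_splay]
    for _ in range(runs - 1):
        times.append(times[-1] + next_run_delay(runinterval, splaylimit, rng))
    return times

# Two agents that start at the same moment no longer stay in lockstep.
agent_a = schedule(0, runinterval=1800, splaylimit=1800, runs=5,
                   rng=random.Random(1))
agent_b = schedule(0, runinterval=1800, splaylimit=1800, runs=5,
                   rng=random.Random(2))
print(agent_a)
print(agent_b)
```

Because each agent draws a new offset on every run, machines that were pushed onto the same schedule by a queue would be expected to spread apart again over a few runs, at the cost of a slightly longer average interval.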

Updated by Markus Roberts almost 2 years ago

Backport of the change on http://github.com/MarkusQ/puppet/tree/ticket/0.24.8/3362

Updated by Mark Plaksin almost 2 years ago

Can we have an option to disable this change? We have adapted to how splay currently works and our client checkins are almost perfectly spread out. We have 400+ clients and have 15-20 checkins each minute. Last time splay changed we spent a lot of time re-adjusting to spread out the load.

This change might be awesome and work perfectly for us (and everybody else) but it might have unanticipated consequences so a switch to leave things as they were before the change would be great.

Updated by Markus Roberts almost 2 years ago

Fear not —

1) this is not presently a proposed change in behaviour, merely a probe to try to sort out the problem and test the main theory as to its cause

2) if it turns out that such functionality was needed, it would be off by default and require explicitly setting an option to enable it

— Markus

Updated by Dan Bode over 1 year ago

The customer we worked with on this has applied the patch and confirmed that it resolved their splay issue.

Updated by James Turnbull over 1 year ago

  • Assignee set to Markus Roberts

Updated by Markus Roberts over 1 year ago

  • Status changed from Investigating to Accepted

Updated by James Turnbull 12 months ago

  • Status changed from Accepted to In Topic Branch Pending Review

Updated by James Turnbull 6 months ago

  • Status changed from In Topic Branch Pending Review to Needs Decision
  • Assignee changed from Markus Roberts to Nigel Kersten

This probably needs review and comment from you, Nigel.

Updated by Nigel Kersten 5 months ago

  • Status changed from Needs Decision to Needs More Information
  • Priority changed from Normal to Low

Have we been able to confirm Dan’s original assumptions?

I’m not seeing widespread clamor over this issue other than Mark’s request to not change existing behavior :)
