Feature #10465

Provide a new "obsessive" mode which queries resources again after sync

Added by Oliver Hookins 7 months ago. Updated 4 months ago.

Status: Needs Decision
Start date: 11/02/2011
Priority: Normal
Due date:
Assignee: Nigel Kersten
% Done: 0%
Category: provider
Target version: -
Affected Puppet version:
Branch:
Keywords:
Votes: 0

Description

Quite frequently there will be cases where the providers think they have done the right thing and report success even though the end result is not successful. This results in continual runs where there are successful changes but the overall outcome is the same – the system state is not what you want it to be.

I would like an optional mode that triggers a second query from the provider after the sync has occurred, to check whether the desired changes were actually made. If they were not, raise a real error (which simply reflects the state of the machine more accurately than skipping the check would).
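To make the request concrete, here is a minimal sketch of the query, sync, query-again flow. It is illustrative Python, not Puppet code: `Resource`, `apply`, the `obsessive` flag and `VerificationError` are all hypothetical names and do not mirror Puppet's real provider API.

```python
from dataclasses import dataclass
from typing import Callable


class VerificationError(Exception):
    """The provider reported success, but a fresh query disagrees."""


@dataclass
class Resource:
    # Hypothetical stand-in for a resource and its provider; not Puppet's API.
    name: str
    desired: str
    query: Callable[[], str]      # reads the current state from the system
    sync: Callable[[str], None]   # tries to move the system to the given state


def apply(resource: Resource, obsessive: bool = False) -> bool:
    """Apply one resource. Returns True if a sync was attempted."""
    if resource.query() == resource.desired:
        return False  # already in sync, nothing to do

    resource.sync(resource.desired)  # provider raises on failures it notices

    if obsessive:
        # Second query: catches failures the provider did not notice.
        actual = resource.query()
        if actual != resource.desired:
            raise VerificationError(
                f"{resource.name}: wanted {resource.desired!r}, "
                f"still {actual!r} after sync"
            )
    return True
```

With `obsessive=False`, a provider whose `sync` silently does nothing would report a successful change on every run. With `obsessive=True`, the same provider produces an error on the first run.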

Where Puppet is used for larger orchestrated upgrades, this is essential for determining whether the desired changes completed successfully, so that attention can turn to the next machine(s) in the workflow.

History

Updated by James Turnbull 7 months ago

  • Category set to provider
  • Status changed from Unreviewed to Needs Decision
  • Assignee set to Nigel Kersten

Updated by Daniel Pittman 7 months ago

Oliver Hookins wrote:

Quite frequently there will be cases where the providers think they have done the right thing and report success even though the end result is not successful. This results in continual runs where there are successful changes but the overall outcome is the same – the system state is not what you want it to be.

I assume you mean things like a service where a call to /etc/init.d/service status returns 0, even though the service failed to start? (Just to make sure we are looking at the same problem. :)

Updated by Nigel Kersten 4 months ago

  • Status changed from Needs Decision to Needs More Information
  • Assignee changed from Nigel Kersten to Oliver Hookins

Situations like this feel like they’re a bug in the providers or the infrastructure they rely on, not in the system itself.

Can you provide more details around these situations Oliver?

Updated by Oliver Hookins 4 months ago

The unfortunate reality is that providers do have bugs, and the number of possible situations a provider can encounter is infinite. No amount of combinatorial testing will uncover every possibility, nor will you be able to code the providers to test for all possible problems they might encounter (e.g. files being set immutable in the filesystem would probably cause problems for most providers, silently).

Given that, we still want to distinguish between a provider failing to do its job and a system in constant entropy, and reporting back whether a provider’s attempted changes actually took effect fills that gap in functionality. The former case might be something like acpid, which cannot be started inside an OpenVZ container (even if the init script reports it has been started), and the latter might be an angry sysadmin who logs into the system and disables acpid before the next Puppet run.

There are just too many buggy pieces of software to adequately handle, but if Puppet were to implement this feature we’d at least get a more reliable insight into bugs vs natural/unnatural system entropy. Does that make more sense?

Updated by Oliver Hookins 4 months ago

Actually, the main use case I had for it was already in my summary. When using Puppet as a deployment tool, you really want to be 100% sure you have succeeded on the current machine before moving on to the next. If you accept the end status of a Puppet run as an indication of the success of all of the providers, you can easily take down an entire cluster of machines serving an application simply because the init script of your application server incorrectly returned 0 when it shouldn’t have.

Doing some more obsessive checks gives any deployment system wrapping around Puppet a better view of what has actually been achieved on each system and opens up the possibility to stop a potentially devastating upgrade or deployment before it takes all the machines down.
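As a rough illustration of the wrapper side, the sketch below shows a rolling deployment that halts at the first host whose result cannot be verified. It is plain Python with hypothetical names (`rolling_deploy`, `deploy`, `verify`); how a real wrapper would run Puppet and read back the verified result is left out and assumed.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_deploy(
    hosts: Iterable[str],
    deploy: Callable[[str], None],   # hypothetical: runs the change on one host
    verify: Callable[[str], bool],   # hypothetical: re-checks the host afterwards
) -> Tuple[List[str], Optional[str]]:
    """Deploy host by host; stop at the first host that fails verification.

    Returns (verified_hosts, failed_host). failed_host is None on full success.
    """
    verified: List[str] = []
    for host in hosts:
        deploy(host)
        if not verify(host):
            # Stop here so the remaining hosts keep serving the old version.
            return verified, host
        verified.append(host)
    return verified, None
```

The point of the design is that a single unverified host costs one machine rather than the whole cluster, because the remaining hosts are never touched.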

Updated by Nigel Kersten 4 months ago

  • Status changed from Needs More Information to Needs Decision
  • Assignee changed from Oliver Hookins to Nigel Kersten

because the init script of your application server incorrectly returned 0 when it shouldn’t have.

I’m not quite seeing how a second check would actually help this situation Oliver?

Unless you’re suggesting that we have a second check that uses a different method to determine the resource state, a second check is going to return the same result, isn’t it?

Updated by Oliver Hookins 4 months ago

No, most (all?) init scripts return a status based on being able to spawn a daemon process successfully, when you call ‘start’. Some probably just assume it has been started correctly if they can retrieve the PID of the child process successfully.

‘Status’ on the other hand should be testing (at the very least) whether a process with the PID found in the PID file exists, and quite possibly will attempt some basic IPC with that process to verify it is operating.

In any case, the execution paths for starting and checking the status of a running process are entirely different, and this is the difference I want to expose.
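To illustrate the two paths, here is a small sketch of the kind of check a ‘status’ call might perform, compared against the exit code that ‘start’ returned. It is illustrative Python under assumptions: a POSIX system, a PID file containing a single integer, and the hypothetical helper names `pid_is_alive`, `status_from_pid_file` and `classify_start`. Real init scripts vary widely.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def status_from_pid_file(path: str) -> bool:
    """The 'status' path: read the PID file and check that the process exists."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed PID file
    return pid_is_alive(pid)


def classify_start(start_exit_code: int, pid_file: str) -> str:
    """Compare what 'start' claimed with what the 'status' path observes."""
    if start_exit_code != 0:
        return "start failed"
    if not status_from_pid_file(pid_file):
        return "start reported success but service is not running"
    return "running"
```

The middle outcome is the one this request is about: ‘start’ exits 0, yet an independent look at the process says otherwise.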
