Bug #1812

YAML files corrupted on server (due to high load?)

Added by Nigel Kersten over 1 year ago. Updated 4 months ago.

Status:Closed Start:12/09/2008
Priority:Urgent Due date:
Assignee:James Turnbull % Done:

100%

Category:plumbing
Target version:0.24.7
Affected version:0.24.6 Branch:
Keywords:
Votes: 0

Description

From peter’s mail to puppet-dev

Hi

it looks like it can happen that a node-yaml for a certain node gets broken. I had this now already a small amount of times and every time only a few (2-3) nodes were affected.

So whats the actual problem?

Suddenly I find Log entries like:

Tue Dec 09 15:34:27 +0100 2008 Puppet (err): Could not read YAML data for node foobar: syntax error on line 11, col 14: ` xen_domains: “3”'

in the puppetmaster.log and the master can’t compile the node –> the node therefore won’t get newer manifests, however it looks like the node itself gets in a corrupetd state and is unable to apply a cached manifest.

I can fix this problem by deleting the yaml file of that certain node in $puppetmaster_dir/yaml/node/ .

It often looks like that the master had a high load when this corrupt occurs. However I couldn’t yet find a way to reproduce it, but from discussion in IRC it looks like other people also have randomly this problem. Randomly as it’s not always the same node that has this problem and randomly that it happens very rarely.

So this looks certainly like a bug. However I was unsure if the data I gathered until now might be sufficient to file a bug. As well as I was in this more something-happens-magically-situation I’d rather like to investigate a bit more and maybe even come up with a solution or at least with an idea for a solution.

It looks like the yaml data got broken, as it might have happen due to the highload that there have been problems during the transmission or writing. Deleting the corrupt YAML file fixes the problem and as far as I saw it doesn’t have any impact on the next run of the node. After examining the logs on the master and the client, it looks like the problem first occurs on the master. During the time it happened the first time it might be reasonable that the master had a very high load.

A solution I thought of might be to simply delete the yaml file on the master. The client could then exit with an error (like the present one) and if it rerun the next time everything would be fine. But this might be not the right way to fix. As I can’t yet see when the yaml file is transferred, nor what the actual impact it has on compiling the manifest etc. I mean we could also simply delete it and restart again the client-run procedure (if that is possible), so we can fix the problem within a client-run (maybe with a max retries of 3). Another option might be to check if the yaml data get stored correctly and if not and if the yaml in the memory is still correct rewrite it, otherwise request it again from the client. Another idea I had is that it might be a problem in the yaml lib of ruby or whatever.

So do you guys think if this is certainly a bug and what would be the best location to look for the actual problem and what might be the best solution for it?

Testing the solution would be very easy: simply corrupt the yaml file and see if puppet behaves the expected way. However I’m yet really unsure how to reproduce the actual cause.

thanks for additional ideas or information. If I have a more concrete idea what might be the actual source of the problem and what might be the best way to fix the problem I’m more confident to file a bug.

cheers pete

Corroborated by myself and Oliver Hookins


Related issues

related to Puppet - Bug #3370: YAML read/write locks broken on OpenBSD Needs more information 03/15/2010

Associated revisions

Revision 45144a1b9da2839fd9f8a182a8f82ecb06e17292
Added by Luke Kanies over 1 year ago

Fixing #1812 (hopefully) – adding read and write locks to yaml.

It’s obviously not really possible to test that this fixes it, but I’m confident that the locks work, and now we’re using them, so it should.

Signed-off-by: Luke Kanies luke@madstop.com

Revision cf19bd8dea141a59cdff5a7f1edc56d3620ab0e2
Added by Luke Kanies over 1 year ago

Not using a temporary file when locking files for writing.

The temporary file was not actually useful, because we could never really get atomic renames, for annoying, complicated reasons.

This hopefully finally fixes #1812.

Signed-off-by: Luke Kanies luke@madstop.com

History

Updated by Nigel Kersten over 1 year ago

And lak’s response

This is likely the problem.

I’ve not worried about locking the node files (or any of the yaml files) because they shouldn’t be rewritten, generally. It’s obviously happening, though.

Another thing that could be a problem, which doesn’t require a corrupt file, is that the file is read while it is being written. I guess I’d expect this to happen a bit more often.

Updated by Nigel Kersten over 1 year ago

I have some examples of corrupt files if that helps debug the issue by the way. I should be able to sanitize them to post publicly if that would help.

Updated by Luke Kanies over 1 year ago

  • Category set to plumbing
  • Status changed from Unreviewed to Accepted
  • Assignee set to Luke Kanies
  • Priority changed from Normal to Urgent
  • Target version set to 0.24.7

I think it’s pretty obvious where the failures are happening, so we just need to protect those. I don’t think examples are necessary, but thanks.

I’m assigning this to 0.24.7, but it’s up to James as to whether he’d let it be merged in, once I fix it.

Updated by James Turnbull over 1 year ago

If it’s complete shortly then I am happy for it to go into 0.24.7 – it looks serious enough.

Updated by Luke Kanies over 1 year ago

  • Status changed from Accepted to Ready for Checkin
  • Assignee changed from Luke Kanies to James Turnbull

I’ve got fixes for this in the tickets/0.24.x/1812 branch in my repo.

Updated by James Turnbull over 1 year ago

  • Status changed from Ready for Checkin to Closed

Pushed in commit:“45144a1b9da2839fd9f8a182a8f82ecb06e17292” in branch 0.24.x

Updated by Luke Kanies over 1 year ago

  • Status changed from Closed to Re-opened
  • Assignee changed from James Turnbull to Luke Kanies

This still needs a little work.

Updated by Luke Kanies over 1 year ago

I’ve decided to forego the tmp file. It’s there to provide atomicity, but we get half of our atomicity from the write lock (no one can read an inconsistent state), and the other half is unavailable regardless of what we do (if someone kills us while we’re writing the file, we can’t recover enough to fix the original file).

As written, the current code will empty the existing file if there’s an exception. That will still happen, but I don’t see a way around it, and this way the code is much cleaner, and atomic renames don’t work if you don’t have write access to the directory. Yay.

Updated by Luke Kanies over 1 year ago

  • Status changed from Re-opened to Ready for Checkin
  • Assignee changed from Luke Kanies to James Turnbull

Fixed (again) in tickets/0.24.x/1812 in my repo (a new branch, make sure it does a non-fast-forward pull).

Updated by Luke Kanies over 1 year ago

  • Status changed from Ready for Checkin to Closed
  • % Done changed from 0 to 100

Applied in changeset commit:“cf19bd8dea141a59cdff5a7f1edc56d3620ab0e2”.

Also available in: Atom PDF