Bug #2029

node yaml file can get corrupted

Added by Martin Englund over 1 year ago. Updated 4 months ago.

Status:Closed Start:02/26/2009
Priority:Normal Due date:
Assignee:Luke Kanies % Done:0%

Category:plumbing
Target version:0.25.0
Affected version:0.24.7 Branch:
Keywords:
Votes: 0

Description

A few times a week some of my nodes’ yaml files become corrupted:

Thu Feb 26 01:51:39 -0800 2009 //Node[default]/network/File[/etc/inet/netmasks] (err): Failed to retrieve current state of resource: Could not parse YAML data for node xxx.sun.com: syntax error on line 45, col 8: `  zones: global' Could not describe /files/etc/inet/netmasks: Could not parse YAML data for node xxx.sun.com: syntax error on line 45, col 8: `  zones: global' at /puppet/config/manifests/classes/network.pp:22

And when I take a look in the yaml file, it is easy to spot the corruption:

 ipaddress_e1000g0:1: 10.6.48.226
 sshrsakey: AAAAB3NzaC1yc2ExxxxxIwAAAIEA1dvidJlovk3aqsMmMmgn7d30BLne9I0wwTVlBNcM0vjISWqWQG7LVRp2cEEkfH/s0PNIj+/Mut14FWqSMxxe3sYKZNnvCJwxOsaHAqpOPCZnugsPvfVKRcFatxxxxxIIOx4aIHGcZSxetVEobErzMTfSUc0B9paXgZ+qm4YaWTU=
 operatingsystem: Solaris
 sshdsakey: AAAAB3NzaC1kc3MAAACBAP+jc9R9G2TCT4+m6LMBq/ZzKXNuMkg4Mv3KiU/Ob/SJf0Pd1OzNHi4xxWSow2sELmpicI/ywt0sCsEEdIXYNcKDe1YSkpF4H/h1qdiZfbMPIzalzqPZkHpt40rg93fpgAMY9ummM7OWRHeYdeyLxUEwwTIIza+/C6JOoT+afLmbAAAAFQDglf4ErQZF6lKed4bpOJh+OAlgFwAAAIBHRrxxxxxLh0LbSGwDSchfLF1GBLx30usAqW8PMkF3VH14V+nbqwnD/Knf3qs/Bf/xneUnIL2rgb6bryZw+FRaG8SwKlCw/Iy7AkcMshb/mW2zu+4K5/B2Z5ZAFX3WFDvHeJanxug76UwxQZzzDYx1sopoEV4rsXFJpoSd5mWF7AAAAIEA/S7BebFGrRMy8zzIISGcF8wBiSJI/5Xln3lDPgClVDtPwawb1UUNi15NM975/u5BdGFhKrVtpnZQDgaqQxmULQ+m7mufP2p+3XHqcZH3uDXbz92cMWi/Udww/SJJc/cqyGSJsEjNeVUTVzARqwxxxxxSfVIDw7p2U5LF/GWkOF0=
 rubyversion: 1.8.7
time: 2009-02-25 16:08:50.020999 -08:00
MmMmgn7d30BLne9I0wwTVlBNcM0vjISWqWQG7LVRp2cEEkfH/s0PNIj+/Mut14FWqSMxxe3sYKZNnvCJwxOsaHAqpOPCZnugsPvfVKRcFat0Xi4LIIOx4aIHGcZSxetVEobErzMTfSUc0B9paXgZ+qm4YaWTU=
 operatingsystem: Solaris
 sshdsakey: AAAAB3NzaC1kc3MAAACBAP+jc9R9G2TCT4+m6LMBq/ZzKXNuMkg4Mv3KiU/Ob/SJf0Pd1OzNHi4xxWSow2sELmpicI/ywt0sCsEEdIXYNcKDe1YSkpF4H/h1qdiZfbMPIzalzqPZkHpt40rg93fpgAMY9ummM7OWRHeYdeyLxUEwwTIIza+/C6JOoT+afLmbAAAAFQDglf4ErQZF6lKed4bpOJh+OAlgFwAAAIBHRrD0lI3Lh0LbxxxxxchfLF1GBLx30usAqW8PMkF3VH14V+nbqwnD/Knf3qs/Bf/xneUnIL2rgb6bryZw+FRaG8SwKlCw/Iy7AkcMshb/mW2zu+4K5/B2Z5ZAFX3WFDvHeJanxug76UwxQZzzDYx1sopoEV4rsXFJpoSd5mWF7AAAAIEA/S7BebFGrRMy8zzIISGcF8wBiSJI/5Xln3lDPgClVDtPwawb1UUNi15NM975/u5BdGFhKrVtpnZQDgaqQxmULQ+m7mufP2p+3XHqcZH3uDXbz92cMWi/Udww/SJJc/cqyGSJsEjNeVUTVzARqwJdljNSfVIDw7p2U5LF/GWkOF0=
 rubyversion: 1.8.7
time: 2009-02-25 16:08:50.020999 -08:00

Related issues

related to Puppet - Bug #1969: yaml files are looked at first before stored config entries? Rejected 02/13/2009
duplicated by Puppet - Bug #2299: Node YAML files being corrupted on puppet master server Duplicate 05/24/2009

Associated revisions

Revision 7398fa171fdd6dcaeb2d8fd1c07a23bbd78891d0
Added by Luke Kanies over 1 year ago

Partially fixing #2029 – failed caches doesn’t throw an exception

If the main terminus fails you get an exception, but not if a cache terminus fails.

Signed-off-by: Luke Kanies luke@madstop.com

History

Updated by James Turnbull over 1 year ago

  • Category set to plumbing
  • Status changed from Unreviewed to Accepted
  • Assignee set to Luke Kanies
  • Target version set to 0.24.8

Updated by Luke Kanies over 1 year ago

  • Status changed from Accepted to Needs more information

You’re sure this is in 0.24.7? We added some specific protections, including using a lock file, for writing and reading these files, so they should be safe.

Unless you think it’s not a threading issue and something else is causing the problem?

We don’t use a temp file for writing because I couldn’t find a method that would give us atomic renames while retaining the locks.
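For readers unfamiliar with the temp-file approach mentioned above, here is a minimal Ruby sketch of the general technique: write to a sibling temp file, then rename it over the target. The helper name `atomic_yaml_write` and the temp-file naming scheme are assumptions for illustration, not Puppet code. The sketch also does not solve the problem raised in this comment, which is keeping the existing lock held across the rename.

```ruby
require 'yaml'

# Hypothetical helper, not part of Puppet: replace `path` with a YAML dump
# of `data` so that readers see either the old file or the new file.
def atomic_yaml_write(path, data)
  # Keep the temp file in the same directory so the rename stays on one
  # filesystem; rename(2) is only atomic within a single filesystem.
  tmp_path = "#{path}.#{Process.pid}.tmp"
  File.open(tmp_path, 'w') do |f|
    f.write(YAML.dump(data))
    f.flush
    f.fsync # push the bytes to disk before the rename makes them visible
  end
  File.rename(tmp_path, path)
ensure
  # Clean up the temp file if the write or the rename failed.
  File.delete(tmp_path) if tmp_path && File.exist?(tmp_path)
end
```

Because the rename swaps in a new inode, a lock held on the old file does not carry over to the new one, which is consistent with the difficulty described above.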

Updated by Josh Anderson over 1 year ago

I’m also experiencing this with 0.24.7. I am, however, running multiple puppetmasterd processes on my master. Could that be the cause of the corruption?

Updated by Martin Englund over 1 year ago

I’m very sure I’m running 0.24.7 :)

I’m only running one puppetmasterd as well. Let me know if you want to add debugging code to it…

Updated by Luke Kanies over 1 year ago

Any chance it could be happening when the server is shutting down, or something similar?

What’s the frequency of the corruption?

I’m nearly positive it’s not a concurrency issue, which doesn’t leave a lot of other options.

Updated by Martin Englund over 1 year ago

It happens during normal operations. The server process has been running for weeks.

I get about 2 failures per week. I’ll start to document when they happen (and any suspicious circumstances).

Updated by Luke Kanies over 1 year ago

  • Subject changed from node yaml file can get corrputed to node yaml file can get corrupted

Updated by Martin Englund over 1 year ago

Got another node yaml file corrupted today.

Updated by Luke Kanies over 1 year ago

I’m quite stumped on this one. The only way I can see for the files to get corrupt is, maybe, if the server is getting stopped while writing to a file. It doesn’t look like that’s what’s happening, though.

Anyone else have any ideas?

Updated by Luke Kanies over 1 year ago

I’m bumping this unless we can actually figure out what the problem is.

Updated by Luke Kanies over 1 year ago

  • Target version changed from 0.24.8 to 2.6.0

Updated by Martin Englund over 1 year ago

I’m now getting about 3 corrupted files per day!

Updated by David Escala over 1 year ago

This node/yaml corruption has hit us too.

I don’t know why the yaml file gets corrupted, but a corrupted cache entry should be expired and fetched from its origin again. An invalid cache entry should not stop a node from getting its catalog.
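The behaviour described here can be sketched roughly as follows. This is only an illustration in plain Ruby, not Puppet's indirector code and not the patch linked below: `find_with_cache` is a hypothetical name, and the block stands in for whatever the real origin lookup would be.

```ruby
require 'yaml'

# Hypothetical lookup, not Puppet's API: return the cached value when the
# cache file parses cleanly; otherwise discard it and ask the origin.
def find_with_cache(cache_path)
  if File.exist?(cache_path)
    begin
      cached = YAML.load(File.read(cache_path))
      return cached if cached.is_a?(Hash)
    rescue StandardError => e
      # A parse failure is logged but must not propagate to the caller.
      warn "Discarding unreadable cache #{cache_path}: #{e.message}"
    end
    # Expire the bad entry so the next request does not trip over it again.
    File.delete(cache_path) if File.exist?(cache_path)
  end

  fresh = yield # the block stands in for the origin lookup
  File.write(cache_path, YAML.dump(fresh))
  fresh
end
```

A caller would pass the origin lookup as a block, for example `find_with_cache(path) { load_node_from_origin }`, where `load_node_from_origin` is again an assumed name.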

Updated by David Escala over 1 year ago

Patch here: http://github.com/descala/puppet/commit/cf6febe82221a99317a186dc34aa84996ffb381f

Updated by Luke Kanies over 1 year ago

  • Target version changed from 2.6.0 to 0.25.0

We’ll at least get the fix (or some kind of fix) into 0.25.

Updated by Martin Englund over 1 year ago

That did wonders for me :)

With this fix in place I’m fine with closing this bug…

Updated by Luke Kanies over 1 year ago

  • Status changed from Needs more information to Ready for Checkin

I applied a form of the patch in the tickets/master/2029 branch in my repo.

Note that this doesn’t fix the corruption, just the cache failure propagating.

Updated by James Turnbull over 1 year ago

  • Status changed from Ready for Checkin to Closed

Pushed in commit 7398fa171fdd6dcaeb2d8fd1c07a23bbd78891d0 in branch master.
