Bug #2294

Classes sometimes cannot be found

Added by Luke Kanies over 1 year ago. Updated 4 months ago.

Status:Closed Start:05/22/2009
Priority:Normal Due date:
Assignee:Markus Roberts % Done:

0%

Category:language
Target version:0.25.0
Affected version:0.25.0beta1 Branch:
Keywords:
Votes: 3

Description

This kind of error occurs multiple times over the course of a week at a client of mine:

puppetmasterd err puppetmasterd[11913]: Could not find class allusers::charles in namespaces allusers::sss at /etc/puppet/modules/allusers/manifests/sss.pp:15 on node d3i002.view.domain.com

I believe it is the result of some kind of race condition or timeout, as I’ve been able to force it to happen under load, but only then.

race_patch (5.5 KB) Markus Roberts, 07/24/2009 11:53 pm


Related issues

related to Puppet - Refactor #2387: The Parser should use a separate class to manage loaded code Closed 07/02/2009

Associated revisions

Revision 11c0fb77230ea5ba28bfe86a1c2a1469095b6c70
Added by Markus Roberts about 1 year ago

Fixed #2294 – Classes sometimes cannot be found

This patch should fix the race condition causing ticket 2294; it extends the loading logic so that:

* initial load attempts are processed (as before),
* recursive load attempts return immediately (as before),
* but subsequent concurrent load attempts from different threads
  wait on a semaphore (condition variable) and then retry (e.g.
  use the now-valid results of the first thread).

This is a slight modification of the solution I’d originally proposed, to prevent a deadlock that could have arisen if three or more threads simultaneously attempted to load the same item. Though it solves the bug as reported, it has room for improvement:

* Failures aren't cached, so repeated attempts will be made to
  import invalid items each time they are encountered
* It doesn't address any of the underlying referential ambiguity
  (module vs. filename)
* The threading logic should probably be refactored into a separate
  class (as a start I encapsulated it in an ad hoc singleton class,
  so at least it isn't cluttering up the load method)

Signed-off-by: Markus Roberts Markus@reality.com

History

Updated by Brice Figureau about 1 year ago

It’s not the root cause, but might participate to the mess: in parser_support.rb, method load:

        mod = filename.scan(/^[\w-]+/).shift
        unless @loaded.include?(mod)
            @loaded << mod
            begin
                import(mod)
                Puppet.info "Autoloaded module %s" % mod
            rescue Puppet::ImportError => detail
                # We couldn't load the module
            end
        end

Notice that we add to @loaded the module (and later the filename), even though we could have failed importing the module. The net result is that we won’t ever try again to reload this, except if the parser is recreated.

Updated by Luke Kanies about 1 year ago

I’m also working on extracting out all of the @astset stuff into a separate class with accessors that can be controlled through a mutex, so we can see if that’s the problem.

I agree, though, that the loading (which is some of the oldest code in Puppet) is probably doing stupid things. I’ll fix the code you pointed out, too, and at least fix that bit.

Updated by Luke Kanies about 1 year ago

I don’t know if it will help us fix the problem, but I extracted the code related to managing the hashes of classes and definitions into a new class, LoadedCode.

It’s in the tickets/master/2294 branch in my repo.

I haven’t yet tried to put mutexes in the places that might protect us from a race condition.

Updated by Brice Figureau about 1 year ago

Luke Kanies wrote:

I don’t know if it will help us fix the problem, but I extracted the code related to managing the hashes of classes and definitions into a new class, LoadedCode.

It’s in the tickets/master/2294 branch in my repo.

I haven’t yet tried to put mutexes in the places that might protect us from a race condition.

Before you add the mutexes, check that the bug is still present :–) Next, add instrumentation/thread specific debugging. This way we’ll gather more valuable information (I’m OK to analyse/trace pile of logs).

Note: the branch doesn’t exists in your github repo. I think you forgot to push it :–)

Updated by Luke Kanies about 1 year ago

Seems to be there for me:

http://github.com/lak/puppet/tree/tickets/master/2294

Updated by Joe McDonagh about 1 year ago

  • Affected version changed from 0.24.8 to 0.25.0beta1

Hey guys, I got this problem last night when trying out .25beta for the first time. The message was almost the same; I had a class basenode_tmpl including another class phishnet::aptrepo, which was causing this namespace error. When I moved the include for phishnet::aptrepo into the node definition, it worked. Then when I moved it back to the basenode_tmpl class, it still worked. I was using puppetd -t so it shouldn’t have been caching. I can give you more details and logs if you want when I get home tonight.

Updated by Luke Kanies about 1 year ago

This is affecting multiple people, but it’s clearly a horrible race condition and is thus very difficult to track down, and it’s relatively easy to work around.

I’ll move the existing code I have, which just refactors a lot of this and might actually fix it, into a separate ticket and leave this as normal priority.

Updated by Joe McDonagh about 1 year ago

Hey Luke, I managed to achieve this today when I simultaneously restarted puppetd across 30 some odd nodes. Random classes not being able to be found, and I’m not sure if this might help but this also happened and it’s new to me:

Jul 14 11:18:40 puppet puppetmasterd[25279]: ‘newexec’ method already exists; skipping Jul 14 11:18:45 puppet puppetmasterd[25279]: ‘newhost’ method already exists; skipping Jul 14 11:18:50 puppet puppetmasterd[25307]: ‘newcron’ method already exists; skipping

This was on my prod .24.8 setup at work.

Updated by Markus Roberts about 1 year ago

  • Assignee changed from Luke Kanies to Markus Roberts

I’ve managed to confirm that this is a race condition with a test rig that forces ordering. I have a fix in the works.

Updated by Markus Roberts about 1 year ago

The attached patch should fix the race condition; it extends the loading logic so that: * initial load attempts are processed (as before), * recursive load attempts return immediately (as before), * but subsequent concurrent load attempts from different threads wait on a semaphore (condition variable) and then retry (e.g. use the now-valid results of the first thread).

This is a slight modification of the solution I’d originally proposed, to prevent a deadlock that could have arisen if three or more threads simultaneously attempted to load the same item.

Though it solves the bug as reported, it has room for improvement: * Failures aren’t cached, so repeated attempts will be made to import invalid items each time they are encountered * It doesn’t address any of the underlying referential ambiguity (module vs. filename) * The threading logic should probably be refactored into a separate class (as a start I encapsulated it in an ad hoc singleton class, so at least it isn’t cluttering up the load method)

Updated by Luke Kanies about 1 year ago

  • Status changed from Accepted to Ready for Testing

Updated by Brice Figureau about 1 year ago

Markus Roberts wrote:

I’ve managed to confirm that this is a race condition with a test rig that forces ordering. I have a fix in the works.

Can you explain (for me and the records) how/where/when/why the race triggers? Can you post the patch (if you didn’t already) to the puppet-dev list (it’d be easier to read)?

Updated by Markus Roberts about 1 year ago

In the old code we had this possibility:

  • Thread one needs a module
  • It looks and sees that it hasn’t been loaded yet
  • Because the module it might refer to itself (and we don’t want to recuse forever) the routine marks it as loaded before it starts loading it.
  • It starts loading the module
  • Before the first thread is finished, another thread needs the module
  • It looks and sees (erroneously) that the module has already been loaded
  • Hilarity ensues

The fix was to retain the old behavior for the loading thread but to suspend any additional threads looking for the same module until the first has completed. This has close to zero performance impact (and is far faster than, say, making each thread load their own copy), and resolves the race condition.

Branch up at http://github.com/MarkusQ/puppet/tree/ticket/master/2294

Updated by Luke Kanies about 1 year ago

  • Status changed from Ready for Testing to Ready for Checkin
  • Assignee changed from Markus Roberts to James Turnbull

Updated by James Turnbull about 1 year ago

  • Assignee changed from James Turnbull to Markus Roberts

Markus – could I get you to please rebase this against HEAD. Thanks!

Updated by James Turnbull about 1 year ago

  • Status changed from Ready for Checkin to Closed

Pushed in commit:“11c0fb77230ea5ba28bfe86a1c2a1469095b6c70” in branch master.

Also available in: Atom PDF