Making sense of crashes with SmartStop

December 20th, 2009 by pete Leave a reply »

Last time I talked about how it is that Jinx can make your bugs happen faster.  That’s pretty helpful, but it isn’t the whole picture.  Some bugs are hard to reproduce and easy to understand, and some are easier to reproduce but harder to understand.  How does Jinx help with hard-to-understand bugs?

The state of the art with regard to debugging concurrency errors has improved somewhat, but for multi-threaded programs it still typically involves visiting a pile of threads to see which may be involved with the problem.  It’s a breadcrumb party… great for bonding with your fellow developers and producing war stories.  Less great for shipping your code.  Various things make the bug harder to understand in this scenario.

What I’ll call “overshoot” is primary among these.  Overshoot is a phenomenon where one thread or task in your program, corresponding to a single CPU at the point of failure, runs for a long time after the error has been detected.  The detecting thread traps into the operating system, which sends out messages to other CPUs saying “stop any thread in the following process immediately”.  A nice clean operating system might be able to deliver a stop within a few tens of microseconds, increasing as the number of processing cores in your computer increases.  Sadly, that few microseconds corresponds to a few tens of thousands of machine instructions.  That’s a lot of overshoot.  To understand how that kills your ability to find bugs, let’s look at a few examples.  We’ll start with an example where a normal debugger doesn’t do a horrible job.  It’s the example from last time, and you may remember that it corresponds to code that looks like this:

    // Thread 1 (corrupter)
    float myradius = 1;
    lock();
    circle->radius = myradius;
    unlock();
    lock();
    circle->area = PI * myradius * myradius;
    unlock();

    // Thread 2 (verifier)
    lock();
    assert(circle->radius * circle->radius * PI == circle->area);
    unlock();

An atomicity violation as seen by the machine.

An atomicity violation as seen by the machine.

Because we have now added the assert, the verifier thread now will never unlock, and consequently the corrupter will never perform the update of the area.  From the perspective of a developer, your program crashes, one thread is complaining about an inconsistent structure, and one or more threads are trying to acquire the lock on the structure.  You may notice or it may be obvious that one of these threads has just updated the structure in a non-atomic way.  Because the proximity is probably pretty good, this might not be a hard-to-understand bug.

Things get more interesting, and unpleasant, when we consider a low-level data race.  We take a similar example, but rather than showing an atomicity violation, we make it into a data race, by eliminating the locking in the corrupter altogether.

    // Thread 1 (corrupter)
    float myradius = 1;
    circle->radius = myradius;
    circle->area = PI * myradius * myradius;
Instead of an atomicity violation, a low-level data race.  The corrupter thread never takes a lock at all.

Instead of an atomicity violation, a low-level data race. The corrupter thread never takes a lock at all.

What is the developer’s experience in this case?  Well, after the assert is detected, an operating system call is made, which stops all threads currently running inside the program in question.  This is if you’re lucky.  On more primitive OSes you might have to wait until your timer tick is up.  Regardless, it’s pretty plausible that the corrupter thread in this case has done two horrible things:

  • It has advanced thousands of instructions so that it’s nowhere near the problem location.  You probably won’t even know what the culprit thread is.
  • It has repaired the state of the circle structure so that the invariant has been repaired.  This is that second STORE in the diagram above.  That’s the corrupter thread cleaning up its mess before going to hide in some other far flung piece of code.

In this case, the developer has one useful piece of information.  That the invariant has failed once.  Since we started Petra, we’ve talked to many people who have seen invariants fail and never figured out why.

One more example before we discuss solutions to this mess.  A wakeup-race is what we call an ordering violation wherein one thread is the cause of the start of activity on another thread. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics describes a common bug pattern found a few times in firefox, which involves the creation of a new thread, and the assignment of its ID to a global.

    // Thread 1
    g_newThread = CreateThread(myfunc);

    // Thread 2
    myfunc() {
        assert(g_newThread != INVALID_THREAD);
    }

Sometimes we get away with it, as below.  In the case of Firefox, this was presumably a very rare bug.

This latent ordering violation isn't revealed in this example, as the creator thread runs first.

This latent ordering violation isn't revealed in this example, as the creator thread runs first.

But, sometimes, we don’t get away with it.  As you would expect, Jinx is good at making this sort of bug happen.

The unlucky case, where the awoken thread runs first, observing inconsistent state.

The unlucky case, where the awoken thread runs first, observing inconsistent state.

The unlucky case is, again, very bad for the programmer.  After the bug is detected, there is a long window until the thread that stored the thread id late gets stopped.  During that time it can return to its event loop, enter some other unrelated code, and, in the case shown, set g_newThread to be valid, thus obfuscating the state of the program at fault time.  The basic problem here is that in overshoot siutations

  1. The threads keep running.
  2. They overwrite bits of global memory state as they go, hiding the bug or further obfuscating it.

So, for the three major classes of bugs, the experience of debugging a crash varies from bad to horrible.  How can we make this better?

The obvious answer is that we have to stop threads earlier.  One thing we could do is just sequentialize the execution, and stop all threads as soon as one of them detects a problem.  This would be an improvement, but it’s not that great.  There are a couple of reasons for this.

  1. The gap between the load of the wrong data value and the detection of the bug may be long.  Say, for example, instead of computing πr² we were computing sin(acos(log²(n)))… you’re thousands of cycles in on all threads before you know there’s a problem.
  2. The gap between the application detecting an error and Jinx picking it up may be large, also.  For example, the application may call into the OS as a part of its abort sequence.

Fortunately, there are better ways to solve this problem.  Jinx takes the (patent-pending) approach of advancing each of the threads involved in the crash the minimum computational distance forward before stopping them.  That is, we only want each thread to run far enough forward to allow the crash to happen.  One approximation of this point is the last communication point between threads.  What does it mean to compute the last communication point?  For two threads, the simple definition is that it’s the last time before the crash that a non-crashing thread wrote something that was read by the crashing thread before the crash.  An example is probably easiest.

There is no reason to allow other threads in the process to advance beyond the last time they communicated with the crashing thread.

There is no reason to allow other threads in the process to advance beyond the last time they communicated with the crashing thread.

First, we take the example of the low-level data race.  The last thing that was written by the corrupter thread that was still read by the crashing thread before the crash is radius, in the course of the “STORE radius” instruction.  Simply by terminating execution of the corrupter thread at its last communication point with the verifier thread (immediately after this instruction), we provide the following user experience:

  1. The verifier thread is stopped on the assert line, as before.
  2. The corrupter thread is stopped on the next instruction after writing circle->radius = 1;

That’s a pretty cool result.  You’re stopped at the debugger, and all of the participating threads are in a meaningful location.  What about the other really thorny case, the ordering violation?

Stopping on last communication, applied to an ordering violation.

Stopping on last communication, applied to an ordering violation.

Here, the last communication is virtual:  it’s where the one thread creates the other, which is a virtual communication.  In Jinx’s case this is trapped at the IPI (interprocessor interrupt) sent from one CPU to another.  Since no memory communications occur after that, it’s not necessary to allow the thread to run past that point.  Again, the user experience is pretty good:  one thread is stopped creating another, and that thread is referencing the illegal global.  The value of g_newThread is still INVALID_THREAD at the point of the crash.

You’ll note that it’s only possible to do any of this with Jinx’s simulation approach.  Only by running executions multiple times can you perform the kind of analysis necessary to do this sort of thing, and you can only do that with Jinx.  You might imagine we’re pretty excited about SmartStop.  It will be present in our beta in late January or early February, and we’ll release a video of it in action as a part of the beta launch.  We hope you’ll sign up for the beta and let us know what you think of this feature and the rest of Jinx!

Leave a Reply