Last time I talked about dealing with non-deterministic bugs plus corruption by either reducing non-determinism or by turning corruptions into failures. This time I’ll talk about the machinery that allows Jinx to amplify repro rates for bugs, both crashing and correctness.
First off, how does Jinx find crashing bugs? It runs simulations of system state in parallel, and chooses interesting ones to “retire.” The following diagram illustrates this process.

At time A, Jinx takes a checkpoint, and conducts simulations until finding a bug in a simulation ending at point B. It chooses this simulation to retire into reality. This process is completed at C, and normal execution resumes.
After taking a checkpoint at A, Jinx starts by conducting a first exploratory simulation, and based on the result of this preliminary simulation conducts zero or more follow-on simulations, scoring these scenarios as they complete. Our life is straightforward in the case where we find a crash, as in B. We assign a high score to the crashing simulation, and later on we choose this simulation to retire (our jargon for “commit” or “turn into reality”). Retirement completes at C. Typically we only retire bugs that don’t happen in the kernel or device drivers… most developers don’t want their boxes bluescreened!
So you run your program once, and Jinx runs parts of it tens or hundreds of times behind the scenes, choosing for retirement the simulations that are most likely to result in bugs. The performance impact is mitigated by running the simulations in parallel across all the tens (soon hundreds) of mostly idle cores in that teraflop multi-core machine you just bought.
We have a pretty elaborate scoring mechanism to encourage the retirement of bugs (when we retire a bug, we make it really happen, which is what we want). At the machine layer, INT3, #DB, #BP, #PF, and explicit hypercalls like jinx_assert() can all suggest to Jinx that a simulation is a good thing to replay. We don’t get this data for correctness bugs, where the bug isn’t caught during execution, but pollutes the output with incorrect data. So, which simulation to make into reality is not so obvious. However, Jinx is great at making correctness bugs happen more often. Why?
The first thing to understand is the scales of time that we’re talking about. If we were to divide time into seconds, then in some one-second intervals both sides of a bug would occur… for concurrency errors, you have to have two parties involved, by definition. So, for example, not only does someone update a structure non-atomically, but someone else chances to read it within the same second. For structures updated seldom, only in a fraction of the one-second intervals will the potential bug occur. For the sake of example, let’s say a bug’s two parties occur in the same second, in one of every ten seconds.

During each one second interval, either one or both parties to a bug may run. The two parties occur only in second 6 in this example.
Now, within the second’s duration, let’s say we’ve committed the most original of atomicity violation sins: the non-atomic update of radius and area for the canonical circle.
float myradius = 1;
lock();
circle->radius = myradius;
unlock();
lock();
circle->area = PI * myradius * myradius;
unlock();
In diagram form, with another thread potentially observing the state of the circle.

The corrupter thread introduces an atomicity violation by updating radius and area in separate transactions. The observer thread is likely to observe the inconsistency if it attempts to take the lock anywhere during the race window.
In order for the second thread to observe the problem, it has to begin its attempt to lock the structure during time period A or B. As the critical section A becomes smaller, the likelihood of the second thread attempting to lock during that period approaches zero. The more optimized the critical section in this case, the more rare the bug becomes. There’s a lower bound on how small the critical section can become (all instructions take time, and a locked instruction takes quite a bit of time). Let’s say that the critical section takes 1000 cycles, which is pretty typical for a lock and unlock, and you’re on a 1Ghz CPU. If you know that cause and detection occur in the space of one second, you have an approximately one in one-million chance of observing the bug.

The time between the two parties in a bug may be glacial in scope relative to the size of the race window, which is vanishingly small in this one-second example.
You should expect that a one in one-million bug that could occur once every ten seconds actually does occur at a rate of about one every ten million seconds, or 2700 machine hours.
So, how does Jinx make this better? Remember when I talked about the simulation approach? Well, after every simulation is done, Jinx creates a graph that looks like this:

When Jinx considers how to reorder events, it only considers reordering A with respect to 1, 2, 3, or 4. Reordering with respect to 2 or 3 force the bug to occur.
Or, in a representation which is closer to the way that Jinx sees it:

Jinx ignores all noncommunicating instructions. What is left is the set of accesses that could have meaningful reorderings. For example, access A has meaningful reorderings with respect to 1, 2, 3, and 4.
Jinx uses this representation to decide what to change in subsequent simulations. Jinx wants to reorder memory accesses with respect for each other. For example, if Jinx decides to reorder memory access A, the only interesting points to reorder with respect to are 1, 2, 3, and 4 If we rearrange with respect to 1, then the observer runs before the corrupter, so no bug. If we rearrange with respect to 4, we get a little lock contention but no difference in behavior. But, if we rearrange with respect to 2 or 3, then we force the rare circumstance to occur. Jinx has turned a one-in-a-million event into a two-in-four event! There are so few cases, we can enumerate all of them:

Of four interleavings suggested by Jinx, two reveal bugs, as they result in accesses to radius and area between the updates to radius and area.
This is Jinx’s notion of communication logical time. Instead of rearranging events in wallclock time, we rearrange events in logical time. This cuts out the huge amount of temporal noise involved in hard-to-reproduce bugs. In this example, this transition to logical time changes a 2700 machine hour repro to a 20 second repro. Give Jinx a factor of 100 slowdown during simulation, and it’s still reproducing the bug 10,000x faster than reality would. Debugging a corruption bug that occurs in an hour on one machine is a completely different story than debugging one that takes 2700 machine hours.
So by transforming execution into logical communication time, Jinx forces even non-crashing correctness bugs to occur more frequently than they would in the wild, allowing you to find your non-deterministic bugs in development rather than test, and in test rather than deployment.
Next time I’ll talk about how Jinx makes it easy to understand a bug when it has occurred.

Trackbacks /
Pingbacks