Programmer Musings: Unintuitive Multithreading: Troubleshooting

This site will look much better in a browser that supports web standards, but is accessible to any browser or Internet device.

October 23, 2005

Unintuitive Multithreading: Troubleshooting

In the previous four articles in my Unintuitive Multithreading series, I focused mostly on design of a multithreaded application. This time, I plan to cover a few techniques that I have found helpful in troubleshooting multithreaded applications.

To some extent, the normal techniques you have used for troubleshooting will work in a multithreaded program. But, some of the problems in a multihreaded program will defy your normal tactics. A large fraction of these problems will turn out to be race conditions. By their nature, race conditions are hard to duplicate and are not easy to reason about. There are a set of models that I have found that sometimes shed light on these kinds of problems. These models are:

Lock Step
Leap Frog
Simultaneous Execution
Instruction Interruption

Each of these approaches helps you to visualize ways in which threads may interact to cause you problems.

Lock Step

The Lock Step model is fairly simple. Pick two threads that might interact and
mentally execute them one statement at a time, thread A, then thread B, then A,
then B... Some of the most classic race conditions are instantly identifiable
with this technique. For example, in trying to avoid the cost of a mutex, some programmers new to multithreading will try something like this.

  static bool locked = false;



  while(locked) ; // wait on the lock



  locked = true;

  do_critical();

  locked = false;

The lock step scenario shows the fallacy here immediately.

Thread A executes the while.
Thread B executes the while.
Thread A sets locked to true.
Thread B sets locked to true.
Thread A executes the do_critical().
Thread B executes the do_critical().
Thread A sets locked to false.
Thread B sets locked to false.

Without an understanding of multithreading, a junior programmer might convince himself that the code would work. And it will. Most of the time. Since it has to work all of the time, we have a problem.

Let's take this example a little farther. Say our junior programmer decides to get a bit more clever.



  static int locked = 0;



  while(locked++) // wait on the lock

      --locked;



  do_critical();

  locked = 0;

  non_critical();

This won't fall to the lock step model from before, but it will cause problems if we lockstep with an offset.

Thread A executes the do_critical().
Thread B executes the while(locked++).
Thread A sets locked to 0.
Thread B decrements locked to -1.
Thread A executes the non_critical().
Thread B executes the while(locked++) and continues to be locked out.

Walking through the code in lock step with a couple of threads can point out subtle timing issues in your MT code. The threads don't have to be executing the same code for this approach to work. There can be an offset as in the second example, or the threads could be executing completely separate pieces of code.

It should be obvious that if you don't do the lock step exactly correct, you may miss a race condition. This is why most race conditions succeed most of the time. The key to using this technique is to walk through the suspected code with an awareness for ways that things might go wrong. If something looks suspicious, change the offset between the threads and walk through again. You'll often find that you become more and more suspicious of a piece of code as you get closer to the right timing.

Leap Frog

The Leap Frog model is similar to lock step except that you execute more than one instruction at a time, with first one thread leading then the other leading. This approach is often useful in cases where using lock step had made you suspicious of some code but had not actually triggered a race condition.

   if(ptr)

   {

       void *tmp = ptr;

       ptr = 0;

       free( tmp );

   }

Thread A tests ptr.
Thread B tests the ptr.
Thread B saves ptr in tmp.
Thread B resets ptr to null.
Thread A saves ptr in tmp.
Thread A resets ptr to null.
Thread A executes the free() on a null pointer.
Thread B executes the free() on the real pointer.

Running this in lockstep would have spotted a potential double free. Looking at this code in leap frog mode shows a different problem. It is possible to free() a null pointer which is undefined behavior in C and C++. Depending on which symptom you are trying to match, different scenarios help you to see different things.

Simultaneous Execution

On a single CPU machine, you are guaranteed that the CPU can only be executing 1 thread at a time. Even though it appears that two or more threads are executing simultaneously, the reality is that we are executing a little of one, then a little of the next, and so on. The machine loops through the list of runnable threads often enough that they all appear to be running at once to us.

On a multiple CPU machine, this is no longer true. You can have, at most, one thread per CPU executing at the same time. In other words, on a dual CPU machine, two threads can execute the following expression at exactly the same time.

unlocked++;

This can lead to a situation even more pathological than the lock step case we explored above. This is why some "multi-threaded" code works fine until the day it is run on a multiple CPU machine for the first time. There may have been race conditions that did not manifest because no threads were actually executing at exactly the same time. Once that situation occurred, these new race conditions manifested themselves.

Instruction Interruption

Many new MT programmers make the implicit assumption that each C expression is atomic. Actually, even on a single processor machine, we can be interrupted between any two assembler instructions. Here's an example to show what I mean.

  static int initialized = 0;

  struct singleton onlyOnce;



  if(!initialized++)

  {

      initialize( &onlyOnce );

  }

The idea is that we only want to initialize the onlyOnce structure the first time through this code. The problem is that the if is much more complicated than it looks. In a kind of pseudo-assembler, the if would look something like this:

  move reg1, initialized

  add reg2, 1, reg1

  test reg1, 0

  jumpne label

This piece of code could be interrupted before it starts, after it ends, or at any of the three points between instructions. Almost every C or C++ statement transforms into a set of instructions like this. Even on a single CPU machine, that means that you can be interrupted in the middle of almost any expression. Unlike many junior programmers seem to believe, ++i is no more atomic an operation than i = i + 1. (At least, you are not guaranteed that it will be any different.)

Summary

To summarize, many of the hardest problems to troubleshoot in a multithreaded application are race conditions. Race conditions are particularly hard to find, because they involve the interaction between two threads. This article lists four different ways to approach these interactions to help you see potential race conditions.

Posted by GWade at October 23, 2005 04:21 PM. Email comments