Simplifying What You're Debugging
It's time to say something pretty simple: you are using a debugger because your program isn't operating correctly, and the way you think you're going to solve the problem (unless it is a &%$# operating system problem, which, of course, it usually is) is by stopping your program's threads, examining the values assigned to variables, and stepping your program so you can see what is happening as it executes.
Unfortunately, your multiprocess, multithreaded program and the computers upon which it is executing have lots of things executing that you want TotalView to ignore. For example, you don't want to be examining manager and service threads created by the operating system, your programming environment, and your program.
Also, most of us are incapable of understanding exactly how a program is acting when perhaps thousands of processes are executing asynchronously. Fortunately, there are only a few problems that require full asynchronous behavior.
One of the first simplifications you can make is to reduce the number of processes. For example, suppose you have a buggy MPI program running on 100 processors. Your first step might be to have it execute in a 4-processor environment.
After you get the program running under TotalView's control, you will want to run the process being debugged to an action point so that you can inspect the program's state at that place. In many cases, because your program has places where processes are forced to wait for an interaction with other processes, you can ignore what those other processes are doing.
Note: TotalView lets you control as many groups, processes, or threads as you need. While each can be controlled individually, you will probably have problems remembering what you're doing if you're controlling large numbers of them. TotalView creates and manages groups so that you can focus on portions of your program.
In most cases, you do not need to interact with everything that is executing. Instead, you want to focus on one process and the data that this process is manipulating. Things get complicated when the process being investigated is using data created by other processes, and those processes may in turn depend on still other processes.
All this means that there is a rather typical pattern to the way you use TotalView to locate problems:
- At some point, you should make sure that the groups you are manipulating do not contain service or manager threads. (You can remove processes and threads from a group with the dgroups -remove command.)
- Place an action point within a process or thread and begin investigating the problem. In many cases, you are setting an action point at a place where you hope the program is still executing correctly. Because you are debugging a multiprocess, multithreaded program, you want to set a barrier point so that all threads and processes are at the same place.
- After execution stops at the barrier point, look at the contents of your variables. Verify that your program state is actually correct.
- Begin stepping your program through its code. In most cases, step your program synchronously or set barriers so that everything isn't running freely.
- Here's where things begin to get complicated. You've been focusing on one process or thread. If another process or thread is modifying the data and you become convinced that this is the problem, you'll want to go off to it and see what is going on.
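In the CLI, the pattern above might look something like the following. This is only a sketch under stated assumptions: line 123, the variable residual, and thread 2.1 are hypothetical stand-ins for whatever your own program contains, and the exact behavior of group-width stepping should be checked against the reference documentation for your TotalView version.

```
# Sketch only: line 123, residual, and thread 2.1 are hypothetical.
dbarrier 123                  ;# hold each process at line 123 until all have arrived
dfocus g dgo                  ;# run the control group up to the barrier
dprint residual               ;# check that the state is still what you expect
dfocus g dnext                ;# step with group width so nothing is left running freely
dfocus t2.1 dprint residual   ;# look at the same data from another thread
```

Notice that only the last command looks outside the process being investigated. Everything else stays at one place in the program, which is the point of keeping the focus narrow.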
The trick here, and it really isn't much of a trick, is keeping your focus narrow, so that you're just investigating a limited number of behaviors. This is where debugging becomes an art. A multiprocess, multithreaded program can be doing a great number of things. Understanding where to look when problems occur is the "art".
For example, you'll most often want to execute commands at the default focus. Only when you think that the problem is occurring in another process will you change to that process. You'll still be executing at the default focus, but this time the default focus is set to this other process.
When you do want to do something using another focus, you will probably do one of the following:
- Modify the focus so that it affects just the next command. For example, here's the command that steps thread 7 in process 3:
dfocus t3.7 dstep
(In this example, the dfocus command tells TotalView to apply the new focus only to the command that immediately follows and then, after that command completes, to restore the old focus.)
- Use the dfocus command to change focus temporarily, execute a few commands, and then return to the original focus.
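The second approach might look like the following sketch. It assumes that the original focus was d1.< (the first thread of process 1, which is the usual starting point) and that the process you want to visit is process 3. The variable name counter is hypothetical.

```
# Sketch only: process 3 and the variable counter are hypothetical.
dfocus p3        ;# no command follows, so process 3 becomes the default focus
dstatus          ;# these commands now apply to process 3
dprint counter
dnext
dfocus d1.<      ;# return to the original focus (assumed here to be d1.<)
```

Because dfocus is entered here without a trailing command, the new focus stays in effect until you change it again. This differs from the single-command form shown earlier, which restores the old focus automatically.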