General Parallel Debugging Tips

Here are some tips that are useful for debugging most parallel programs:

  • Breakpoint behavior

    When you are debugging message-passing and other multiprocess programs, it is usually easier to understand the program's behavior if you change the default stopping action of breakpoints and barrier breakpoints. By default, when one process in a multiprocess program hits a breakpoint, TotalView will stop all the other processes.

    To change the default stopping action of breakpoints and barrier breakpoints, you can set TotalView preferences. Information on these preferences can be found in the online help.

    A second method is to specify the -no_stop_all and -no_barr_stop_all TotalView command-line options.

    These settings control breakpoint and barrier breakpoint behavior. They tell TotalView whether other processes and threads should continue to run when a process or thread hits the breakpoint.

    These options affect only the default behavior. As usual, you can choose a behavior for an individual breakpoint by setting the breakpoint properties in the File > Preferences Action Points Pane. See Breakpoints for Multiple Processes.

  • Process synchronization

    TotalView has two features that make it easier to get all of the processes in a multiprocess program synchronized and executing the same line of code.

    Process barrier breakpoints and the process hold/release features work together to help you control the execution of your processes. See Barrier Breakpoints.

    The Process Window's Group > Run To command is a special kind of stepping command. It allows you to run a group of processes to a selected source line or instruction. See Group-Width Stepping.
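
    The idea behind a barrier can be sketched outside the debugger. The following is a minimal illustration, not TotalView code: it assumes Python threads as stand-ins for processes and uses threading.Barrier to hold each arrival until the whole group has reached the same point. NUM_WORKERS, worker, arrived, and released_after are hypothetical names chosen for this sketch.

```python
import threading

NUM_WORKERS = 4  # hypothetical group size, chosen for illustration
barrier = threading.Barrier(NUM_WORKERS)
lock = threading.Lock()
arrived = []         # ranks that have reached the barrier line
released_after = []  # arrivals each worker observes once it is released

def worker(rank):
    with lock:
        arrived.append(rank)
    # Each worker is held here until all NUM_WORKERS workers have arrived.
    barrier.wait()
    with lock:
        released_after.append(len(arrived))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(released_after)  # prints [4, 4, 4, 4]
```

    No worker gets past the barrier until every worker has reached it, so each one sees the full group when it is released. This is the kind of synchronization that barrier breakpoints are meant to give you across the processes in a program.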

  • Using group commands

    Group commands are often more useful than process commands.

    For example, use the Group > Go command to restart the whole application instead of the Process > Go command, and use the Group > Halt command instead of Process > Halt.

    The group-level single-stepping commands such as Group > Step and Group > Next allow you to single-step a group of processes in parallel. See Group-Width Stepping.

  • Process-level stepping

    If you use a process-level single-stepping command in a multiprocess program, TotalView may appear to hang (it continuously displays the watch cursor). This happens when you single-step a process over a statement that cannot complete unless another process runs, and that other process is stopped. It can occur, for example, when you try to single-step a process over a communication operation that cannot complete without the participation of another process. When this happens, you can abort the single-step operation by selecting Cancel in the Waiting for Command to Complete window that appears. As an alternative, consider using a group-level single-step command instead.

    Note:   Etnus receives many bug reports about processes being hung. In almost all cases, the reason is that one process is waiting for another. Using the Group debugging commands almost always solves this problem.
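
    The situation can be reproduced in miniature without a debugger. The following is a rough sketch, not TotalView or MPI code: it assumes a Python queue as a stand-in for a message channel and a thread as a stand-in for the partner process. step_over_receive and its timeout are hypothetical, added only so that the example terminates instead of waiting forever.

```python
import queue
import threading

def step_over_receive(channel, timeout_s):
    """Stand-in for stepping one process over a blocking receive.

    Returns the message, or None if no partner delivered one in time.
    """
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        return None

# Case 1: the partner is stopped, so nothing is ever sent and the
# receive cannot complete (here it gives up after the timeout).
channel = queue.Queue()
stalled = step_over_receive(channel, timeout_s=0.2)
print("partner stopped:", stalled)    # prints: partner stopped: None

# Case 2: the partner is allowed to run alongside the receiver,
# which is what a group-level command permits.
partner = threading.Thread(target=channel.put, args=("hello",))
partner.start()
completed = step_over_receive(channel, timeout_s=5.0)
partner.join()
print("partner running:", completed)  # prints: partner running: hello
```

    In the first case the receiver is not broken; it is simply waiting for a partner that is not running. Letting the partner run, as a group-level command does, allows the operation to complete.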

  • Determining which processes and threads are executing

    The TotalView Root Window helps you determine where various processes and threads are executing. When you select a line of code in the Process Window, the Root Window Attached Page is updated to show which processes and threads are executing that line. See Displaying Thread and Process Locations.

  • Viewing variable values

    You can view (laminate) the value of a variable that is replicated across multiple processes or multiple threads in a single Variable Window. See Displaying a Variable in All Processes or Threads.

  • Restarting from within TotalView

    You can restart a parallel program at any time. If your program runs too far, you can kill the program by selecting the Group > Delete command. This command kills the master process and all the slave processes. Restarting the master process (for example, mpirun or poe) recreates all of the slave processes. Startup is faster when you do this because TotalView does not need to reread the symbol tables or restart its server processes since they are already running.

support@etnus.com
Copyright © 2001, Etnus, LLC. All rights reserved.
Version 5.0