Parallel Engineering and Scientific Subroutine Library for AIX Version 2 Release 3: Guide and Reference

Coding Tips for Optimizing Parallel Performance

Performance has been the primary objective in the design of the Parallel ESSL subroutines. To achieve this goal, the subroutines use state-of-the-art algorithms tailored to specific operational characteristics of the hardware. In addition, Parallel ESSL leverages the high performance provided by ESSL for AIX for processor computations.

Choosing a Parallel ESSL Library

The Parallel ESSL library you may use depends on:

Your choice of MPI library.
- If you are using LAPI or 64-bit-environment application programs, you can use only the MPI threads library.
The type of nodes you are running on.

Table 29. Parallel ESSL Libraries used with the MPI Libraries

Node Type	MPI Library	Parallel ESSL Library
SMP (1) or Serial	MPI Signal-Handling Library	Parallel ESSL Serial Library
SMP (1) or Serial	MPI Threads Library	Parallel ESSL SMP Library (2)
Notes:
(1) Users may specify multiple user-space message-passing tasks per SMP node. For example, you could specify the number of user-space tasks per adapter equal to the number of CPUs in your SMP node.
(2) If you choose to spawn multiple user-space tasks per SMP node, consider explicitly setting the number of threads used by the Parallel ESSL SMP Library per message-passing task by setting the environment variable XLSMPOPTS or OMP_NUM_THREADS. For further details, see the XLF or C for AIX manuals.
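
The second note can be illustrated with a small sketch. It assumes, purely for illustration, that the CPUs of a node are shared evenly among the message-passing tasks on that node; the source does not prescribe this policy, and the function names threads_per_task and smp_thread_env are hypothetical. Only OMP_NUM_THREADS, which the note names, is set here.

```python
def threads_per_task(cpus_per_node: int, tasks_per_node: int) -> int:
    """Return a thread count per message-passing task.

    Assumption (not from the manual): CPUs are divided evenly among
    the tasks on a node, with at least one thread per task.
    """
    if cpus_per_node < 1 or tasks_per_node < 1:
        raise ValueError("cpus_per_node and tasks_per_node must be >= 1")
    return max(1, cpus_per_node // tasks_per_node)


def smp_thread_env(cpus_per_node: int, tasks_per_node: int) -> dict:
    """Build the environment setting a job script could export."""
    n = threads_per_task(cpus_per_node, tasks_per_node)
    return {"OMP_NUM_THREADS": str(n)}


if __name__ == "__main__":
    # Hypothetical node: 8 CPUs running 2 message-passing tasks.
    for name, value in smp_thread_env(8, 2).items():
        print(f"export {name}={value}")
```

With one task per CPU the sketch yields a single thread per task, which avoids oversubscribing the node; whether that is the best setting for a given subroutine still needs to be measured.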

Parallel ESSL Techniques

The following techniques are used by most subroutines to optimize performance:

Minimizing the impact of communications by exchanging larger blocks of data
Blocking data to match the processor cache size

The following items also impact performance. They generally depend on the specific parallel routine being called. See the subroutine description in the reference section for any exceptions to these rules.

Number and types of processors (such as POWER Thin, POWER2 Thin, POWER3 Wide, and POWER4)
Choosing the number of processors depends primarily on the problem size. It is reasonable to increase the number of processors if the global problem size increases enough to keep the amount of local data per process at a reasonable size. If, however, using more processes, such as 17 rather than 16, forces you to use a one-dimensional grid rather than a two-dimensional grid, performance may be degraded. See the next item.
Shape of process grid
For most subroutines, a two-dimensional grid (square, or as close to square as possible) is suggested. For example, if you use sixteen processors, define a 4 by 4 process grid. For exceptions to this rule, see the subroutine descriptions in the reference section.
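
The grid-shape guidance can be sketched as a small helper that picks the most nearly square factorization of a process count. This is an illustrative sketch, not part of Parallel ESSL; the name near_square_grid is hypothetical.

```python
import math
from typing import Tuple


def near_square_grid(nprocs: int) -> Tuple[int, int]:
    """Return (rows, cols) with rows * cols == nprocs and rows <= cols,
    choosing the largest rows that does not exceed sqrt(nprocs)."""
    if nprocs < 1:
        raise ValueError("nprocs must be >= 1")
    rows = math.isqrt(nprocs)
    # Step down until rows divides nprocs; rows == 1 always divides.
    while nprocs % rows != 0:
        rows -= 1
    return rows, nprocs // rows


if __name__ == "__main__":
    for p in (12, 16, 17):
        print(p, near_square_grid(p))
```

For 16 processes the helper returns a 4 by 4 grid. For 17, a prime, the only factorization is 1 by 17, which is the one-dimensional case the text warns about; using 16 of the 17 processes may be the better choice there.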

Block size(s)

See the following table for suggested block sizes. The optimal block size depends on the underlying node computations, load balancing, communications, system buffering requirements, problem size, and the dimension and shape of the process grid. Achieving optimal performance generally requires experimentation, but the values specified in Table 30 should provide good performance in most cases. For exceptions to these rules, see the subroutine descriptions in the reference section.

Table 30. Suggested Block Sizes

Area	Parallel ESSL Serial Library	Parallel ESSL SMP Library
Level 2 PBLAS	24	24 (all subroutines except PDTRSV and PZTRSV); 48 (PDTRSV and PZTRSV)
Level 3 PBLAS	50-100 (real subroutines); 30-50 (complex subroutines)	100-200 (real subroutines); 50-100 (complex subroutines)
Dense Linear Algebraic Equations, except PDGEQRF and PZGEQRF	50-100 (real subroutines); 30-50 (complex subroutines)	100-200 (real subroutines); 50-100 (complex subroutines)
Eigensystems Analysis and Singular Value Analysis, PDGEQRF, and PZGEQRF	24	24
Random Number Generation	Data cache size / 2	Data cache size / 2
Note: The data cache size can be obtained with the code fragment shown in PDURNG--Uniform Random Number Generator, under Notes and Coding Rules.
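
To see how block size interacts with load balancing, the following sketch counts how many rows (or columns) of a block-cyclically distributed dimension land on each process. It re-implements in Python the arithmetic of the ScaLAPACK NUMROC tool function, under the assumption that the first block is held by process 0; the name local_count and the sample sizes are illustrative only.

```python
def local_count(n: int, nb: int, iproc: int, nprocs: int) -> int:
    """Rows (or columns) owned by process iproc when n items are dealt
    out in blocks of nb, round-robin, starting at process 0."""
    if n < 0 or nb < 1 or nprocs < 1 or not 0 <= iproc < nprocs:
        raise ValueError("invalid distribution parameters")
    full_blocks = n // nb
    count = (full_blocks // nprocs) * nb   # complete rounds of blocks
    extra = full_blocks % nprocs           # leftover full blocks
    if iproc < extra:
        count += nb                        # one more full block
    elif iproc == extra:
        count += n % nb                    # the trailing partial block
    return count


if __name__ == "__main__":
    n, nprocs = 1000, 4
    for nb in (24, 100):
        counts = [local_count(n, nb, p, nprocs) for p in range(nprocs)]
        print(f"nb={nb}: {counts}")
```

In this example a block size of 100 leaves two processes with 300 rows and two with 200, while a block size of 24 narrows the spread to between 240 and 264. Smaller blocks balance the load better but give the node computations less to work on per block, which is one reason the suggested values in Table 30 are a starting point for experimentation.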

If you are using the Parallel ESSL SMP Library, your performance may be improved by setting the following environment variables:

export MALLOCMULTIHEAP=true
export XLSMPOPTS="spins=0:yields=0"

Note: For details, see the XLF, C, or AIX manuals.
If you are using the MPI threads library and your program has a single message-passing thread, specify MP_SINGLE_THREAD=yes to minimize thread overhead.
If you are using multiple message-passing tasks per node, specify MP_SHARED_MEMORY=yes so that shared memory (instead of IP or the SP Switch) is used for message passing between tasks running on the same node.
You should be able to improve the performance of production-level code by using the PESSL_ERROR_SYNC environment variable to disable error synchronization. For details, see PESSL_ERROR_SYNC Environment Variable.
