Performance is the primary design objective of the Parallel ESSL subroutines. To achieve it, the subroutines use state-of-the-art algorithms tailored to specific operational characteristics of the hardware. In addition, Parallel ESSL leverages the high performance that ESSL for AIX provides for processor computations.
The Parallel ESSL library you may use depends on the node type and the MPI library, as shown in Table 29.
Table 29. Parallel ESSL Libraries used with the MPI Libraries
Node Type | MPI Library | Parallel ESSL Library |
---|---|---|
SMP(UMSMUS) or Serial | MPI Signal-Handling Library | Parallel ESSL Serial Library |
SMP(UMSMUS) or Serial | MPI Threads Library | Parallel ESSL SMP Library(IYCTSM) |
Notes:
The following techniques are used by most subroutines to optimize performance:
The following items also impact performance. They generally depend on the specific parallel routine being called. See the subroutine description in the reference section for any exceptions to these rules.
The choice of the number of processors depends primarily on the problem size. It is reasonable to increase the number of processors if the global problem size increases enough to keep the amount of local data per process at a reasonable size. If, however, using more processes, such as 17 rather than 16, forces you to use a one-dimensional grid rather than a two-dimensional grid (17 is prime, so it factors only as 1 by 17), performance may be degraded. See the next item.
For most subroutines, a two-dimensional process grid that is square, or as close to square as possible, is suggested. For example, if you use 16 processors, define a 4 by 4 process grid. For exceptions to this rule, see the subroutine descriptions in the reference section.
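The grid-shape guidance above can be sketched as a small helper. This is only an illustration, not part of Parallel ESSL: the function name `near_square_grid` is hypothetical, and it assumes that "as close to square as possible" means the factor pair of the process count whose two factors are nearest each other.

```python
import math


def near_square_grid(nprocs: int) -> tuple[int, int]:
    """Return (rows, cols) with rows * cols == nprocs and rows <= cols,
    choosing the factor pair that is as close to square as possible.

    Hypothetical helper for illustration; not a Parallel ESSL routine.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    # Start at floor(sqrt(nprocs)) and walk down to the nearest divisor.
    rows = math.isqrt(nprocs)
    while nprocs % rows != 0:
        rows -= 1
    return rows, nprocs // rows


if __name__ == "__main__":
    for n in (16, 17, 12):
        print(n, near_square_grid(n))
```

For 16 processes the helper returns a 4 by 4 grid, while for 17 processes the only factorization is 1 by 17, which is the one-dimensional case the text warns about.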
See the following table for suggested block sizes. The optimal block size depends on the underlying node computations, load balancing, communications, system buffering requirements, problem size, and the dimension and shape of the process grid. Achieving optimal performance generally requires experimentation, but the values specified in Table 30 should provide good performance in most cases. For exceptions to these rules, see the subroutine descriptions in the reference section.
Table 30. Suggested Block Sizes
Area | Serial | SMP |
---|---|---|
Level 2 PBLAS | 24 | 24 (All subroutines, except PDTRSV and PZTRSV) |
Level 3 PBLAS | 50-100 (Real subroutines); 30-50 (Complex subroutines) | 100-200 (Real subroutines); 50-100 (Complex subroutines) |
Dense Linear Algebraic Equations, except PDGEQRF and PZGEQRF | 50-100 (Real subroutines); 30-50 (Complex subroutines) | 100-200 (Real subroutines); 50-100 (Complex subroutines) |
Eigensystems Analysis and Singular Value Analysis, PDGEQRF, and PZGEQRF | 24 | 24 |
Random Number Generation | Data cache size / 2 | Data cache size / 2 |
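As a starting point for the experimentation mentioned above, the Table 30 values can be encoded in a small lookup. This is a minimal sketch: the function name `suggested_block_size` and the area keys are hypothetical, the values are transcribed from Table 30 only, and two cases are deliberately left out because the table does not give a number for them here (the SMP values for PDTRSV and PZTRSV, and the Random Number Generation entry, whose unit for "data cache size" is not stated).

```python
def suggested_block_size(area: str, library: str, complex_data: bool = False) -> tuple[int, int]:
    """Return the (low, high) suggested block size range from Table 30.

    area:    "level2_pblas", "level3_pblas", "dense_linear", or "eigen_svd_qr"
             (hypothetical keys; "eigen_svd_qr" also covers PDGEQRF and PZGEQRF)
    library: "serial" or "smp"
    Not covered: PDTRSV/PZTRSV with the SMP library, Random Number Generation.
    """
    if library not in ("serial", "smp"):
        raise ValueError(f"unknown library: {library!r}")
    if area in ("level2_pblas", "eigen_svd_qr"):
        # Table 30 gives a single value, 24, for both libraries.
        return (24, 24)
    if area in ("level3_pblas", "dense_linear"):
        if library == "serial":
            return (30, 50) if complex_data else (50, 100)
        return (50, 100) if complex_data else (100, 200)
    raise ValueError(f"unknown area: {area!r}")


if __name__ == "__main__":
    low, high = suggested_block_size("level3_pblas", "smp")
    print(f"Try block sizes between {low} and {high}")
```

The returned range is only a place to begin a sweep; the subroutine descriptions in the reference section take precedence where they list exceptions.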