Parallel Engineering and Scientific Subroutine Library for AIX Version 2 Release 3: Guide and Reference

PDGEMM and PZGEMM--Matrix-Matrix Product for a General Matrix, Its Transpose, or Its Conjugate Transpose

PDGEMM performs any one of the following combined matrix computations:

C <-- alphaAB+betaC

C <-- alphaAB^T+betaC

C <-- alphaA^TB+betaC

C <-- alphaA^TB^T+betaC

PZGEMM performs any one of the following combined matrix computations:

C <-- alphaAB+betaC

C <-- alphaAB^T+betaC

C <-- alphaA^TB+betaC

C <-- alphaA^TB^T+betaC

C <-- alphaA^HB+betaC

C <-- alphaA^HB^T+betaC

C <-- alphaAB^H+betaC

C <-- alphaA^TB^H+betaC

C <-- alphaA^HB^H+betaC

where, in the PDGEMM and PZGEMM formulas above:

A represents the global general submatrix:

For transa = 'N', it is A_{ia:ia+m-1, ja:ja+k-1}.
For transa = 'T' or 'C', it is A_{ia:ia+k-1, ja:ja+m-1}.

B represents the global general submatrix:

For transb = 'N', it is B_{ib:ib+k-1, jb:jb+n-1}.
For transb = 'T' or 'C', it is B_{ib:ib+n-1, jb:jb+k-1}.

C represents the global general submatrix C_{ic:ic+m-1, jc:jc+n-1}.

alpha and beta are scalars.

Note: No data should be moved to form A^T, A^H, B^T, or B^H; that is, the A and B matrices should always be stored in their untransposed forms.
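The effect of this note can be illustrated with a small serial model. The following Python sketch is not Parallel ESSL code: `op_element` and `gemm_reference` are hypothetical names, the matrices are ordinary nested lists, and the block-cyclic distribution is ignored entirely. It reads A and B in their stored, untransposed form and applies the transpose or conjugate transpose only by swapping indices.

```python
def op_element(x, trans, i, j):
    """Return element (i, j) of op(X) while X stays in its stored form."""
    t = trans.upper()
    if t == 'N':
        return x[i][j]
    if t == 'T':
        return x[j][i]                 # transpose: swap the indices
    if t == 'C':
        return x[j][i].conjugate()     # conjugate transpose
    raise ValueError("trans must be 'N', 'T', or 'C'")


def gemm_reference(transa, transb, m, n, k, alpha, a, b, beta, c):
    """Serial model of C <-- alpha*op(A)*op(B) + beta*C on nested lists."""
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            s = sum(op_element(a, transa, i, l) * op_element(b, transb, l, j)
                    for l in range(k))
            row.append(alpha * s + beta * c[i][j])
        result.append(row)
    return result
```

For real data, `'C'` gives the same result as `'T'` in this model, because conjugating a real number leaves it unchanged.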

In the following four cases, no computation is performed and the subroutine returns after doing some parameter checking:

m = 0
n = 0
alpha is zero and beta is one.
k = 0 and beta is one.

Assuming the above conditions do not exist, if beta is not one and k is 0, then betaC is returned.
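The special cases above amount to a short decision procedure. The following Python sketch restates them; the function name `quick_return_action` and its string results are hypothetical and are not part of the library interface.

```python
def quick_return_action(m, n, k, alpha, beta):
    """Classify a call as 'return' (C untouched), 'scale' (C <-- beta*C),
    or 'compute' (the full product is formed)."""
    if m == 0 or n == 0:
        return 'return'
    if beta == 1 and (alpha == 0 or k == 0):
        return 'return'
    if k == 0:
        return 'scale'     # beta is not one here, so beta*C is returned
    # Everything else, including alpha == 0 with beta != 1, follows the formula.
    return 'compute'
```

In this model a call with m = 6, n = 4, k = 0, and beta = 2.0 is classified as `'scale'`, and the same call with beta = 1.0 as `'return'`.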

See references [14] and [15].

Table 45. Data Types

A, B, C, alpha, beta	Subroutine
Long-precision real	PDGEMM
Long-precision complex	PZGEMM

Syntax

Fortran	CALL PDGEMM | PZGEMM (`transa`, `transb`, `m`, `n`, `k`, `alpha`, `a`, `ia`, `ja`, `desc_a`, `b`, `ib`, `jb`, `desc_b`, `beta`, `c`, `ic`, `jc`, `desc_c`)
C and C++	pdgemm | pzgemm (`transa`, `transb`, `m`, `n`, `k`, `alpha`, `a`, `ia`, `ja`, `desc_a`, `b`, `ib`, `jb`, `desc_b`, `beta`, `c`, `ic`, `jc`, `desc_c`);

On Entry

transa

indicates the form of matrix A to use in the computation, where:

If transa = 'N', A is used in the computation.

If transa = 'T', A^T is used in the computation.

If transa = 'C', A^H is used in the computation.

Scope: global

Specified as: a single character; transa = 'N', 'T', or 'C'

transb

indicates the form of matrix B to use in the computation, where:

If transb = 'N', B is used in the computation.

If transb = 'T', B^T is used in the computation.

If transb = 'C', B^H is used in the computation.

Scope: global

Specified as: a single character; transb = 'N', 'T', or 'C'

m

is the number of rows in submatrix C used in the computation, and:

If transa = 'N', it is the number of rows in submatrix A.

If transa = 'T' or 'C', it is the number of columns in submatrix A.

Scope: global

Specified as: a fullword integer; m >= 0.

n

is the number of columns in submatrix C used in the computation, and:

If transb = 'N', it is the number of columns in submatrix B.

If transb = 'T' or 'C', it is the number of rows in submatrix B.

Scope: global

Specified as: a fullword integer; n >= 0.

k

has the following meaning:

If transa = 'N', it is the number of columns in submatrix A.

If transa = 'T' or 'C', it is the number of rows in submatrix A.

In addition:

If transb = 'N', it is the number of rows in submatrix B.

If transb = 'T' or 'C', it is the number of columns in submatrix B.

Scope: global

Specified as: a fullword integer; k >= 0.

alpha

is the scalar alpha.

Scope: global

Specified as: a number of the data type indicated in Table 45.

a

is the local part of the global general matrix A. This identifies the first element of the local array A. This subroutine computes the location of the first element of the local subarray used, based on ia, ja, desc_a, p, q, myrow, and mycol; therefore:

If transa = 'N', the leading LOCp(ia+m-1) by LOCq(ja+k-1) part of the local array A must contain the local pieces of the leading ia+m-1 by ja+k-1 part of the global matrix.
If transa = 'T' or 'C', the leading LOCp(ia+k-1) by LOCq(ja+m-1) part of the local array A must contain the local pieces of the leading ia+k-1 by ja+m-1 part of the global matrix.

Note: No data should be moved to form A^T or A^H; that is, the matrix A should always be stored in its untransposed form.

Scope: local

Specified as: an LLD_A by (at least) LOCq(N_A) array, containing numbers of the data type indicated in Table 45. Details about the block-cyclic data distribution of global matrix A are stored in desc_a.

ia

is the row index of the global matrix A, identifying the first row of the submatrix A.

Scope: global

Specified as: a fullword integer; 1 <= ia <= M_A, and:

If transa = 'N', then ia+m-1 <= M_A.

If transa = 'T' or 'C', then ia+k-1 <= M_A.

ja

is the column index of the global matrix A, identifying the first column of the submatrix A.

Scope: global

Specified as: a fullword integer; 1 <= ja <= N_A, and:

If transa = 'N', then ja+k-1 <= N_A.

If transa = 'T' or 'C', then ja+m-1 <= N_A.

desc_a

is the array descriptor for global matrix A, described in the following table:

`desc_a`	Name	Description	Limits	Scope
1	DTYPE_A	Descriptor type	DTYPE_A=1	Global
2	CTXT_A	BLACS context	Valid value, as returned by BLACS_GRIDINIT or BLACS_GRIDMAP	Global
3	M_A	Number of rows in the global matrix	If `m` = 0 or `k` = 0: M_A >= 0 Otherwise: M_A >= 1	Global
4	N_A	Number of columns in the global matrix	If `m` = 0 or `k` = 0: N_A >= 0 Otherwise: N_A >= 1	Global
5	MB_A	Row block size	MB_A >= 1	Global
6	NB_A	Column block size	NB_A >= 1	Global
7	RSRC_A	The process row of the `p` × `q` grid over which the first row of the global matrix is distributed	0 <= RSRC_A < `p`	Global
8	CSRC_A	The process column of the `p` × `q` grid over which the first column of the global matrix is distributed	0 <= CSRC_A < `q`	Global
9	LLD_A	The leading dimension of the local array	LLD_A >= max(1,LOCp(M_A))	Local

Specified as: an array of (at least) length 9, containing fullword integers.
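As an illustration of the descriptor layout, the following Python sketch assembles the nine entries in table order and checks the limits that do not depend on other arguments. The helper `make_descriptor` is hypothetical, the context value is a placeholder integer rather than a real BLACS context, and the checks are weaker than the table: the M_ and N_ limits that depend on m, n, and k, and the LLD_ limit that depends on LOCp(M_), are not modeled.

```python
def make_descriptor(ctxt, m_x, n_x, mb_x, nb_x, rsrc_x, csrc_x, lld_x, p, q):
    """Build a 9-element type-1 descriptor after basic limit checks.

    Python lists are 0-based, so entry i of the table is element i-1 here.
    """
    if m_x < 0 or n_x < 0:
        raise ValueError("M_ and N_ must be nonnegative")
    if mb_x < 1 or nb_x < 1:
        raise ValueError("MB_ and NB_ must be at least 1")
    if not (0 <= rsrc_x < p and 0 <= csrc_x < q):
        raise ValueError("RSRC_ and CSRC_ must lie inside the p x q grid")
    if lld_x < 1:
        raise ValueError("LLD_ must be at least 1")
    # DTYPE_ = 1, then CTXT_, M_, N_, MB_, NB_, RSRC_, CSRC_, LLD_
    return [1, ctxt, m_x, n_x, mb_x, nb_x, rsrc_x, csrc_x, lld_x]
```

For a 6 × 5 global matrix with 3 × 2 blocks on a 2 × 2 grid, `make_descriptor(0, 6, 5, 3, 2, 0, 0, 3, 2, 2)` yields `[1, 0, 6, 5, 3, 2, 0, 0, 3]`.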

b

is the local part of the global general matrix B. This identifies the first element of the local array B. This subroutine computes the location of the first element of the local subarray used, based on ib, jb, desc_b, p, q, myrow, and mycol; therefore:

If transb = 'N', the leading LOCp(ib+k-1) by LOCq(jb+n-1) part of the local array B must contain the local pieces of the leading ib+k-1 by jb+n-1 part of the global matrix.
If transb = 'T' or 'C', the leading LOCp(ib+n-1) by LOCq(jb+k-1) part of the local array B must contain the local pieces of the leading ib+n-1 by jb+k-1 part of the global matrix.

Note: No data should be moved to form B^T or B^H; that is, the matrix B should always be stored in its untransposed form.

Scope: local

Specified as: an LLD_B by (at least) LOCq(N_B) array, containing numbers of the data type indicated in Table 45. Details about the block-cyclic data distribution of global matrix B are stored in desc_b.

ib

is the row index of the global matrix B, identifying the first row of the submatrix B.

Scope: global

Specified as: a fullword integer; 1 <= ib <= M_B, and:

If transb = 'N', then ib+k-1 <= M_B.

If transb = 'T' or 'C', then ib+n-1 <= M_B.

jb

is the column index of the global matrix B, identifying the first column of the submatrix B.

Scope: global

Specified as: a fullword integer; 1 <= jb <= N_B, and:

If transb = 'N', then jb+n-1 <= N_B.

If transb = 'T' or 'C', then jb+k-1 <= N_B.

desc_b

is the array descriptor for global matrix B, described in the following table:

`desc_b`	Name	Description	Limits	Scope
1	DTYPE_B	Descriptor type	DTYPE_B=1	Global
2	CTXT_B	BLACS context	Valid value, as returned by BLACS_GRIDINIT or BLACS_GRIDMAP	Global
3	M_B	Number of rows in the global matrix	If `k` = 0 or `n` = 0: M_B >= 0 Otherwise: M_B >= 1	Global
4	N_B	Number of columns in the global matrix	If `k` = 0 or `n` = 0: N_B >= 0 Otherwise: N_B >= 1	Global
5	MB_B	Row block size	MB_B >= 1	Global
6	NB_B	Column block size	NB_B >= 1	Global
7	RSRC_B	The process row of the `p` × `q` grid over which the first row of the global matrix is distributed	0 <= RSRC_B < `p`	Global
8	CSRC_B	The process column of the `p` × `q` grid over which the first column of the global matrix is distributed	0 <= CSRC_B < `q`	Global
9	LLD_B	The leading dimension of the local array	LLD_B >= max(1,LOCp(M_B))	Local

Specified as: an array of (at least) length 9, containing fullword integers.

beta

is the scalar beta.

Scope: global

Specified as: a number of the data type indicated in Table 45.

c

is the local part of the global general matrix C. This identifies the first element of the local array C. This subroutine computes the location of the first element of the local subarray used, based on ic, jc, desc_c, p, q, myrow, and mycol; therefore, the leading LOCp(ic+m-1) by LOCq(jc+n-1) part of the local array C must contain the local pieces of the leading ic+m-1 by jc+n-1 part of the global matrix.

When beta is zero, C need not be set on input.

Scope: local

Specified as: an LLD_C by (at least) LOCq(N_C) array, containing numbers of the data type indicated in Table 45. Details about the block-cyclic data distribution of global matrix C are stored in desc_c.

ic

is the row index of the global matrix C, identifying the first row of the submatrix C.

Scope: global

Specified as: a fullword integer; 1 <= ic <= M_C and ic+m-1 <= M_C.

jc

is the column index of the global matrix C, identifying the first column of the submatrix C.

Scope: global

Specified as: a fullword integer; 1 <= jc <= N_C and jc+n-1 <= N_C.

desc_c

is the array descriptor for global matrix C, described in the following table:

`desc_c`	Name	Description	Limits	Scope
1	DTYPE_C	Descriptor type	DTYPE_C=1	Global
2	CTXT_C	BLACS context	Valid value, as returned by BLACS_GRIDINIT or BLACS_GRIDMAP	Global
3	M_C	Number of rows in the global matrix	If `m` = 0 or `n` = 0: M_C >= 0 Otherwise: M_C >= 1	Global
4	N_C	Number of columns in the global matrix	If `m` = 0 or `n` = 0: N_C >= 0 Otherwise: N_C >= 1	Global
5	MB_C	Row block size	MB_C >= 1	Global
6	NB_C	Column block size	NB_C >= 1	Global
7	RSRC_C	The process row of the `p` × `q` grid over which the first row of the global matrix is distributed	0 <= RSRC_C < `p`	Global
8	CSRC_C	The process column of the `p` × `q` grid over which the first column of the global matrix is distributed	0 <= CSRC_C < `q`	Global
9	LLD_C	The leading dimension of the local array	LLD_C >= max(1,LOCp(M_C))	Local

Specified as: an array of (at least) length 9, containing fullword integers.

On Return

c

is the updated local part of the global matrix C, containing the results of the computation.

Scope: local

Returned as: an LLD_C by (at least) LOCq(N_C) array, containing numbers of the data type indicated in Table 45.

Notes and Coding Rules

1. These subroutines accept lowercase letters for the transa and transb arguments.
2. For PDGEMM, if you specify 'C' for the transa or transb argument, it is interpreted as though you specified 'T'.
3. The matrices must have no common elements; otherwise, results are unpredictable.
4. The NUMROC utility subroutine can be used to determine the values of LOCp(M_) and LOCq(N_) used in the argument descriptions above. For details, see Determining the Number of Rows and Columns in Your Local Arrays and NUMROC--Compute the Number of Rows or Columns of a Block-Cyclically Distributed Matrix Contained in a Process.
5. For suggested block sizes, see Coding Tips for Optimizing Parallel Performance.
6. The following values must be equal: CTXT_A = CTXT_B = CTXT_C.

7. The coding rules described in this note depend upon which matrix--A, B, or C--is used as the reference matrix, which is referred to, in general, as matrix X. For each of the three possible selections for the reference matrix, there is a unique set of coding rules that must be met. These are detailed in Table 46 and Table 47. Follow these steps to select a reference matrix and determine what coding rules to use:

Step 1: First, the reference matrix is selected. For optimal performance, the reference matrix is selected based on the arguments m, n, and k, as follows:

If k <= min(m, n), then X = C

If n <= min(m, k), then X = A

If m <= min(n, k), then X = B

The matrix selected must satisfy coding rules a and d, described below, to be a suitable reference matrix. If it does, processing continues with Step 2. If it does not, the subroutine checks whether either of the other two matrices satisfies coding rules a, c, and d, which would make it a suitable reference matrix. If one of them is suitable, processing continues with Step 2. If neither matrix is suitable, an error condition results.

Step 2: After a suitable reference matrix is chosen in Step 1, all remaining coding rules, described below, are checked. If the rules are satisfied, the subroutine continues normally. If they are not, an error condition results.
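The first part of Step 1 can be sketched as a small function. This Python model covers only the initial, performance-based choice; the function name is hypothetical, the fallback to another matrix when the first choice is unsuitable is not modeled, and resolving ties in the order the conditions are listed above is an assumption, since the text does not say how ties are handled.

```python
def preferred_reference_matrix(m, n, k):
    """Return 'C', 'A', or 'B' according to which of k, n, m is smallest."""
    if k <= min(m, n):
        return 'C'
    if n <= min(m, k):
        return 'A'
    # Neither k nor n is the minimum, so m <= min(n, k) holds here.
    return 'B'
```

For m = 6, n = 4, k = 5 the sketch selects A, because n is the smallest of the three dimensions.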

Coding Rules: Following are the coding rules:

a. The reference matrix must be aligned on a block boundary; that is:

   ix-1 must be a multiple of MB_X.
   jx-1 must be a multiple of NB_X.

   These indexes are indicated in column 5 of Table 46 for each entry for X.

b. The block sizes that must be equal are indicated in column 4 of Table 46 for each entry for X. The rules for block sizes depend only upon the values of transa and transb, and not on the reference matrix selected; however, for your convenience, the rules are repeated in the table for each reference matrix.

c. Given the reference matrix X, additional rules apply to the block row and block column offsets of the two nonreference matrices. These rules are listed in column 7 of Table 46 for each entry for X. These rules must only be met when looping is required--that is, either of the conditions in column 8 is met.

d. The indexes of the nonreference matrices, which need to be on a block boundary, are listed in column 6 of Table 46 for each entry for X.

Table 46. Coding Rules for the Reference Matrix X

-1- X	-2- `transa`	-3- `transb`	-4- (b) Equal Block Sizes	-5- (a) Block Bndry For X	-6- (d) Block Bndry For Other	-7- (c) Equal Block Offsets (If Looping is Required)	-8- (c) Conditions For Looping
A	'N'	'N'	MB_A = MB_C NB_B = NB_C NB_A = MB_B	`ia, ja`	`ib, ic`	mod(`jb`-1, NB_B) = mod(`jc`-1, NB_C)	`n`+mod(`jb`-1, NB_B) > NB_B -or- `n`+mod(`jc`-1, NB_C) > NB_C
A	'N'	'T' or 'C'	MB_A = MB_C MB_B = NB_C NB_A = NB_B	`ia, ja`	`jb, ic`	mod(`ib`-1, MB_B) = mod(`jc`-1, NB_C)	`n`+mod(`ib`-1, MB_B) > MB_B -or- `n`+mod(`jc`-1, NB_C) > NB_C
A	'T' or 'C'	'N'	NB_A = MB_C NB_B = NB_C MB_A = MB_B	`ia, ja`	`ib, ic`	mod(`jb`-1, NB_B) = mod(`jc`-1, NB_C)	`n`+mod(`jb`-1, NB_B) > NB_B -or- `n`+mod(`jc`-1, NB_C) > NB_C
A	'T' or 'C'	'T' or 'C'	NB_A = MB_C MB_B = NB_C MB_A = NB_B	`ia, ja`	`jb, ic`	mod(`ib`-1, MB_B) = mod(`jc`-1, NB_C)	`n`+mod(`ib`-1, MB_B) > MB_B -or- `n`+mod(`jc`-1, NB_C) > NB_C
B	'N'	'N'	MB_A = MB_C NB_B = NB_C NB_A = MB_B	`ib, jb`	`ja, jc`	mod(`ia`-1, MB_A) = mod(`ic`-1, MB_C)	`m`+mod(`ia`-1, MB_A) > MB_A -or- `m`+mod(`ic`-1, MB_C) > MB_C
B	'N'	'T' or 'C'	MB_A = MB_C MB_B = NB_C NB_A = NB_B	`ib, jb`	`ja, jc`	mod(`ia`-1, MB_A) = mod(`ic`-1, MB_C)	`m`+mod(`ia`-1, MB_A) > MB_A -or- `m`+mod(`ic`-1, MB_C) > MB_C
B	'T' or 'C'	'N'	NB_A = MB_C NB_B = NB_C MB_A = MB_B	`ib, jb`	`ia, jc`	mod(`ja`-1, NB_A) = mod(`ic`-1, MB_C)	`m`+mod(`ja`-1, NB_A) > NB_A -or- `m`+mod(`ic`-1, MB_C) > MB_C
B	'T' or 'C'	'T' or 'C'	NB_A = MB_C MB_B = NB_C MB_A = NB_B	`ib, jb`	`ia, jc`	mod(`ja`-1, NB_A) = mod(`ic`-1, MB_C)	`m`+mod(`ja`-1, NB_A) > NB_A -or- `m`+mod(`ic`-1, MB_C) > MB_C
C	'N'	'N'	MB_A = MB_C NB_B = NB_C NB_A = MB_B	`ic, jc`	`ia, jb`	mod(`ja`-1, NB_A) = mod(`ib`-1, MB_B)	`k`+mod(`ja`-1, NB_A) > NB_A -or- `k`+mod(`ib`-1, MB_B) > MB_B
C	'N'	'T' or 'C'	MB_A = MB_C MB_B = NB_C NB_A = NB_B	`ic, jc`	`ia, ib`	mod(`ja`-1, NB_A) = mod(`jb`-1, NB_B)	`k`+mod(`ja`-1, NB_A) > NB_A -or- `k`+mod(`jb`-1, NB_B) > NB_B
C	'T' or 'C'	'N'	NB_A = MB_C NB_B = NB_C MB_A = MB_B	`ic, jc`	`ja, jb`	mod(`ia`-1, MB_A) = mod(`ib`-1, MB_B)	`k`+mod(`ia`-1, MB_A) > MB_A -or- `k`+mod(`ib`-1, MB_B) > MB_B
C	'T' or 'C'	'T' or 'C'	NB_A = MB_C MB_B = NB_C MB_A = NB_B	`ic, jc`	`ja, ib`	mod(`ia`-1, MB_A) = mod(`jb`-1, NB_B)	`k`+mod(`ia`-1, MB_A) > MB_A -or- `k`+mod(`jb`-1, NB_B) > NB_B

e. Additional rules apply to the row and column alignment of the various matrices in the process grid; specifically, the process row or process column containing the first row or column of the reference submatrix X, respectively, must also contain the first row or column of one of the other two nonreference submatrices, as indicated in column 4 of Table 47 for each entry for X. Following is the definition of ixrow and ixcol, which holds true for A, B, and C:

ixrow = mod((((ix-1)/MB_X)+RSRC_X), p)

ixcol = mod((((jx-1)/NB_X)+CSRC_X), q)

Table 47. Coding Rules for the Reference Matrix X

-1- X	-2- `transa`	-3- `transb`	-4- (e) Process Grid Alignment
A	'N'	'N'	`iarow` = `icrow`
A	'N'	'T' or 'C'	`iarow` = `icrow` `ibcol` = `iacol`
A	'T' or 'C'	'N'	`iarow` = `ibrow`
A	'T' or 'C'	'T' or 'C'	(no rules)
B	'N'	'N'	`ibcol` = `iccol`
B	'N'	'T' or 'C'	`ibcol` = `iacol`
B	'T' or 'C'	'N'	`iarow` = `ibrow` `ibcol` = `iccol`
B	'T' or 'C'	'T' or 'C'	(no rules)
C	'N'	'N'	`iarow` = `icrow` `ibcol` = `iccol`
C	'N'	'T' or 'C'	`iarow` = `icrow`
C	'T' or 'C'	'N'	`ibcol` = `iccol`
C	'T' or 'C'	'T' or 'C'	(no rules)
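The definitions of ixrow and ixcol used by this rule can be evaluated directly. The Python sketch below uses a hypothetical function name and assumes the division in the formulas is integer division, as in Fortran integer arithmetic.

```python
def process_coords(ix, jx, mb_x, nb_x, rsrc_x, csrc_x, p, q):
    """Return (ixrow, ixcol): the process row and process column that hold
    global element (ix, jx), using 1-based global indexes."""
    ixrow = ((ix - 1) // mb_x + rsrc_x) % p
    ixcol = ((jx - 1) // nb_x + csrc_x) % q
    return ixrow, ixcol
```

With 3 × 2 blocks, RSRC_X = CSRC_X = 0, and a 2 × 2 grid, element (1, 1) lies on process (0, 0), while element (4, 3) lies on process (1, 1).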

Example: Following is an example of the coding rules necessary for the case where transa = 'N' and transb = 'N', where the reference matrix selected is A. Following are the indexes, dimensions, and block sizes used in the computation for the matrices:

Indexes:        ic  jc                   ia  ja        ib  jb                ic  jc
                 |   |                    |   |         |   |                 |   |
Dimensions:  C ( m , n )  <--  alpha  A ( m , k )   B ( k , n )  +  beta  C ( m , n )
                 |   |                    |   |         |   |                 |   |
Block Sizes:   MB_C NB_C                MB_A NB_A     MB_B NB_B             MB_C NB_C

a. A must be aligned on a block boundary, as indicated in column 5 in Table 46:

   ia-1 must be a multiple of MB_A.
   ja-1 must be a multiple of NB_A.

b. The block sizes that correspond to each matrix dimension must be equal, where MB_ represents the row dimension and NB_ represents the column dimension, as indicated in column 4 in Table 46:

   MB_A = MB_C
   NB_B = NB_C
   NB_A = MB_B

c. As shown above, m and k are the dimensions of the reference matrix A; therefore, n is used to determine if looping is required; that is, if one of the following is true, as indicated in column 8 in Table 46:

   n+mod(jc-1, NB_C) > NB_C
   n+mod(jb-1, NB_B) > NB_B

   then the following offsets must be equal, as indicated in column 7 in Table 46:

   mod(jb-1, NB_B) = mod(jc-1, NB_C)

d. The other indexes from each of the nonreference matrices--not used in c above--must be aligned on a block boundary, as indicated in column 6 in Table 46:

   ic-1 must be a multiple of MB_C.
   ib-1 must be a multiple of MB_B.

e. In the process grid, the process row containing the first row of the submatrix A must also contain the first row of the submatrix C, as indicated in column 4 in Table 47; that is, iarow = icrow, where:

   iarow = mod((((ia-1)/MB_A)+RSRC_A), p)
   icrow = mod((((ic-1)/MB_C)+RSRC_C), p)
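The five rules of this example, for transa = 'N', transb = 'N', and reference matrix A, can be collected into one check. The Python sketch below is a model only: the function name is hypothetical, and each descriptor is represented as a dictionary with keys `'MB'`, `'NB'`, and `'RSRC'`, which is an illustrative layout and not the library's 9-element array.

```python
def check_nn_reference_a(n, ia, ja, ib, jb, ic, jc, desc_a, desc_b, desc_c, p):
    """Return the letters of the violated rules (a-e) for transa = transb = 'N'
    with A as the reference matrix; an empty list means all rules hold."""
    mb_a, nb_a = desc_a['MB'], desc_a['NB']
    mb_b, nb_b = desc_b['MB'], desc_b['NB']
    mb_c, nb_c = desc_c['MB'], desc_c['NB']
    violated = []
    # a. The reference matrix A starts on a block boundary.
    if (ia - 1) % mb_a != 0 or (ja - 1) % nb_a != 0:
        violated.append('a')
    # b. Corresponding block sizes are equal.
    if not (mb_a == mb_c and nb_b == nb_c and nb_a == mb_b):
        violated.append('b')
    # c. Column offsets of B and C agree, but only when looping is required.
    looping = (n + (jb - 1) % nb_b > nb_b) or (n + (jc - 1) % nb_c > nb_c)
    if looping and (jb - 1) % nb_b != (jc - 1) % nb_c:
        violated.append('c')
    # d. The remaining indexes ib and ic are on block boundaries.
    if (ib - 1) % mb_b != 0 or (ic - 1) % mb_c != 0:
        violated.append('d')
    # e. The first rows of submatrices A and C are in the same process row.
    iarow = ((ia - 1) // mb_a + desc_a['RSRC']) % p
    icrow = ((ic - 1) // mb_c + desc_c['RSRC']) % p
    if iarow != icrow:
        violated.append('e')
    return violated
```

With n = 4, all indexes equal to 1, block sizes 3 × 2 for A and C and 2 × 2 for B, all RSRC_ values 0, and p = 2, the sketch reports no violations. Moving ic to 4 keeps C on a block boundary but places its first row in process row 1, so only rule e is reported.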

Error Conditions

Input-Argument and Miscellaneous Errors

Stage 1:

DTYPE_A is invalid.
DTYPE_B is invalid.
DTYPE_C is invalid.

Stage 2:

CTXT_A is invalid.

Stage 3:

The subroutine was called from outside the process grid.

Stage 4:

transa <> 'N', 'T', or 'C'
transb <> 'N', 'T', or 'C'
m < 0
n < 0
k < 0
M_A < 0 and (m = 0 or k = 0); M_A < 1 otherwise
N_A < 0 and (m = 0 or k = 0); N_A < 1 otherwise
M_B < 0 and (k = 0 or n = 0); M_B < 1 otherwise
N_B < 0 and (k = 0 or n = 0); N_B < 1 otherwise
M_C < 0 and (m = 0 or n = 0); M_C < 1 otherwise
N_C < 0 and (m = 0 or n = 0); N_C < 1 otherwise
ia < 1
ib < 1
ic < 1
ja < 1
jb < 1
jc < 1
MB_A < 1
MB_B < 1
MB_C < 1
NB_A < 1
NB_B < 1
NB_C < 1
RSRC_A < 0 or RSRC_A >= p
RSRC_B < 0 or RSRC_B >= p
RSRC_C < 0 or RSRC_C >= p
CSRC_A < 0 or CSRC_A >= q
CSRC_B < 0 or CSRC_B >= q
CSRC_C < 0 or CSRC_C >= q
CTXT_A <> CTXT_B
CTXT_A <> CTXT_C

Stage 5:

If m <> 0 and k <> 0:

transa = 'N' and ia+m-1 > M_A
transa = 'T' or 'C' and ia+k-1 > M_A
transa = 'N' and ja+k-1 > N_A
transa = 'T' or 'C' and ja+m-1 > N_A
ia > M_A
ja > N_A

If n <> 0 and k <> 0:

transb = 'N' and ib+k-1 > M_B
transb = 'T' or 'C' and ib+n-1 > M_B
transb = 'N' and jb+n-1 > N_B
transb = 'T' or 'C' and jb+k-1 > N_B
ib > M_B
jb > N_B

If m <> 0 and n <> 0:

ic+m-1 > M_C
jc+n-1 > N_C
ic > M_C
jc > N_C

For the reference matrix (defined in note 7 in Notes and Coding Rules) and the appropriate transa and transb values, the indexes listed in column 5 of Table 46 are not aligned on a block boundary, where boundary alignment is defined as:

ix-1 must be a multiple of MB_X.
jx-1 must be a multiple of NB_X.

For the two nonreference matrices (defined in note 7 in Notes and Coding Rules) and the appropriate transa and transb values, the indexes listed in column 6 of Table 46 are not aligned on a block boundary. Using Z to represent one of the nonreference matrices, each boundary alignment is expressed as one of the following:

iz-1 must be a multiple of MB_Z.
jz-1 must be a multiple of NB_Z.

For the reference matrix (defined in note 7 in Notes and Coding Rules) and the appropriate transa and transb values, if looping occurs--that is, one of the conditions in column 8 of Table 46 is true--then the block offsets indicated in column 7 are not equal.

Stage 6:

For the appropriate transa and transb values indicated in Table 46 (where the reference matrix does not matter), some of the block sizes indicated in column 4 are not equal.
LLD_A < max(1, LOCp(M_A))
LLD_B < max(1, LOCp(M_B))
LLD_C < max(1, LOCp(M_C))
In the process grid, the process row or process column containing the first row or column of the reference submatrix X (defined in note 7 in Notes and Coding Rules), respectively, does not contain the first row or column of one of the other two nonreference submatrices, as indicated in column 4 of Table 47. Following is the definition of ixrow and ixcol, which holds true for A, B, and C:

ixrow = mod((((ix-1)/MB_X)+RSRC_X), p)
ixcol = mod((((jx-1)/NB_X)+CSRC_X), q)

Example 1

This example computes C = betaC+alphaAB using a 2 × 2 process grid.

Call Statements and Input

 ORDER = 'R'
 NPROW = 2
 NPCOL = 2
 CALL BLACS_GET (0, 0, ICONTXT)
 CALL BLACS_GRIDINIT(ICONTXT, ORDER, NPROW, NPCOL)
 CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYROW, MYCOL)
 
            TRANSA TRANSB  M    N    K   ALPHA    A  IA  JA   DESC_A   B  IB  JB
               |      |    |    |    |     |      |   |   |     |      |   |   |
 CALL PDGEMM( 'N' ,  'N' , 6  , 4  , 5 , 1.0D0  , A , 1 , 1 , DESC_A , B , 1 , 1 ,
 
              DESC_B    BETA    C  IC  JC   DESC_C
                |         |     |   |   |     |
              DESC_B ,  2.0D0 , C , 1 , 1 , DESC_C )

	Desc_A	Desc_B	Desc_C
DTYPE_	1	1	1
CTXT_	`icontxt` (1)	`icontxt` (1)	`icontxt` (1)
M_	6	5	6
N_	5	4	4
MB_	3	2	3
NB_	2	2	2
RSRC_	0	0	0
CSRC_	0	0	0
LLD_	See below (2)	See below (2)	See below (2)

Notes:

1. `icontxt` is the output of the BLACS_GRIDINIT call.
2. Each process should set the LLD_ as follows:

   LLD_A = MAX(1,NUMROC(M_A, MB_A, MYROW, RSRC_A, NPROW))
   LLD_B = MAX(1,NUMROC(M_B, MB_B, MYROW, RSRC_B, NPROW))
   LLD_C = MAX(1,NUMROC(M_C, MB_C, MYROW, RSRC_C, NPROW))

   In this example, LLD_A = LLD_C = 3 on all processes, and LLD_B = 3 on P₀₀ and P₀₁ and LLD_B = 2 on P₁₀ and P₁₁.
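The LLD_ values for this example can be reproduced with a serial model of the NUMROC counting rule. The Python function below is a sketch of that rule, not the library routine; its argument order mirrors the NUMROC calls in the notes above (global dimension, block size, process coordinate, source process coordinate, number of processes).

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Count the rows (or columns) of a block-cyclically distributed
    dimension of length n that fall on process coordinate iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs   # distance from source process
    nblocks = n // nb                               # number of full blocks
    count = (nblocks // nprocs) * nb                # full rounds over all processes
    extrablks = nblocks % nprocs                    # leftover full blocks
    if mydist < extrablks:
        count += nb                                 # one extra full block
    elif mydist == extrablks:
        count += n % nb                             # the trailing partial block
    return count


# Descriptor values of this example: (M_, MB_) per matrix, RSRC_ = 0, NPROW = 2.
for name, m_x, mb_x in (("A", 6, 3), ("B", 5, 2), ("C", 6, 3)):
    llds = [max(1, numroc(m_x, mb_x, myrow, 0, 2)) for myrow in range(2)]
    print("LLD_" + name, "by process row:", llds)
```

In this model LLD_A and LLD_C are 3 on both process rows, while LLD_B is 3 on process row 0 and 2 on process row 1.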

Global general 6 × 5 matrix A with block size 3 × 2:

B,D        0             1          2
     *                                  *
     |  1.0  2.0  |  -1.0 -1.0  |   4.0 |
 0   |  2.0  0.0  |   1.0  1.0  |  -1.0 |
     |  1.0 -1.0  |  -1.0  1.0  |   2.0 |
     | -----------|-------------|------ |
     | -3.0  2.0  |   2.0  2.0  |   0.0 |
 1   |  4.0  0.0  |  -2.0  1.0  |  -1.0 |
     | -1.0 -1.0  |   1.0 -3.0  |   2.0 |
     *                                  *

The following is the 2 × 2 process grid:

B,D  |  0 2  | 1 
-----|-------|-----
0    |   P₀₀   |  P₀₁
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for A:

p,q  |       0         |      1
-----|-----------------|------------
     |  1.0  2.0  4.0  |  -1.0 -1.0
 0   |  2.0  0.0 -1.0  |   1.0  1.0
     |  1.0 -1.0  2.0  |  -1.0  1.0
-----|-----------------|------------
     | -3.0  2.0  0.0  |   2.0  2.0
 1   |  4.0  0.0 -1.0  |  -2.0  1.0
     | -1.0 -1.0  2.0  |   1.0 -3.0
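The mapping from the global matrix to the local arrays can be sketched in a few lines. In the Python model below, `local_array` is a hypothetical helper that returns row-major nested lists; the real local arrays are Fortran column-major arrays with leading dimension LLD_, which this sketch does not model.

```python
def local_array(global_matrix, mb, nb, myrow, mycol, nprow, npcol, rsrc=0, csrc=0):
    """Return the part of a block-cyclically distributed matrix that is
    owned by process (myrow, mycol), using 0-based row and column indexes."""
    rows = [i for i in range(len(global_matrix))
            if (i // mb + rsrc) % nprow == myrow]
    cols = [j for j in range(len(global_matrix[0]))
            if (j // nb + csrc) % npcol == mycol]
    return [[global_matrix[i][j] for j in cols] for i in rows]


# Global 6 x 5 matrix A of this example, distributed in 3 x 2 blocks.
A = [[ 1.0,  2.0, -1.0, -1.0,  4.0],
     [ 2.0,  0.0,  1.0,  1.0, -1.0],
     [ 1.0, -1.0, -1.0,  1.0,  2.0],
     [-3.0,  2.0,  2.0,  2.0,  0.0],
     [ 4.0,  0.0, -2.0,  1.0, -1.0],
     [-1.0, -1.0,  1.0, -3.0,  2.0]]

for myrow in range(2):
    for mycol in range(2):
        print((myrow, mycol), local_array(A, 3, 2, myrow, mycol, 2, 2))
```

Process column 0 receives column blocks 0 and 2 (global columns 1, 2, and 5), which is why its local arrays are three columns wide.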

Global general 5 × 4 matrix B with block size 2 × 2:

B,D        0             1
     *                         *
 0   |  1.0 -1.0  |   0.0  2.0 |
     |  2.0  2.0  |  -1.0 -2.0 |
     | -----------|----------- |
 1   |  1.0  0.0  |  -1.0  1.0 |
     | -3.0 -1.0  |   1.0 -1.0 |
     | -----------|----------- |
 2   |  4.0  2.0  |  -1.0  1.0 |
     *                         *

The following is the 2 × 2 process grid:

B,D  |   0   | 1 
-----|-------|-----
0    |   P₀₀   |  P₀₁
2    |       |
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for B:

p,q  |     0      |      1
-----|------------|------------
     |  1.0 -1.0  |   0.0  2.0
 0   |  2.0  2.0  |  -1.0 -2.0
     |  4.0  2.0  |  -1.0  1.0
-----|------------|------------
 1   |  1.0  0.0  |  -1.0  1.0
     | -3.0 -1.0  |   1.0 -1.0

Global general 6 × 4 matrix C with block size 3 × 2:

B,D        0             1
     *                         *
     |  0.5  0.5  |   0.5  0.5 |
 0   |  0.5  0.5  |   0.5  0.5 |
     |  0.5  0.5  |   0.5  0.5 |
     | -----------|----------- |
     |  0.5  0.5  |   0.5  0.5 |
 1   |  0.5  0.5  |   0.5  0.5 |
     |  0.5  0.5  |   0.5  0.5 |
     *                         *

The following is the 2 × 2 process grid:

B,D  |   0   | 1 
-----|-------|-----
0    |   P₀₀   |  P₀₁
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for C:

p,q  |     0      |      1
-----|------------|------------
     |  0.5  0.5  |   0.5  0.5
 0   |  0.5  0.5  |   0.5  0.5
     |  0.5  0.5  |   0.5  0.5
-----|------------|------------
     |  0.5  0.5  |   0.5  0.5
 1   |  0.5  0.5  |   0.5  0.5
     |  0.5  0.5  |   0.5  0.5

Output:

Global general 6 × 4 matrix C with block size 3 × 2:

B,D         0               1
     *                             *
     |  24.0  13.0  |   -5.0   3.0 |
 0   |  -3.0  -4.0  |    2.0   4.0 |
     |   4.0   1.0  |    2.0   5.0 |
     | -------------|------------- |
     |  -2.0   6.0  |   -1.0  -9.0 |
 1   |  -4.0  -6.0  |    5.0   5.0 |
     |  16.0   7.0  |   -4.0   7.0 |
     *                             *

The following is the 2 × 2 process grid:

B,D  |   0   | 1 
-----|-------|-----
0    |   P₀₀   |  P₀₁
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for C:

p,q  |      0       |       1
-----|--------------|--------------
     |  24.0  13.0  |   -5.0   3.0
 0   |  -3.0  -4.0  |    2.0   4.0
     |   4.0   1.0  |    2.0   5.0
-----|--------------|--------------
     |  -2.0   6.0  |   -1.0  -9.0
 1   |  -4.0  -6.0  |    5.0   5.0
     |  16.0   7.0  |   -4.0   7.0
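The output of this example can be cross-checked serially. The following Python sketch ignores the process grid and evaluates C <-- alphaAB+betaC on the global matrices with plain nested lists; the variable names are illustrative.

```python
# Global input matrices of Example 1.
A = [[ 1.0,  2.0, -1.0, -1.0,  4.0],
     [ 2.0,  0.0,  1.0,  1.0, -1.0],
     [ 1.0, -1.0, -1.0,  1.0,  2.0],
     [-3.0,  2.0,  2.0,  2.0,  0.0],
     [ 4.0,  0.0, -2.0,  1.0, -1.0],
     [-1.0, -1.0,  1.0, -3.0,  2.0]]
B = [[ 1.0, -1.0,  0.0,  2.0],
     [ 2.0,  2.0, -1.0, -2.0],
     [ 1.0,  0.0, -1.0,  1.0],
     [-3.0, -1.0,  1.0, -1.0],
     [ 4.0,  2.0, -1.0,  1.0]]
C = [[0.5] * 4 for _ in range(6)]
m, n, k = 6, 4, 5
alpha, beta = 1.0, 2.0

# Serial evaluation of C <-- alpha*A*B + beta*C.
result = [[alpha * sum(A[i][l] * B[l][j] for l in range(k)) + beta * C[i][j]
           for j in range(n)]
          for i in range(m)]

for row in result:
    print(row)
```

The rows printed by the sketch match the global output matrix C shown above.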

Example 2

This example computes C = betaC+alphaAB using a 2 × 2 process grid.

Call Statements and Input

 ORDER = 'R'
 NPROW = 2
 NPCOL = 2
 CALL BLACS_GET (0, 0, ICONTXT)
 CALL BLACS_GRIDINIT(ICONTXT, ORDER, NPROW, NPCOL)
 CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYROW, MYCOL)
 
           TRANSA TRANSB M   N   K       ALPHA       A  IA  JA   DESC_A   B  IB  JB
              |     |    |   |   |         |         |   |   |     |      |   |   |
 CALL PZGEMM('N' , 'N' , 6 , 2 , 3 , (1.0D0,0.0D0) , A , 1 , 1 , DESC_A , B , 1 , 1 ,
 
              DESC_B       BETA        C  IC  JC   DESC_C
                |            |         |   |   |     |
              DESC_B , (2.0D0,0.0D0) , C , 1 , 1 , DESC_C)

	Desc_A	Desc_B	Desc_C
DTYPE_	1	1	1
CTXT_	`icontxt` (1)	`icontxt` (1)	`icontxt` (1)
M_	6	3	6
N_	3	2	2
MB_	2	2	2
NB_	2	2	2
RSRC_	0	0	0
CSRC_	0	0	0
LLD_	See below (2)	See below (2)	See below (2)

Notes:

1. `icontxt` is the output of the BLACS_GRIDINIT call.
2. Each process should set the LLD_ as follows:

   LLD_A = MAX(1,NUMROC(M_A, MB_A, MYROW, RSRC_A, NPROW))
   LLD_B = MAX(1,NUMROC(M_B, MB_B, MYROW, RSRC_B, NPROW))
   LLD_C = MAX(1,NUMROC(M_C, MB_C, MYROW, RSRC_C, NPROW))

   In this example, LLD_A = 4 on P₀₀ and P₀₁ and LLD_A = 2 on P₁₀ and P₁₁. LLD_B = 2 on P₀₀ and LLD_B = 1 on P₁₀. LLD_C = 4 on P₀₀ and LLD_C = 2 on P₁₀.

Global general 6 × 3 matrix A with block size 2 × 2:

B,D              0                   1
     *                                      *
 0   |  (1.0,5.0)  (9.0,2.0)  |   (1.0,9.0) |
     |  (2.0,4.0)  (8.0,3.0)  |   (1.0,8.0) |
     | -----------------------|------------ |
 1   |  (3.0,3.0)  (7.0,5.0)  |   (1.0,7.0) |
     |  (4.0,2.0)  (4.0,7.0)  |   (1.0,5.0) |
     | -----------------------|------------ |
 2   |  (5.0,1.0)  (5.0,1.0)  |   (1.0,6.0) |
     |  (6.0,6.0)  (3.0,6.0)  |   (1.0,4.0) |
     *                                      *

The following is the 2 × 2 process grid:

B,D  |   0   | 1 
-----|-------|-----
0    |   P₀₀   |  P₀₁
2    |       |
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for A:

p,q  |           0            |      1
-----|------------------------|-------------
     |  (1.0,5.0)  (9.0,2.0)  |   (1.0,9.0)
     |  (2.0,4.0)  (8.0,3.0)  |   (1.0,8.0)
 0   |  (5.0,1.0)  (5.0,1.0)  |   (1.0,6.0)
     |  (6.0,6.0)  (3.0,6.0)  |   (1.0,4.0)
-----|------------------------|-------------
 1   |  (3.0,3.0)  (7.0,5.0)  |   (1.0,7.0)
     |  (4.0,2.0)  (4.0,7.0)  |   (1.0,5.0)

Global general 3 × 2 matrix B with block size 2 × 2:

B,D              0
     *                       *
 0   |  (1.0,8.0)  (2.0,7.0) |
     |  (4.0,4.0)  (6.0,8.0) |
     | --------------------- |
 1   |  (6.0,2.0)  (4.0,5.0) |
     *                       *

The following is the 2 × 2 process grid:

B,D  |   0   |-- 
-----|-------|-----
0    |   P₀₀   |  P₀₁
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for B:

p,q  |           0
-----|-----------------------
 0   |  (1.0,8.0)  (2.0,7.0)
     |  (4.0,4.0)  (6.0,8.0)
-----|-----------------------
 1   |  (6.0,2.0)  (4.0,5.0)

Global general 6 × 2 matrix C with block size 2 × 2:

B,D              0
     *                       *
 0   |  (0.5,0.0)  (0.5,0.0) |
     |  (0.5,0.0)  (0.5,0.0) |
     | --------------------- |
 1   |  (0.5,0.0)  (0.5,0.0) |
     |  (0.5,0.0)  (0.5,0.0) |
     | --------------------- |
 2   |  (0.5,0.0)  (0.5,0.0) |
     |  (0.5,0.0)  (0.5,0.0) |
     *                       *

The following is the 2 × 2 process grid:

B,D  |   0   |-- 
-----|-------|-----
0    |   P₀₀   |  P₀₁
2    |       |
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for C:

p,q  |           0
-----|-----------------------
     |  (0.5,0.0)  (0.5,0.0)
     |  (0.5,0.0)  (0.5,0.0)
 0   |  (0.5,0.0)  (0.5,0.0)
     |  (0.5,0.0)  (0.5,0.0)
-----|-----------------------
 1   |  (0.5,0.0)  (0.5,0.0)
     |  (0.5,0.0)  (0.5,0.0)

Output:

Global general 6 × 2 matrix C with block size 2 × 2:

B,D                  0
     *                               *
 0   |  (-22.0,113.0)  (-35.0,142.0) |
     |  (-19.0,114.0)  (-35.0,141.0) |
     | ----------------------------- |
 1   |  (-20.0,119.0)  (-43.0,146.0) |
     |  (-27.0,110.0)  (-58.0,131.0) |
     | ----------------------------- |
 2   |  (8.0,103.0)    (0.0,112.0)   |
     |  (-55.0,116.0)  (-75.0,135.0) |
     *                               *

The following is the 2 × 2 process grid:

B,D  |   0   |-- 
-----|-------|-----
0    |   P₀₀   |  P₀₁
2    |       |
-----|-------|-----
1    |   P₁₀   |  P₁₁

Local arrays for C:

p,q  |               0
-----|-------------------------------
     |  (-22.0,113.0)  (-35.0,142.0)
     |  (-19.0,114.0)  (-35.0,141.0)
 0   |  (8.0,103.0)    (0.0,112.0)
     |  (-55.0,116.0)  (-75.0,135.0)
-----|-------------------------------
 1   |  (-20.0,119.0)  (-43.0,146.0)
     |  (-27.0,110.0)  (-58.0,131.0)
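The complex output can be cross-checked in the same serial way. The Python sketch below evaluates C <-- alphaAB+betaC on the global matrices with Python complex numbers and then picks out the global rows held by process row 0; the variable names are illustrative, and the row selection assumes MB_C = 2, RSRC_C = 0, and two process rows, as in this example.

```python
# Global input matrices of Example 2.
A = [[1+5j, 9+2j, 1+9j],
     [2+4j, 8+3j, 1+8j],
     [3+3j, 7+5j, 1+7j],
     [4+2j, 4+7j, 1+5j],
     [5+1j, 5+1j, 1+6j],
     [6+6j, 3+6j, 1+4j]]
B = [[1+8j, 2+7j],
     [4+4j, 6+8j],
     [6+2j, 4+5j]]
C = [[0.5+0j] * 2 for _ in range(6)]
m, n, k = 6, 2, 3
alpha, beta = 1.0+0j, 2.0+0j

# Serial evaluation of C <-- alpha*A*B + beta*C.
result = [[alpha * sum(A[i][l] * B[l][j] for l in range(k)) + beta * C[i][j]
           for j in range(n)]
          for i in range(m)]

# Rows owned by process row 0: row blocks 0 and 2, that is, global rows 1, 2, 5, 6.
local_p0 = [result[i] for i in range(m) if (i // 2) % 2 == 0]

for row in result:
    print(row)
```

The rows of `result` correspond to the global output matrix C above, and `local_p0` to the local array for C on process row 0.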
