Parallel Engineering and Scientific Subroutine Library for AIX Version 2 Release 3: Guide and Reference

PDGEMM and PZGEMM--Matrix-Matrix Product for a General Matrix, Its Transpose, or Its Conjugate Transpose

PDGEMM performs any one of the following combined matrix computations:

C <-- alpha A B + beta C
C <-- alpha A B^T + beta C
C <-- alpha A^T B + beta C
C <-- alpha A^T B^T + beta C

PZGEMM performs any one of the following combined matrix computations:

C <-- alpha A B + beta C
C <-- alpha A B^T + beta C
C <-- alpha A^T B + beta C
C <-- alpha A^T B^T + beta C
C <-- alpha A^H B + beta C
C <-- alpha A^H B^T + beta C
C <-- alpha A B^H + beta C
C <-- alpha A^T B^H + beta C
C <-- alpha A^H B^H + beta C

where, in the PDGEMM and PZGEMM formulas above:

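A represents the global general submatrix:
  A(ia:ia+m-1, ja:ja+k-1), if transa = 'N'.
  A(ia:ia+k-1, ja:ja+m-1), if transa = 'T' or 'C'.
B represents the global general submatrix:
  B(ib:ib+k-1, jb:jb+n-1), if transb = 'N'.
  B(ib:ib+n-1, jb:jb+k-1), if transb = 'T' or 'C'.
C represents the global general submatrix C(ic:ic+m-1, jc:jc+n-1).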
alpha and beta are scalars.

Note:
No data should be moved to form A^T, A^H, B^T, or B^H; that is, the A and B matrices should always be stored in their untransposed forms.

In the following four cases, no computation is performed and the subroutine returns after doing some parameter checking:

Assuming the above conditions do not exist, if beta is not one and k is 0, then betaC is returned.
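
The combined computations listed above can be modeled sequentially. The following is a minimal sketch in Python, assuming small matrices held as lists of rows with m, n, and k all at least 1; the names op and gemm_reference are hypothetical and are not part of Parallel ESSL. It illustrates only the arithmetic, not the block-cyclic distribution or the parameter checking done by PDGEMM and PZGEMM.

```python
def op(x, trans):
    """Return a copy of x ('N'), its transpose ('T'), or its conjugate transpose ('C')."""
    t = trans.upper()
    if t == "N":
        return [list(row) for row in x]
    if t == "T":
        return [list(col) for col in zip(*x)]
    if t == "C":
        return [[v.conjugate() for v in col] for col in zip(*x)]
    raise ValueError("trans must be 'N', 'T', or 'C'")


def gemm_reference(transa, transb, alpha, a, b, beta, c):
    """Sequential model of C <-- alpha op(A) op(B) + beta C; returns a new matrix."""
    opa, opb = op(a, transa), op(b, transb)
    m, n, k = len(c), len(c[0]), len(opb)
    if len(opa) != m or len(opa[0]) != k or len(opb[0]) != n:
        raise ValueError("op(A) must be m by k and op(B) must be k by n")
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = sum(opa[i][l] * opb[l][j] for l in range(k))
            # When beta is zero, the input values of C are not used.
            row.append(alpha * acc + (beta * c[i][j] if beta != 0 else 0))
        result.append(row)
    return result
```

For real data, 'C' gives the same result as 'T' in this sketch, because the conjugate of a real number is the number itself.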

See references [14] and [15].

Table 45. Data Types

A, B, C, alpha, beta   | Subroutine
Long-precision real    | PDGEMM
Long-precision complex | PZGEMM

Syntax

Fortran CALL PDGEMM | PZGEMM (transa, transb, m, n, k, alpha, a, ia, ja, desc_a, b, ib, jb, desc_b, beta, c, ic, jc, desc_c)
C and C++ pdgemm | pzgemm (transa, transb, m, n, k, alpha, a, ia, ja, desc_a, b, ib, jb, desc_b, beta, c, ic, jc, desc_c);

On Entry

transa
indicates the form of matrix A to use in the computation, where:

If transa = 'N', A is used in the computation.

If transa = 'T', A^T is used in the computation.

If transa = 'C', A^H is used in the computation.

Scope: global

Specified as: a single character; transa = 'N', 'T', or 'C'

transb
indicates the form of matrix B to use in the computation, where:

If transb = 'N', B is used in the computation.

If transb = 'T', B^T is used in the computation.

If transb = 'C', B^H is used in the computation.

Scope: global

Specified as: a single character; transb = 'N', 'T', or 'C'

m
is the number of rows in submatrix C used in the computation, and:

If transa = 'N', it is the number of rows in submatrix A.

If transa = 'T' or 'C', it is the number of columns in submatrix A.

Scope: global

Specified as: a fullword integer; m >= 0.

n
is the number of columns in submatrix C used in the computation, and:

If transb = 'N', it is the number of columns in submatrix B.

If transb = 'T' or 'C', it is the number of rows in submatrix B.

Scope: global

Specified as: a fullword integer; n >= 0.

k
has the following meaning:

If transa = 'N', it is the number of columns in submatrix A.

If transa = 'T' or 'C', it is the number of rows in submatrix A.

In addition:

If transb = 'N', it is the number of rows in submatrix B.

If transb = 'T' or 'C', it is the number of columns in submatrix B.

Scope: global

Specified as: a fullword integer; k >= 0.

alpha
is the scalar alpha.

Scope: global

Specified as: a number of the data type indicated in Table 45.

a
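is the local part of the global general matrix A. This identifies the first element of the local array A. This subroutine computes the location of the first element of the local subarray used, based on ia, ja, desc_a, p, q, myrow, and mycol; therefore:

If transa = 'N', the leading LOCp(ia+m-1) by LOCq(ja+k-1) part of the local array A must contain the local pieces of the leading ia+m-1 by ja+k-1 part of the global matrix.

If transa = 'T' or 'C', the leading LOCp(ia+k-1) by LOCq(ja+m-1) part of the local array A must contain the local pieces of the leading ia+k-1 by ja+m-1 part of the global matrix.

Note:
No data should be moved to form A^T or A^H; that is, the matrix A should always be stored in its untransposed form.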

Scope: local

Specified as: an LLD_A by (at least) LOCq(N_A) array, containing numbers of the data type indicated in Table 45. Details about the block-cyclic data distribution of global matrix A are stored in desc_a.

ia
is the row index of the global matrix A, identifying the first row of the submatrix A.

Scope: global

Specified as: a fullword integer; 1 <= ia <= M_A, and:

If transa = 'N', then ia+m-1 <= M_A.

If transa = 'T' or 'C', then ia+k-1 <= M_A.

ja
is the column index of the global matrix A, identifying the first column of the submatrix A.

Scope: global

Specified as: a fullword integer; 1 <= ja <= N_A, and:

If transa = 'N', then ja+k-1 <= N_A.

If transa = 'T' or 'C', then ja+m-1 <= N_A.
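
The limits on ia and ja above can be checked before the call. The following is a minimal sketch, assuming the nondegenerate case (m and k both nonzero); the function name a_indexes_valid is hypothetical and only restates the inequalities given for ia and ja.

```python
def a_indexes_valid(transa, m, k, ia, ja, m_a, n_a):
    """Check that the submatrix of A selected by ia and ja fits in the global M_A by N_A matrix."""
    # Submatrix A is m by k when transa = 'N' and k by m when transa = 'T' or 'C'.
    rows, cols = (m, k) if transa.upper() == "N" else (k, m)
    return (1 <= ia <= m_a and 1 <= ja <= n_a
            and ia + rows - 1 <= m_a and ja + cols - 1 <= n_a)
```

The same check applies to B by substituting (k, n) for (m, k) and the B descriptor values for M_A and N_A.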

desc_a
is the array descriptor for global matrix A, described in the following table:
desc_a | Name | Description | Limits | Scope
1 | DTYPE_A | Descriptor type | DTYPE_A=1 | Global
2 | CTXT_A | BLACS context | Valid value, as returned by BLACS_GRIDINIT or BLACS_GRIDMAP | Global
3 | M_A | Number of rows in the global matrix | If m = 0 or k = 0: M_A >= 0; otherwise: M_A >= 1 | Global
4 | N_A | Number of columns in the global matrix | If m = 0 or k = 0: N_A >= 0; otherwise: N_A >= 1 | Global
5 | MB_A | Row block size | MB_A >= 1 | Global
6 | NB_A | Column block size | NB_A >= 1 | Global
7 | RSRC_A | The process row of the p × q grid over which the first row of the global matrix is distributed | 0 <= RSRC_A < p | Global
8 | CSRC_A | The process column of the p × q grid over which the first column of the global matrix is distributed | 0 <= CSRC_A < q | Global
9 | LLD_A | The leading dimension of the local array | LLD_A >= max(1,LOCp(M_A)) | Local

Specified as: an array of (at least) length 9, containing fullword integers.

b
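is the local part of the global general matrix B. This identifies the first element of the local array B. This subroutine computes the location of the first element of the local subarray used, based on ib, jb, desc_b, p, q, myrow, and mycol; therefore:

If transb = 'N', the leading LOCp(ib+k-1) by LOCq(jb+n-1) part of the local array B must contain the local pieces of the leading ib+k-1 by jb+n-1 part of the global matrix.

If transb = 'T' or 'C', the leading LOCp(ib+n-1) by LOCq(jb+k-1) part of the local array B must contain the local pieces of the leading ib+n-1 by jb+k-1 part of the global matrix.

Note:
No data should be moved to form B^T or B^H; that is, the matrix B should always be stored in its untransposed form.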

Scope: local

Specified as: an LLD_B by (at least) LOCq(N_B) array, containing numbers of the data type indicated in Table 45. Details about the block-cyclic data distribution of global matrix B are stored in desc_b.

ib
is the row index of the global matrix B, identifying the first row of the submatrix B.

Scope: global

Specified as: a fullword integer; 1 <= ib <= M_B, and:

If transb = 'N', then ib+k-1 <= M_B.

If transb = 'T' or 'C', then ib+n-1 <= M_B.

jb
is the column index of the global matrix B, identifying the first column of the submatrix B.

Scope: global

Specified as: a fullword integer; 1 <= jb <= N_B, and:

If transb = 'N', then jb+n-1 <= N_B.

If transb = 'T' or 'C', then jb+k-1 <= N_B.

desc_b
is the array descriptor for global matrix B, described in the following table:
desc_b | Name | Description | Limits | Scope
1 | DTYPE_B | Descriptor type | DTYPE_B=1 | Global
2 | CTXT_B | BLACS context | Valid value, as returned by BLACS_GRIDINIT or BLACS_GRIDMAP | Global
3 | M_B | Number of rows in the global matrix | If k = 0 or n = 0: M_B >= 0; otherwise: M_B >= 1 | Global
4 | N_B | Number of columns in the global matrix | If k = 0 or n = 0: N_B >= 0; otherwise: N_B >= 1 | Global
5 | MB_B | Row block size | MB_B >= 1 | Global
6 | NB_B | Column block size | NB_B >= 1 | Global
7 | RSRC_B | The process row of the p × q grid over which the first row of the global matrix is distributed | 0 <= RSRC_B < p | Global
8 | CSRC_B | The process column of the p × q grid over which the first column of the global matrix is distributed | 0 <= CSRC_B < q | Global
9 | LLD_B | The leading dimension of the local array | LLD_B >= max(1,LOCp(M_B)) | Local

Specified as: an array of (at least) length 9, containing fullword integers.

beta
is the scalar beta.

Scope: global

Specified as: a number of the data type indicated in Table 45.

c
is the local part of the global general matrix C. This identifies the first element of the local array C. This subroutine computes the location of the first element of the local subarray used, based on ic, jc, desc_c, p, q, myrow, and mycol; therefore, the leading LOCp(ic+m-1) by LOCq(jc+n-1) part of the local array C must contain the local pieces of the leading ic+m-1 by jc+n-1 part of the global matrix.

When beta is zero, C need not be set on input.

Scope: local

Specified as: an LLD_C by (at least) LOCq(N_C) array, containing numbers of the data type indicated in Table 45. Details about the block-cyclic data distribution of global matrix C are stored in desc_c.

ic
is the row index of the global matrix C, identifying the first row of the submatrix C.

Scope: global

Specified as: a fullword integer; 1 <= ic <= M_C and ic+m-1 <= M_C.

jc
is the column index of the global matrix C, identifying the first column of the submatrix C.

Scope: global

Specified as: a fullword integer; 1 <= jc <= N_C and jc+n-1 <= N_C.

desc_c
is the array descriptor for global matrix C, described in the following table:
desc_c | Name | Description | Limits | Scope
1 | DTYPE_C | Descriptor type | DTYPE_C=1 | Global
2 | CTXT_C | BLACS context | Valid value, as returned by BLACS_GRIDINIT or BLACS_GRIDMAP | Global
3 | M_C | Number of rows in the global matrix | If m = 0 or n = 0: M_C >= 0; otherwise: M_C >= 1 | Global
4 | N_C | Number of columns in the global matrix | If m = 0 or n = 0: N_C >= 0; otherwise: N_C >= 1 | Global
5 | MB_C | Row block size | MB_C >= 1 | Global
6 | NB_C | Column block size | NB_C >= 1 | Global
7 | RSRC_C | The process row of the p × q grid over which the first row of the global matrix is distributed | 0 <= RSRC_C < p | Global
8 | CSRC_C | The process column of the p × q grid over which the first column of the global matrix is distributed | 0 <= CSRC_C < q | Global
9 | LLD_C | The leading dimension of the local array | LLD_C >= max(1,LOCp(M_C)) | Local

Specified as: an array of (at least) length 9, containing fullword integers.

On Return

c
is the updated local part of the global matrix C, containing the results of the computation.

Scope: local

Returned as: an LLD_C by (at least) LOCq(N_C) array, containing numbers of the data type indicated in Table 45.

Notes and Coding Rules
  1. These subroutines accept lowercase letters for the transa and transb arguments.
  2. For PDGEMM, if you specify 'C' for the transa or transb argument, it is interpreted as though you specified 'T'.
  3. The matrices must have no common elements; otherwise, results are unpredictable.
  4. The NUMROC utility subroutine can be used to determine the values of LOCp(M_) and LOCq(N_) used in the argument descriptions above. For details, see Determining the Number of Rows and Columns in Your Local Arrays and NUMROC--Compute the Number of Rows or Columns of a Block-Cyclically Distributed Matrix Contained in a Process.
  5. For suggested block sizes, see Coding Tips for Optimizing Parallel Performance.
  6. The following values must be equal: CTXT_A = CTXT_B = CTXT_C.
  7. The coding rules described in this note depend upon which matrix--A, B, or C--is used as the reference matrix, which is referred to, in general, as matrix X. For each of the three possible selections for the reference matrix, there is a unique set of coding rules that must be met. These are detailed in Table 46 and Table 47. Follow these steps to select a reference matrix and determine what coding rules to use:

    Step 1: First, the reference matrix is selected. For optimal performance, the reference matrix is selected based on the arguments m, n, and k, as follows:

    If k <= min(m, n), then X = C
    If n <= min(m, k), then X = A
    If m <= min(n, k), then X = B

    The matrix selected must satisfy coding rules a, c, and d, described below, to be a suitable reference matrix. If it does, go to Step 2. If it does not, the subroutine checks whether either of the other two matrices satisfies coding rules a, c, and d, which would make it a suitable reference matrix. If one of them is suitable, go to Step 2. If neither matrix is suitable, an error condition results.

    Step 2: After a suitable reference matrix is chosen in Step 1, all remaining coding rules, described below, are checked. If the rules are satisfied, the subroutine continues normally. If they are not, an error condition results.

    Coding Rules: Following are the coding rules:

    a. The reference matrix must be aligned on a block boundary; that is:
      ix-1 must be a multiple of MB_X.
      jx-1 must be a multiple of NB_X.

      These indexes are indicated in column 5 of Table 46 for each entry for X.

    b. The block sizes that must be equal are indicated in column 4 of Table 46 for each entry for X. The rules for block sizes depend only upon the values of transa and transb, and not on the reference matrix selected; however, for your convenience, the rules are repeated in the table for each reference matrix.
    c. Given the reference matrix X, additional rules apply to the block row and block column offsets of the two nonreference matrices. These rules are listed in column 7 of Table 46 for each entry for X. These rules need to be met only when looping is required, that is, when either of the conditions in column 8 is met.
    d. The indexes of the nonreference matrices, which need to be on a block boundary, are listed in column 6 of Table 46 for each entry for X.

      Table 46. Coding Rules for the Reference Matrix X

      -1- X | -2- transa | -3- transb | -4- (b) Equal Block Sizes | -5- (a) Block Bndry For X | -6- (d) Block Bndry For Other | -7- (c) Equal Block Offsets (If Looping is Required) | -8- (c) Conditions For Looping
      A | 'N' | 'N' | MB_A = MB_C; NB_B = NB_C; NB_A = MB_B | ia, ja | ib, ic | mod(jb-1, NB_B) = mod(jc-1, NB_C) | n+mod(jb-1, NB_B) > NB_B -or- n+mod(jc-1, NB_C) > NB_C
      A | 'N' | 'T' or 'C' | MB_A = MB_C; MB_B = NB_C; NB_A = NB_B | ia, ja | jb, ic | mod(ib-1, MB_B) = mod(jc-1, NB_C) | n+mod(ib-1, MB_B) > MB_B -or- n+mod(jc-1, NB_C) > NB_C
      A | 'T' or 'C' | 'N' | NB_A = MB_C; NB_B = NB_C; MB_A = MB_B | ia, ja | ib, ic | mod(jb-1, NB_B) = mod(jc-1, NB_C) | n+mod(jb-1, NB_B) > NB_B -or- n+mod(jc-1, NB_C) > NB_C
      A | 'T' or 'C' | 'T' or 'C' | NB_A = MB_C; MB_B = NB_C; MB_A = NB_B | ia, ja | jb, ic | mod(ib-1, MB_B) = mod(jc-1, NB_C) | n+mod(ib-1, MB_B) > MB_B -or- n+mod(jc-1, NB_C) > NB_C
      B | 'N' | 'N' | MB_A = MB_C; NB_B = NB_C; NB_A = MB_B | ib, jb | ja, jc | mod(ia-1, MB_A) = mod(ic-1, MB_C) | m+mod(ia-1, MB_A) > MB_A -or- m+mod(ic-1, MB_C) > MB_C
      B | 'N' | 'T' or 'C' | MB_A = MB_C; MB_B = NB_C; NB_A = NB_B | ib, jb | ja, jc | mod(ia-1, MB_A) = mod(ic-1, MB_C) | m+mod(ia-1, MB_A) > MB_A -or- m+mod(ic-1, MB_C) > MB_C
      B | 'T' or 'C' | 'N' | NB_A = MB_C; NB_B = NB_C; MB_A = MB_B | ib, jb | ia, jc | mod(ja-1, NB_A) = mod(ic-1, MB_C) | m+mod(ja-1, NB_A) > NB_A -or- m+mod(ic-1, MB_C) > MB_C
      B | 'T' or 'C' | 'T' or 'C' | NB_A = MB_C; MB_B = NB_C; MB_A = NB_B | ib, jb | ia, jc | mod(ja-1, NB_A) = mod(ic-1, MB_C) | m+mod(ja-1, NB_A) > NB_A -or- m+mod(ic-1, MB_C) > MB_C
      C | 'N' | 'N' | MB_A = MB_C; NB_B = NB_C; NB_A = MB_B | ic, jc | ia, jb | mod(ja-1, NB_A) = mod(ib-1, MB_B) | k+mod(ja-1, NB_A) > NB_A -or- k+mod(ib-1, MB_B) > MB_B
      C | 'N' | 'T' or 'C' | MB_A = MB_C; MB_B = NB_C; NB_A = NB_B | ic, jc | ia, ib | mod(ja-1, NB_A) = mod(jb-1, NB_B) | k+mod(ja-1, NB_A) > NB_A -or- k+mod(jb-1, NB_B) > NB_B
      C | 'T' or 'C' | 'N' | NB_A = MB_C; NB_B = NB_C; MB_A = MB_B | ic, jc | ja, jb | mod(ia-1, MB_A) = mod(ib-1, MB_B) | k+mod(ia-1, MB_A) > MB_A -or- k+mod(ib-1, MB_B) > MB_B
      C | 'T' or 'C' | 'T' or 'C' | NB_A = MB_C; MB_B = NB_C; MB_A = NB_B | ic, jc | ja, ib | mod(ia-1, MB_A) = mod(jb-1, NB_B) | k+mod(ia-1, MB_A) > MB_A -or- k+mod(jb-1, NB_B) > NB_B
    e. Additional rules apply to the row and column alignment of the various matrices in the process grid; specifically, the process row or process column containing the first row or column of the reference submatrix X, respectively, must also contain the first row or column of one of the other two nonreference submatrices, as indicated in column 4 of Table 47 for each entry for X. Following is the definition of ixrow and ixcol, which holds true for A, B, and C:
      ixrow = mod((((ix-1)/MB_X)+RSRC_X), p)
      ixcol = mod((((jx-1)/NB_X)+CSRC_X), q)


      Table 47. Coding Rules for the Reference Matrix X

      -1- X | -2- transa | -3- transb | -4- (e) Process Grid Alignment
      A | 'N' | 'N' | iarow = icrow
      A | 'N' | 'T' or 'C' | iarow = icrow; ibcol = iacol
      A | 'T' or 'C' | 'N' | iarow = ibrow
      A | 'T' or 'C' | 'T' or 'C' | (no rules)
      B | 'N' | 'N' | ibcol = iccol
      B | 'N' | 'T' or 'C' | ibcol = iacol
      B | 'T' or 'C' | 'N' | iarow = ibrow; ibcol = iccol
      B | 'T' or 'C' | 'T' or 'C' | (no rules)
      C | 'N' | 'N' | iarow = icrow; ibcol = iccol
      C | 'N' | 'T' or 'C' | iarow = icrow
      C | 'T' or 'C' | 'N' | ibcol = iccol
      C | 'T' or 'C' | 'T' or 'C' | (no rules)

    Example: Following is an example of the coding rules necessary for the case where transa = 'N' and transb = 'N', where the reference matrix selected is A. Following are the indexes, dimensions, and block sizes used in the computation for the matrices:



    Indexes:        ic  jc                   ia  ja        ib  jb                ic  jc
                     |   |                    |   |         |   |                 |   |
    Dimensions:  C ( m , n )  <--  alpha  A ( m , k )   B ( k , n )  +  beta  C ( m , n )
                     |   |                    |   |         |   |                 |   |
    Block Sizes:   MB_C NB_C                MB_A NB_A     MB_B NB_B             MB_C NB_C
    
    a. A must be aligned on a block boundary, as indicated in column 5 in Table 46:
      ia-1 must be a multiple of MB_A.
      ja-1 must be a multiple of NB_A.
    b. The block sizes that correspond to each matrix dimension must be equal, where MB_ represents the row dimension and NB_ represents the column dimension, as indicated in column 4 in Table 46:
      MB_A = MB_C
      NB_B = NB_C
      NB_A = MB_B
    c. As shown above, m and k are the dimensions of the reference matrix A; therefore, n is used to determine if looping is required; that is, if one of the following is true, as indicated in column 8 in Table 46:
      n+mod(jc-1, NB_C) > NB_C
      n+mod(jb-1, NB_B) > NB_B

      then the following offsets must be equal, as indicated in column 7 in Table 46:

      mod(jb-1, NB_B) = mod(jc-1, NB_C)
    d. The other indexes from each of the nonreference matrices--not used in c above--must be aligned on a block boundary, as indicated in column 6 in Table 46:
      ic-1 must be a multiple of MB_C.
      ib-1 must be a multiple of MB_B.
    e. In the process grid, the process row containing the first row of the submatrix A must also contain the first row of the submatrix C, as indicated in column 4 in Table 47; that is, iarow = icrow, where:
      iarow = mod((((ia-1)/MB_A)+RSRC_A), p)
      icrow = mod((((ic-1)/MB_C)+RSRC_C), p)
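
The selection in Step 1 and the index and block-size rules of the example above can be sketched in Python. This is a minimal illustration, assuming transa = 'N' and transb = 'N' with A as the reference matrix; the function names are hypothetical, the order in which ties are resolved in preferred_reference simply follows the order of the three conditions as listed and is an assumption, and the process grid alignment rule from Table 47 is left out.

```python
def preferred_reference(m, n, k):
    """Pick the reference matrix from m, n, and k, testing the conditions in the listed order."""
    if k <= min(m, n):
        return "C"
    if n <= min(m, k):
        return "A"
    return "B"  # remaining case: m <= min(n, k)


def failed_rules_nn_ref_a(n, ia, ja, ib, jb, ic, jc,
                          mb_a, nb_a, mb_b, nb_b, mb_c, nb_c):
    """Return the letters of the index and block-size rules (a-d) that are violated."""
    failed = []
    # Reference matrix A must start on a block boundary.
    if (ia - 1) % mb_a != 0 or (ja - 1) % nb_a != 0:
        failed.append("a")
    # Block sizes matching each matrix dimension must be equal.
    if not (mb_a == mb_c and nb_b == nb_c and nb_a == mb_b):
        failed.append("b")
    # Offsets must be equal, but only when looping is required.
    looping = (n + (jb - 1) % nb_b > nb_b) or (n + (jc - 1) % nb_c > nb_c)
    if looping and (jb - 1) % nb_b != (jc - 1) % nb_c:
        failed.append("c")
    # Remaining indexes of the nonreference matrices must be on a block boundary.
    if (ic - 1) % mb_c != 0 or (ib - 1) % mb_b != 0:
        failed.append("d")
    return failed
```

An empty list means that the four rules checked here are met; any letters returned identify the rules to revisit.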

Error Conditions

Computational Errors

None

Resource Errors

Unable to allocate work space

Input-Argument and Miscellaneous Errors

Stage 1 

  1. DTYPE_A is invalid.
  2. DTYPE_B is invalid.
  3. DTYPE_C is invalid.

Stage 2 

  1. CTXT_A is invalid.

Stage 3 

  1. The subroutine was called from outside the process grid.

Stage 4 

  1. transa <> 'N', 'T', or 'C'
  2. transb <> 'N', 'T', or 'C'
  3. m < 0
  4. n < 0
  5. k < 0
  6. M_A < 0 and (m = 0 or k = 0); M_A < 1 otherwise
  7. N_A < 0 and (m = 0 or k = 0); N_A < 1 otherwise
  8. M_B < 0 and (k = 0 or n = 0); M_B < 1 otherwise
  9. N_B < 0 and (k = 0 or n = 0); N_B < 1 otherwise
  10. M_C < 0 and (m = 0 or n = 0); M_C < 1 otherwise
  11. N_C < 0 and (m = 0 or n = 0); N_C < 1 otherwise
  12. ia < 1
  13. ib < 1
  14. ic < 1
  15. ja < 1
  16. jb < 1
  17. jc < 1
  18. MB_A < 1
  19. MB_B < 1
  20. MB_C < 1
  21. NB_A < 1
  22. NB_B < 1
  23. NB_C < 1
  24. RSRC_A < 0 or RSRC_A >= p
  25. RSRC_B < 0 or RSRC_B >= p
  26. RSRC_C < 0 or RSRC_C >= p
  27. CSRC_A < 0 or CSRC_A >= q
  28. CSRC_B < 0 or CSRC_B >= q
  29. CSRC_C < 0 or CSRC_C >= q
  30. CTXT_A <> CTXT_B
  31. CTXT_A <> CTXT_C

Stage 5 

    If m <> 0 and k <> 0:

  1. transa = 'N' and ia+m-1 > M_A
  2. transa = 'T' or 'C' and ia+k-1 > M_A
  3. transa = 'N' and ja+k-1 > N_A
  4. transa = 'T' or 'C' and ja+m-1 > N_A
  5. ia > M_A
  6. ja > N_A

    If n <> 0 and k <> 0:

  7. transb = 'N' and ib+k-1 > M_B
  8. transb = 'T' or 'C' and ib+n-1 > M_B
  9. transb = 'N' and jb+n-1 > N_B
  10. transb = 'T' or 'C' and jb+k-1 > N_B
  11. ib > M_B
  12. jb > N_B

    If m <> 0 and n <> 0:

  13. ic+m-1 > M_C
  14. jc+n-1 > N_C
  15. ic > M_C
  16. jc > N_C
  17. For the reference matrix (defined in note 7 in Notes and Coding Rules) and the appropriate transa and transb values, the indexes listed in column 5 of Table 46 are not aligned on a block boundary, where boundary alignment is defined as:
    ix-1 must be a multiple of MB_X.
    jx-1 must be a multiple of NB_X.
  18. For the two nonreference matrices (defined in note 7 in Notes and Coding Rules) and the appropriate transa and transb values, the indexes listed in column 6 of Table 46 are not aligned on a block boundary. Using Z to represent one of the nonreference matrices, each boundary alignment is expressed as one of the following:
    iz-1 must be a multiple of MB_Z.
    jz-1 must be a multiple of NB_Z.
  19. For the reference matrix (defined in note 7 in Notes and Coding Rules) and the appropriate transa and transb values, if looping occurs--that is, one of the conditions in column 8 of Table 46 is true--then the block offsets indicated in column 7 are not equal.

Stage 6 

  1. For the appropriate transa and transb values indicated in Table 46 (where the reference matrix does not matter), some of the block sizes indicated in column 4 are not equal.
  2. LLD_A < max(1, LOCp(M_A))
  3. LLD_B < max(1, LOCp(M_B))
  4. LLD_C < max(1, LOCp(M_C))
  5. In the process grid, the process row or process column containing the first row or column of the reference submatrix X (defined in note 7 in Notes and Coding Rules), respectively, does not contain the first row or column of one of the other two nonreference submatrices, as indicated in column 4 of Table 47. Following is the definition of ixrow and ixcol, which holds true for A, B, and C:
    ixrow = mod((((ix-1)/MB_X)+RSRC_X), p)
    ixcol = mod((((jx-1)/NB_X)+CSRC_X), q)
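
The ixrow and ixcol definitions can be evaluated directly. The following is a minimal sketch, assuming that the division in the formulas is integer division (as it would be for Fortran integers); the function names are hypothetical, and the last function covers only the alignment listed in Table 47 for transa = 'N', transb = 'N' with A as the reference matrix.

```python
def grid_row(ix, mb_x, rsrc_x, p):
    """Process row that holds global row ix: mod(((ix-1)/MB_X)+RSRC_X, p)."""
    return ((ix - 1) // mb_x + rsrc_x) % p


def grid_col(jx, nb_x, csrc_x, q):
    """Process column that holds global column jx: mod(((jx-1)/NB_X)+CSRC_X, q)."""
    return ((jx - 1) // nb_x + csrc_x) % q


def aligned_nn_ref_a(ia, mb_a, rsrc_a, ic, mb_c, rsrc_c, p):
    """Table 47 entry for X = A, transa = 'N', transb = 'N': iarow = icrow."""
    return grid_row(ia, mb_a, rsrc_a, p) == grid_row(ic, mb_c, rsrc_c, p)
```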

Example 1

This example computes C = betaC+alphaAB using a 2 × 2 process grid.

Call Statements and Input


 ORDER = 'R'
 NPROW = 2
 NPCOL = 2
 CALL BLACS_GET (0, 0, ICONTXT)
 CALL BLACS_GRIDINIT(ICONTXT, ORDER, NPROW, NPCOL)
 CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYROW, MYCOL)
 
            TRANSA TRANSB  M    N    K   ALPHA    A  IA  JA   DESC_A   B  IB  JB
               |      |    |    |    |     |      |   |   |     |      |   |   |
 CALL PDGEMM( 'N' ,  'N' , 6  , 4  , 5 , 1.0D0  , A , 1 , 1 , DESC_A , B , 1 , 1 ,
 
              DESC_B    BETA    C  IC  JC   DESC_C
                |         |     |   |   |     |
              DESC_B ,  2.0D0 , C , 1 , 1 , DESC_C )


       | Desc_A | Desc_B | Desc_C
DTYPE_ | 1 | 1 | 1
CTXT_  | icontxt (see note 1) | icontxt (see note 1) | icontxt (see note 1)
M_     | 6 | 5 | 6
N_     | 5 | 4 | 4
MB_    | 3 | 2 | 3
NB_    | 2 | 2 | 2
RSRC_  | 0 | 0 | 0
CSRC_  | 0 | 0 | 0
LLD_   | See note 2 | See note 2 | See note 2

Notes:

  1. icontxt is the output of the BLACS_GRIDINIT call.

  2. Each process should set the LLD_ as follows:
    LLD_A = MAX(1,NUMROC(M_A, MB_A, MYROW, RSRC_A, NPROW))
    LLD_B = MAX(1,NUMROC(M_B, MB_B, MYROW, RSRC_B, NPROW))
    LLD_C = MAX(1,NUMROC(M_C, MB_C, MYROW, RSRC_C, NPROW))
    

    In this example, LLD_A = LLD_C = 3 on all processes, and LLD_B = 3 on P00 and P01 and LLD_B = 2 on P10 and P11.
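
The LLD_ values quoted above can be reproduced with a small model of NUMROC. The following is a sketch in Python, assuming the block-cyclic counting logic commonly published for the ScaLAPACK NUMROC tool routine; it is not the Parallel ESSL source, and the lowercase name numroc is used only for illustration.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows (or columns) of an n-long dimension, in blocks of nb, owned by process iproc."""
    # Distance of this process from the process that owns the first block.
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # whole blocks every process receives
    extra = nblocks % nprocs               # leftover whole blocks
    if mydist < extra:
        count += nb                        # this process gets one extra whole block
    elif mydist == extra:
        count += n % nb                    # this process gets the final partial block
    return count


# LLD_B for Example 1 (M_B = 5, MB_B = 2, RSRC_B = 0, NPROW = 2):
lld_b = [max(1, numroc(5, 2, myrow, 0, 2)) for myrow in range(2)]
```

Here lld_b holds the value for process row 0 followed by the value for process row 1.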

Global general 6 × 5 matrix A with block size 3 × 2:

B,D        0             1          2
     *                                  *
     |  1.0  2.0  |  -1.0 -1.0  |   4.0 |
 0   |  2.0  0.0  |   1.0  1.0  |  -1.0 |
     |  1.0 -1.0  |  -1.0  1.0  |   2.0 |
     | -----------|-------------|------ |
     | -3.0  2.0  |   2.0  2.0  |   0.0 |
 1   |  4.0  0.0  |  -2.0  1.0  |  -1.0 |
     | -1.0 -1.0  |   1.0 -3.0  |   2.0 |
     *                                  *

The following is the 2 × 2 process grid:

B,D  |   0 2   |  1  
-----| ------- |-----
0    |   P00   |  P01
-----| ------- |-----
1    |   P10   |  P11

Local arrays for A:

p,q  |       0         |      1
-----|-----------------|------------
     |  1.0  2.0  4.0  |  -1.0 -1.0
 0   |  2.0  0.0 -1.0  |   1.0  1.0
     |  1.0 -1.0  2.0  |  -1.0  1.0
-----|-----------------|------------
     | -3.0  2.0  0.0  |   2.0  2.0
 1   |  4.0  0.0 -1.0  |  -2.0  1.0
     | -1.0 -1.0  2.0  |   1.0 -3.0
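
The local arrays for A shown above follow from the block-cyclic mapping of the global matrix onto the process grid. The following is a minimal sketch, assuming 0-based Python indexes and the global matrix held as a list of rows; the names A_GLOBAL and local_part are hypothetical.

```python
# Global 6 x 5 matrix A from this example.
A_GLOBAL = [
    [1.0, 2.0, -1.0, -1.0, 4.0],
    [2.0, 0.0, 1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0, 1.0, 2.0],
    [-3.0, 2.0, 2.0, 2.0, 0.0],
    [4.0, 0.0, -2.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0, -3.0, 2.0],
]


def local_part(glob, mb, nb, myrow, mycol, nprow, npcol, rsrc=0, csrc=0):
    """Rows and columns of glob owned by process (myrow, mycol) under block-cyclic distribution."""
    # Block row i // mb is assigned to process row (block + rsrc) mod nprow; same for columns.
    rows = [i for i in range(len(glob)) if (i // mb + rsrc) % nprow == myrow]
    cols = [j for j in range(len(glob[0])) if (j // nb + csrc) % npcol == mycol]
    return [[glob[i][j] for j in cols] for i in rows]


# Local array on P00 with MB_A = 3 and NB_A = 2 on the 2 x 2 grid.
local_p00 = local_part(A_GLOBAL, 3, 2, 0, 0, 2, 2)
```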

Global general 5 × 4 matrix B with block size 2 × 2:

B,D        0             1
     *                         *
 0   |  1.0 -1.0  |   0.0  2.0 |
     |  2.0  2.0  |  -1.0 -2.0 |
     | -----------|----------- |
 1   |  1.0  0.0  |  -1.0  1.0 |
     | -3.0 -1.0  |   1.0 -1.0 |
     | -----------|----------- |
 2   |  4.0  2.0  |  -1.0  1.0 |
     *                         *

The following is the 2 × 2 process grid:

B,D  |    0    |  1  
-----| ------- |-----
0    |   P00   |  P01
2    |         |
-----| ------- |-----
1    |   P10   |  P11

Local arrays for B:

p,q  |     0      |      1
-----|------------|------------
     |  1.0 -1.0  |   0.0  2.0
 0   |  2.0  2.0  |  -1.0 -2.0
     |  4.0  2.0  |  -1.0  1.0
-----|------------|------------
 1   |  1.0  0.0  |  -1.0  1.0
     | -3.0 -1.0  |   1.0 -1.0

Global general 6 × 4 matrix C with block size 3 × 2:

B,D        0             1
     *                         *
     |  0.5  0.5  |   0.5  0.5 |
 0   |  0.5  0.5  |   0.5  0.5 |
     |  0.5  0.5  |   0.5  0.5 |
     | -----------|----------- |
     |  0.5  0.5  |   0.5  0.5 |
 1   |  0.5  0.5  |   0.5  0.5 |
     |  0.5  0.5  |   0.5  0.5 |
     *                         *

The following is the 2 × 2 process grid:

B,D  |    0    |  1  
-----| ------- |-----
0    |   P00   |  P01
-----| ------- |-----
1    |   P10   |  P11

Local arrays for C:

p,q  |     0      |      1
-----|------------|------------
     |  0.5  0.5  |   0.5  0.5
 0   |  0.5  0.5  |   0.5  0.5
     |  0.5  0.5  |   0.5  0.5
-----|------------|------------
     |  0.5  0.5  |   0.5  0.5
 1   |  0.5  0.5  |   0.5  0.5
     |  0.5  0.5  |   0.5  0.5

Output:

Global general 6 × 4 matrix C with block size 3 × 2:

B,D         0               1
     *                             *
     |  24.0  13.0  |   -5.0   3.0 |
 0   |  -3.0  -4.0  |    2.0   4.0 |
     |   4.0   1.0  |    2.0   5.0 |
     | -------------|------------- |
     |  -2.0   6.0  |   -1.0  -9.0 |
 1   |  -4.0  -6.0  |    5.0   5.0 |
     |  16.0   7.0  |   -4.0   7.0 |
     *                             *

The following is the 2 × 2 process grid:

B,D  |    0    |  1  
-----| ------- |-----
0    |   P00   |  P01
-----| ------- |-----
1    |   P10   |  P11

Local arrays for C:

p,q  |      0       |       1
-----|--------------|--------------
     |  24.0  13.0  |   -5.0   3.0
 0   |  -3.0  -4.0  |    2.0   4.0
     |   4.0   1.0  |    2.0   5.0
-----|--------------|--------------
     |  -2.0   6.0  |   -1.0  -9.0
 1   |  -4.0  -6.0  |    5.0   5.0
     |  16.0   7.0  |   -4.0   7.0

Example 2

This example computes C = betaC+alphaAB using a 2 × 2 process grid.

Call Statements and Input


 ORDER = 'R'
 NPROW = 2
 NPCOL = 2
 CALL BLACS_GET (0, 0, ICONTXT)
 CALL BLACS_GRIDINIT(ICONTXT, ORDER, NPROW, NPCOL)
 CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYROW, MYCOL)
 
           TRANSA TRANSB M   N   K       ALPHA       A  IA  JA   DESC_A   B  IB  JB
              |     |    |   |   |         |         |   |   |     |      |   |   |
 CALL PZGEMM('N' , 'N' , 6 , 2 , 3 , (1.0D0,0.0D0) , A , 1 , 1 , DESC_A , B , 1 , 1 ,
 
              DESC_B       BETA        C  IC  JC   DESC_C
                |            |         |   |   |     |
              DESC_B , (2.0D0,0.0D0) , C , 1 , 1 , DESC_C)


       | Desc_A | Desc_B | Desc_C
DTYPE_ | 1 | 1 | 1
CTXT_  | icontxt (see note 1) | icontxt (see note 1) | icontxt (see note 1)
M_     | 6 | 3 | 6
N_     | 3 | 2 | 2
MB_    | 2 | 2 | 2
NB_    | 2 | 2 | 2
RSRC_  | 0 | 0 | 0
CSRC_  | 0 | 0 | 0
LLD_   | See note 2 | See note 2 | See note 2

Notes:

  1. icontxt is the output of the BLACS_GRIDINIT call.

  2. Each process should set the LLD_ as follows:
    LLD_A = MAX(1,NUMROC(M_A, MB_A, MYROW, RSRC_A, NPROW))
    LLD_B = MAX(1,NUMROC(M_B, MB_B, MYROW, RSRC_B, NPROW))
    LLD_C = MAX(1,NUMROC(M_C, MB_C, MYROW, RSRC_C, NPROW))
    

    In this example, LLD_A = 4 on P00 and P01 and LLD_A = 2 on P10 and P11. LLD_B = 2 on P00 and LLD_B = 1 on P10. LLD_C = 4 on P00 and LLD_C = 2 on P10.

Global general 6 × 3 matrix A with block size 2 × 2:

B,D              0                   1
     *                                      *
 0   |  (1.0,5.0)  (9.0,2.0)  |   (1.0,9.0) |
     |  (2.0,4.0)  (8.0,3.0)  |   (1.0,8.0) |
     | -----------------------|------------ |
 1   |  (3.0,3.0)  (7.0,5.0)  |   (1.0,7.0) |
     |  (4.0,2.0)  (4.0,7.0)  |   (1.0,5.0) |
     | -----------------------|------------ |
 2   |  (5.0,1.0)  (5.0,1.0)  |   (1.0,6.0) |
     |  (6.0,6.0)  (3.0,6.0)  |   (1.0,4.0) |
     *                                      *

The following is the 2 × 2 process grid:

B,D  |    0    |  1  
-----| ------- |-----
0    |   P00   |  P01
2    |         |
-----| ------- |-----
1    |   P10   |  P11

Local arrays for A:

p,q  |           0            |      1
-----|------------------------|-------------
     |  (1.0,5.0)  (9.0,2.0)  |   (1.0,9.0)
     |  (2.0,4.0)  (8.0,3.0)  |   (1.0,8.0)
 0   |  (5.0,1.0)  (5.0,1.0)  |   (1.0,6.0)
     |  (6.0,6.0)  (3.0,6.0)  |   (1.0,4.0)
-----|------------------------|-------------
 1   |  (3.0,3.0)  (7.0,5.0)  |   (1.0,7.0)
     |  (4.0,2.0)  (4.0,7.0)  |   (1.0,5.0)

Global general 3 × 2 matrix B with block size 2 × 2:

B,D              0
     *                       *
 0   |  (1.0,8.0)  (2.0,7.0) |
     |  (4.0,4.0)  (6.0,8.0) |
     | --------------------- |
 1   |  (6.0,2.0)  (4.0,5.0) |
     *                       *

The following is the 2 × 2 process grid:

B,D  |    0    | --  
-----| ------- |-----
0    |   P00   |  P01
-----| ------- |-----
1    |   P10   |  P11

Local arrays for B:

p,q  |           0
-----|-----------------------
 0   |  (1.0,8.0)  (2.0,7.0)
     |  (4.0,4.0)  (6.0,8.0)
-----|-----------------------
 1   |  (6.0,2.0)  (4.0,5.0)

Global general 6 × 2 matrix C with block size 2 × 2:

B,D              0
     *                       *
 0   |  (0.5,0.0)  (0.5,0.0) |
     |  (0.5,0.0)  (0.5,0.0) |
     | --------------------- |
 1   |  (0.5,0.0)  (0.5,0.0) |
     |  (0.5,0.0)  (0.5,0.0) |
     | --------------------- |
 2   |  (0.5,0.0)  (0.5,0.0) |
     |  (0.5,0.0)  (0.5,0.0) |
     *                       *

The following is the 2 × 2 process grid:

B,D  |    0    | --  
-----| ------- |-----
0    |   P00   |  P01
2    |         |
-----| ------- |-----
1    |   P10   |  P11

Local arrays for C:

p,q  |           0
-----|-----------------------
     |  (0.5,0.0)  (0.5,0.0)
     |  (0.5,0.0)  (0.5,0.0)
 0   |  (0.5,0.0)  (0.5,0.0)
     |  (0.5,0.0)  (0.5,0.0)
-----|-----------------------
 1   |  (0.5,0.0)  (0.5,0.0)
     |  (0.5,0.0)  (0.5,0.0)

Output:

Global general 6 × 2 matrix C with block size 2 × 2:

B,D                  0
     *                               *
 0   |  (-22.0,113.0)  (-35.0,142.0) |
     |  (-19.0,114.0)  (-35.0,141.0) |
     | ----------------------------- |
 1   |  (-20.0,119.0)  (-43.0,146.0) |
     |  (-27.0,110.0)  (-58.0,131.0) |
     | ----------------------------- |
 2   |  (8.0,103.0)    (0.0,112.0)   |
     |  (-55.0,116.0)  (-75.0,135.0) |
     *                               *

The following is the 2 × 2 process grid:

B,D  |    0    | --  
-----| ------- |-----
0    |   P00   |  P01
2    |         |
-----| ------- |-----
1    |   P10   |  P11

Local arrays for C:

p,q  |               0
-----|-------------------------------
     |  (-22.0,113.0)  (-35.0,142.0)
     |  (-19.0,114.0)  (-35.0,141.0)
 0   |  (8.0,103.0)    (0.0,112.0)
     |  (-55.0,116.0)  (-75.0,135.0)
-----|-------------------------------
 1   |  (-20.0,119.0)  (-43.0,146.0)
     |  (-27.0,110.0)  (-58.0,131.0)
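
The global output of this example can be cross-checked sequentially. The following is a minimal sketch in Python using built-in complex numbers; it recomputes alpha A B + beta C from the global input matrices of this example and does not model the distribution over the process grid. The variable names are hypothetical.

```python
# Global input data from this example (transa = 'N', transb = 'N').
A = [
    [1 + 5j, 9 + 2j, 1 + 9j],
    [2 + 4j, 8 + 3j, 1 + 8j],
    [3 + 3j, 7 + 5j, 1 + 7j],
    [4 + 2j, 4 + 7j, 1 + 5j],
    [5 + 1j, 5 + 1j, 1 + 6j],
    [6 + 6j, 3 + 6j, 1 + 4j],
]
B = [
    [1 + 8j, 2 + 7j],
    [4 + 4j, 6 + 8j],
    [6 + 2j, 4 + 5j],
]
C_IN = [[0.5 + 0j, 0.5 + 0j] for _ in range(6)]
ALPHA, BETA = 1 + 0j, 2 + 0j

# C <-- alpha A B + beta C, element by element (m = 6, n = 2, k = 3).
C_OUT = [
    [ALPHA * sum(A[i][l] * B[l][j] for l in range(3)) + BETA * C_IN[i][j]
     for j in range(2)]
    for i in range(6)
]
```

Each row of C_OUT corresponds to one row of the global 6 × 2 output matrix C.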

