Loop Interchange and Subscripts: Matrix Multiply

Loops generally need unit-stride (stride-1) memory references to be vectorized, and loop interchange can often provide them. Matrix multiplication is commonly written as shown in the following example:

Example: Typical Matrix Multiplication

void matmul_slow(float *a[], float *b[], float *c[])
{
  int N = 100;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

The reference b[k][j] is not stride-1 with respect to the innermost k loop, so this loop will not normally be vectorized.
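To make the stride concrete, the following minimal sketch measures how far apart consecutive accesses to b are in memory. It assumes a single contiguous row-major array (float b[N][N]) rather than the array of row pointers used in the example, and the helper names stride_over_k and stride_over_j are hypothetical.

```c
#include <stddef.h>

#define N 100

/* Assumption: b is one contiguous row-major block, unlike the
   array of row pointers (float *b[]) used in the example. */
static float b[N][N];

/* Distance in elements between b[k][j] and b[k+1][j]:
   the step taken when k is the innermost loop index. */
size_t stride_over_k(void)
{
    return (size_t)((char *)&b[1][0] - (char *)&b[0][0]) / sizeof(float);
}

/* Distance in elements between b[k][j] and b[k][j+1]:
   the step taken when j is the innermost loop index. */
size_t stride_over_j(void)
{
    return (size_t)((char *)&b[0][1] - (char *)&b[0][0]) / sizeof(float);
}
```

Under this layout, stride_over_k() returns N and stride_over_j() returns 1, so only a loop whose innermost index is j walks through b with unit stride.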

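If the two inner loops are interchanged, however, every reference in the innermost loop becomes stride-1 or loop-invariant (c[i][j] and b[k][j] are stride-1 in j, and a[i][k] does not depend on j), as shown in the following example.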

Example: Matrix Multiplication with Stride-1

void matmul_fast(float *a[], float *b[], float *c[])
{
  int N = 100;
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
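Both loop orders accumulate each c[i][j] over k in the same ascending order, so the two functions should produce the same values. The sketch below is one way to check this. The helpers alloc_matrix, free_matrix, and compare_matmul_orders, the DIM macro, and the fill values are assumptions made for illustration; the two multiply functions are repeated only to keep the block self-contained.

```c
#include <stdlib.h>

#define DIM 100  /* must match the N hard-coded in both functions */

void matmul_slow(float *a[], float *b[], float *c[])
{
  int N = 100;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

void matmul_fast(float *a[], float *b[], float *c[])
{
  int N = 100;
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Allocate DIM zero-filled rows of DIM floats; returns NULL on failure. */
static float **alloc_matrix(void)
{
  float **m = malloc(DIM * sizeof *m);
  if (m == NULL)
    return NULL;
  for (int i = 0; i < DIM; i++) {
    m[i] = calloc(DIM, sizeof **m);
    if (m[i] == NULL) {
      while (i-- > 0)
        free(m[i]);
      free(m);
      return NULL;
    }
  }
  return m;
}

static void free_matrix(float **m)
{
  if (m == NULL)
    return;
  for (int i = 0; i < DIM; i++)
    free(m[i]);
  free(m);
}

/* Returns the number of elements that differ between the two
   results, or -1 if memory could not be allocated. */
int compare_matmul_orders(void)
{
  float **a = alloc_matrix(), **b = alloc_matrix();
  float **c1 = alloc_matrix(), **c2 = alloc_matrix();
  int mismatches = -1;

  if (a && b && c1 && c2) {
    /* Small integer values keep every sum exactly representable. */
    for (int i = 0; i < DIM; i++)
      for (int j = 0; j < DIM; j++) {
        a[i][j] = (float)((i + j) % 7);
        b[i][j] = (float)((i * 3 + j) % 5);
      }
    matmul_slow(a, b, c1);
    matmul_fast(a, b, c2);
    mismatches = 0;
    for (int i = 0; i < DIM; i++)
      for (int j = 0; j < DIM; j++)
        if (c1[i][j] != c2[i][j])
          mismatches++;
  }
  free_matrix(a);
  free_matrix(b);
  free_matrix(c1);
  free_matrix(c2);
  return mismatches;
}
```

A return value of 0 from compare_matmul_orders() indicates that the interchange did not change the result for this input. Note that both functions accumulate into c, so the output matrices must start at zero, which calloc provides here.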

Interchanging is not always possible: when a loop-carried dependency exists between iterations, changing the loop order can change the results.
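As an illustration of such a dependency, the following sketch uses a hypothetical loop nest (not taken from the matrix multiply above) in which each element reads a value written by an earlier outer iteration. The names sweep_ij, sweep_ji, count_differences, and the size M are assumptions chosen for the example.

```c
#define M 4

/* Original order: i outer, j inner. a[i][j] reads a[i-1][j+1],
   which the previous i iteration has already updated. */
static void sweep_ij(int a[M][M])
{
  for (int i = 1; i < M; i++)
    for (int j = 0; j < M - 1; j++)
      a[i][j] = a[i - 1][j + 1] + 1;
}

/* Interchanged order: j outer, i inner. Column j+1 has not been
   processed yet, so a[i-1][j+1] still holds its old value. */
static void sweep_ji(int a[M][M])
{
  for (int j = 0; j < M - 1; j++)
    for (int i = 1; i < M; i++)
      a[i][j] = a[i - 1][j + 1] + 1;
}

/* Runs both orders on zero-filled arrays and returns the number
   of elements whose final values differ. */
int count_differences(void)
{
  int x[M][M] = {{0}};
  int y[M][M] = {{0}};
  int diff = 0;

  sweep_ij(x);
  sweep_ji(y);
  for (int i = 0; i < M; i++)
    for (int j = 0; j < M; j++)
      if (x[i][j] != y[i][j])
        diff++;
  return diff;
}
```

Here count_differences() returns a nonzero value, because the interchanged loop reads elements before they have been updated. A compiler that detects this dependency cannot interchange the loops without changing the program's meaning.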