Loop Interchange and Subscripts: Matrix Multiply

Loops need unit-stride (stride-1) memory references to be vectorized efficiently, and loop interchange can often provide them. Matrix multiplication is commonly written as shown in the following example:

Example: Typical Matrix Multiplication

    void matmul_slow(float *a[], float *b[], float *c[])
    {
        int N = 100;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

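The reference b[k][j] is not stride-1: the innermost loop varies k, the row index of b, so consecutive iterations touch elements in different rows. A loop containing such a reference will not normally be vectorized.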

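If the j and k loops are interchanged, however, the innermost loop runs over j. The references c[i][j] and b[k][j] then become stride-1, and a[i][k] is invariant in the innermost loop, as shown in the following example.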

Example: Matrix Multiplication with Stride-1

    void matmul_fast(float *a[], float *b[], float *c[])
    {
        int N = 100;
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
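
The two loop orders can be checked against each other with a small sketch like the one below. It is illustrative only and makes several assumptions that are not part of the examples above: the matrices are stored as flat row-major arrays rather than arrays of row pointers, the size n is a parameter, and the names matmul_ijk, matmul_ikj, and orders_agree are hypothetical. As in the examples above, c must be initialized by the caller because the loops accumulate into it.

```c
#include <stddef.h>

/* i-j-k order: the innermost loop varies k, so b[k*n + j] is
   accessed with a stride of n elements. */
void matmul_ijk(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* i-k-j order: the innermost loop varies j, so c and b are accessed
   with stride 1 and a[i*n + k] is invariant in the innermost loop. */
void matmul_ikj(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Returns 1 if both orders produce identical results on a small test
   input. The inputs are small integers stored as floats, so every
   product and sum is exactly representable. */
int orders_agree(void)
{
    enum { N = 8 };
    float a[N * N], b[N * N];
    float c1[N * N] = {0};
    float c2[N * N] = {0};

    for (size_t i = 0; i < N * N; i++) {
        a[i] = (float)(i % 7);
        b[i] = (float)(i % 5);
    }

    matmul_ijk(a, b, c1, N);
    matmul_ikj(a, b, c2, N);

    for (size_t i = 0; i < N * N; i++)
        if (c1[i] != c2[i])
            return 0;
    return 1;
}
```

For each element of c, both orders add the products in the same sequence of k values, which is why the results are expected to match here. Whether the compiler actually vectorizes either loop nest depends on the compiler and its options.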

Interchange is not always legal. If the new loop order would reverse a data dependence between iterations, the interchanged loops can produce different results.
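
As a hedged illustration of such a dependence, the sketch below uses a hypothetical loop nest (not taken from the matrix multiply example) in which each element is computed from the element one row up and one column to the right. The function names sweep_ij and sweep_ji and the flat row-major layout are assumptions made for this example. With i as the outer loop, the value read has always been written by an earlier iteration; with j as the outer loop, some reads happen before the corresponding writes.

```c
#include <stddef.h>

/* Original order: i outer, j inner. Element (i, j) reads (i-1, j+1),
   which was written during the previous iteration of the i loop. */
void sweep_ij(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j + 1 < n; j++)
            a[i * n + j] = a[(i - 1) * n + (j + 1)] + 1;
}

/* Interchanged order: j outer, i inner. Element (i, j) now reads
   (i-1, j+1) before column j+1 has been updated, so it can see a
   stale value. This interchange is NOT legal for this loop nest. */
void sweep_ji(int *a, size_t n)
{
    for (size_t j = 0; j + 1 < n; j++)
        for (size_t i = 1; i < n; i++)
            a[i * n + j] = a[(i - 1) * n + (j + 1)] + 1;
}
```

Starting from a 3 x 3 array of zeros, sweep_ij leaves the value 2 in row 2, column 0, while sweep_ji leaves 1 there, so the two orders are not interchangeable for this loop nest.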