Loops need unit-stride constructs to be vectorized, and loop interchange can often provide them. Matrix multiplication is commonly written as shown in the following example:
Example: Typical Matrix Multiplication |
---|
void matmul_slow(float *a[], float *b[], float *c[]) { int N = 100; for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j]; } |
The reference b[k][j] is not a stride-1 reference in the innermost k loop, because consecutive iterations touch different rows of b. The loop therefore will not normally be vectorizable.
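
To make the stride concrete, the following minimal sketch measures how far apart consecutive accesses to `b[k][j]` are, in units of `float` elements. It assumes a contiguous row-major array `float b[N][N]`, which is not how the example above declares its arguments (there, `float *b[]` is an array of row pointers whose rows need not be adjacent). The function names are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

enum { N = 100 };       /* matches N in the example above */
static float b[N][N];   /* assumption: contiguous row-major storage */

/* Elements skipped when k advances by one with j fixed:
   b[k][j] -> b[k+1][j]. This is the step taken by an innermost k loop. */
ptrdiff_t stride_when_k_is_innermost(void) {
    return ((char *)&b[1][0] - (char *)&b[0][0]) / (ptrdiff_t)sizeof(float);
}

/* Elements skipped when j advances by one with k fixed:
   b[k][j] -> b[k][j+1]. This is the step taken by an innermost j loop. */
ptrdiff_t stride_when_j_is_innermost(void) {
    return ((char *)&b[0][1] - (char *)&b[0][0]) / (ptrdiff_t)sizeof(float);
}
```

Under that assumption the first function returns N (a full row per step) and the second returns 1, which is why the choice of innermost loop decides whether the reference is stride-1.
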
If the j and k loops are interchanged, however, every reference that varies in the innermost loop becomes stride-1 (c[i][j] and b[k][j]), while a[i][k] is invariant in that loop, as shown in the following example.
Example: Matrix Multiplication with Stride-1 |
---|
void matmul_fast(float *a[], float *b[], float *c[]) { int N = 100; for (int i = 0; i < N; i++) for (int k = 0; k < N; k++) for (int j = 0; j < N; j++) c[i][j] = c[i][j] + a[i][k] * b[k][j]; } |
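
The interchange is safe for matrix multiplication because each `c[i][j]` still receives its k terms in the same order; only the order in which different elements of `c` are updated changes. The sketch below checks this on a small case. The size `M`, the function names, and the fill values are assumptions made for illustration, and fixed-size arrays replace the `float *a[]` parameters used above.

```c
#include <stdbool.h>

enum { M = 8 };  /* small hypothetical size; the examples above use N = 100 */

/* Original order: k innermost, so b[k][j] is not stride-1. */
static void matmul_ijk(float a[M][M], float b[M][M], float c[M][M]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < M; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged order: j innermost, so c[i][j] and b[k][j] are stride-1
   and a[i][k] is invariant in the inner loop. */
static void matmul_ikj(float a[M][M], float b[M][M], float c[M][M]) {
    for (int i = 0; i < M; i++)
        for (int k = 0; k < M; k++)
            for (int j = 0; j < M; j++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Returns true when both loop orders produce identical results. */
bool interchange_preserves_result(void) {
    float a[M][M], b[M][M], c1[M][M] = {{0}}, c2[M][M] = {{0}};
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            /* Small integer values keep every sum exactly representable. */
            a[i][j] = (float)((i + j) % 3);
            b[i][j] = (float)((i * j) % 5);
        }
    matmul_ijk(a, b, c1);
    matmul_ikj(a, b, c2);
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            if (c1[i][j] != c2[i][j])
                return false;
    return true;
}
```

The inputs are small integers so that the comparison can use exact equality; with arbitrary floating-point data a tolerance-based comparison may be the more cautious choice.
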
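Interchanging is not always possible. When one iteration depends on a value produced by another, reordering the loops can change the order of the dependent reads and writes and lead to different results.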
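
As an illustration of such a dependence, consider a hypothetical loop nest (not taken from the examples above) in which `a[i][j]` reads `a[i - 1][j + 1]`. In the original order, row `i - 1` is fully written before row `i` is computed. After interchange, some of those reads happen before the corresponding writes, so the result changes. A minimal sketch, with illustrative names and sizes:

```c
#include <stdbool.h>
#include <string.h>

enum { R = 4 };  /* hypothetical small size */

/* Original order: i outer, j inner. Row i-1 is complete before row i starts. */
static void sweep_ij(int a[R][R]) {
    for (int i = 1; i < R; i++)
        for (int j = 0; j < R - 1; j++)
            a[i][j] = a[i - 1][j + 1] + 1;
}

/* Interchanged order: j outer, i inner. a[i-1][j+1] belongs to a later
   j iteration, so it is read before it has been updated. */
static void sweep_ji(int a[R][R]) {
    for (int j = 0; j < R - 1; j++)
        for (int i = 1; i < R; i++)
            a[i][j] = a[i - 1][j + 1] + 1;
}

/* Returns true when the two loop orders give different arrays. */
bool interchange_changes_result(void) {
    int x[R][R] = {{0}}, y[R][R] = {{0}};
    sweep_ij(x);
    sweep_ji(y);
    return memcmp(x, y, sizeof x) != 0;
}
```

Starting from an all-zero array, the original order leaves `a[2][0]` equal to 2, while the interchanged order leaves it equal to 1, because `a[1][1]` has not yet been written when it is read.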