Strip-mining and Cleanup

Strip-mining, also known as loop sectioning, is a loop transformation technique that enables SIMD encodings of loops and can also improve memory performance. By fragmenting a large loop into smaller segments, or strips, this technique transforms the loop structure in two ways:

It increases the temporal and spatial locality in the data cache if the data are reused in different passes of an algorithm.
It reduces the number of loop iterations by a factor equal to the vector length, that is, the number of data items processed per SIMD operation. With Streaming SIMD Extensions, the iteration count is reduced by a factor of 4, because each single-precision floating-point SIMD operation processes four data items.

First introduced for vectorizers, this technique consists of generating code in which each vector operation processes a number of elements less than or equal to the maximum vector length of the target vector machine.
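The sectioning idea described above can be sketched in plain C. This is a minimal illustration rather than compiler output: the function name add_strip_mined and the constant MAX_VL (an assumed maximum vector length of four elements) are hypothetical, and the inner loop stands in for what would be a single vector operation on a vector machine.

```c
#include <stddef.h>

#define MAX_VL 4  /* assumed maximum vector length, in elements (hypothetical) */

/* Computes a[i] = b[i] + c[i] one strip at a time.
   Each strip holds at most MAX_VL elements. */
void add_strip_mined(float *a, const float *b, const float *c, size_t n)
{
    for (size_t start = 0; start < n; start += MAX_VL) {
        /* The last strip may be shorter than MAX_VL. */
        size_t len = (n - start < MAX_VL) ? (n - start) : MAX_VL;

        /* On a vector machine this inner loop would be one vector
           operation of length len. */
        for (size_t k = 0; k < len; ++k)
            a[start + k] = b[start + k] + c[start + k];
    }
}
```

Because the strip length is recomputed for every strip, this form needs no separate cleanup loop; the final, shorter strip is handled by the same code.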

The compiler automatically strip-mines your loop and generates a cleanup loop. For example, assume the compiler attempts to strip-mine the following loop:

Example 1: Before Vectorization
i=0;
while(i<n)
{
    // Original loop code
    a[i]=b[i]+c[i];
    ++i;
}

The compiler might handle the strip-mining and loop cleanup by restructuring the loop in the following manner:

Example 2: After Vectorization
// The vectorizer generates the following two loops
i=0;
while(i<(n-n%4))
{
    // Vector strip-mined loop
    // Subscript [i:i+3] denotes SIMD execution
    a[i:i+3]=b[i:i+3]+c[i:i+3];
    i=i+4;
}
while(i<n)
{
    // Scalar clean-up loop
    a[i]=b[i]+c[i];
    ++i;
}
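The [i:i+3] subscript in Example 2 is notation, not compilable C. As a rough hand-written equivalent, the sketch below expresses the same two loops with SSE intrinsics from <xmmintrin.h>. The function name add_simd is hypothetical, unaligned loads and stores are assumed so that the arrays need no special alignment, and the __SSE__ check (a GCC/Clang predefined macro) selects a portable fallback on other targets. Code actually generated by a vectorizer may differ.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Computes a[i] = b[i] + c[i] with a strip-mined vector loop
   followed by a scalar cleanup loop. */
void add_simd(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;
    size_t vec_end = n - n % 4;  /* largest multiple of 4 not exceeding n */

    /* Vector strip-mined loop: four floats per iteration. */
    for (; i < vec_end; i += 4) {
#if defined(__SSE__)
        __m128 vb = _mm_loadu_ps(&b[i]);           /* load b[i..i+3] */
        __m128 vc = _mm_loadu_ps(&c[i]);           /* load c[i..i+3] */
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));  /* store the four sums */
#else
        /* Portable fallback when SSE is not available. */
        for (size_t k = 0; k < 4; ++k)
            a[i + k] = b[i + k] + c[i + k];
#endif
    }

    /* Scalar cleanup loop: the remaining n % 4 elements. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

For n = 10, the vector loop handles elements 0 through 7 in two iterations and the cleanup loop handles elements 8 and 9.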

Loop Blocking

It is possible to treat loop blocking as strip-mining in two or more dimensions. Loop blocking is a useful technique for memory performance optimization. Its main purpose is to eliminate as many cache misses as possible. Rather than traversing the entire memory domain sequentially, this technique divides the memory domain into smaller chunks. Each chunk should be small enough that all the data for a given computation fits into the cache, thereby maximizing data reuse.
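The requirement that each chunk fit in the cache can be turned into a rough sizing calculation. The helper below is a hypothetical sketch, not part of any compiler or library: max_block_edge and its parameters are illustrative names, and the model deliberately ignores cache associativity, cache-line granularity, and any other data competing for the cache. It returns the largest edge length bs such that num_arrays square blocks of bs x bs elements fit in cache_bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Largest block edge (in elements) such that num_arrays blocks of
   bs x bs elements, each elem_size bytes, fit in cache_bytes.
   Returns 0 if not even a 1 x 1 block per array fits. */
size_t max_block_edge(size_t cache_bytes, size_t elem_size, size_t num_arrays)
{
    assert(elem_size > 0 && num_arrays > 0);

    size_t bs = 0;
    /* Grow the edge while the next larger block set still fits. */
    while (num_arrays * (bs + 1) * (bs + 1) * elem_size <= cache_bytes)
        ++bs;
    return bs;
}
```

As an example under these assumptions, a 32 KB cache shared by two arrays of 4-byte elements gives an edge of 64, since 2 x 64 x 64 x 4 bytes is exactly 32,768 bytes. In practice a smaller value is usually chosen to leave headroom.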

In the following example, loop blocking allows arrays A and B to be divided into smaller rectangular chunks so that the combined size of one chunk of A and one chunk of B is smaller than the cache size, which can improve data reuse.

Example 3: Original loop
#include <stdio.h>
#include <time.h>

#define MAX 7000

void add(int a[][MAX], int b[][MAX]);

// Static storage: each 7000 x 7000 int array takes roughly 196 MB,
// which is far too large for a typical stack.
static int A[MAX][MAX];
static int B[MAX][MAX];

int main(void)
{
    int i, j;
    time_t start, end;
    int sec;

    // Initialize arrays
    for(i=0;i<MAX;i++)
    {
        for(j=0;j<MAX;j++)
        {
            A[i][j]=j;
            B[i][j]=j;
        }
    }

    start=time(NULL);
    add(A, B);
    end=time(NULL);
    sec=(int)difftime(end, start);
    printf("Time %d\n",sec); // Display time taken to complete add function
    return 0;
}

void add(int a[][MAX], int b[][MAX])
{
    int i, j;
    for(i=0;i<MAX;i++)
    {
        for(j=0;j<MAX;j++)
        {
            a[i][j] = a[i][j] + b[j][i]; // Adds the transpose of b to a
        }
    }
}

The following example illustrates loop blocking applied to the add function from the previous example. In order to benefit from this optimization you might have to increase the cache size.

Example 4: Transformed Loop after blocking
#include <stdio.h>
#include <time.h>

#define MAX 7000
#define BS 8 // Block size is selected as the loop-blocking factor; MAX must be a multiple of BS

void add(int a[][MAX], int b[][MAX]);

// Static storage: each 7000 x 7000 int array takes roughly 196 MB,
// which is far too large for a typical stack.
static int A[MAX][MAX];
static int B[MAX][MAX];

int main(void)
{
    int i, j;
    time_t start, end;
    int sec;

    // Initialize arrays
    for(i=0;i<MAX;i++)
    {
        for(j=0;j<MAX;j++)
        {
            A[i][j]=j;
            B[i][j]=j;
        }
    }

    start=time(NULL);
    add(A, B);
    end=time(NULL);
    sec=(int)difftime(end, start);
    printf("Time %d\n",sec); // Display time taken to complete the blocked add function
    return 0;
}

void add(int a[][MAX], int b[][MAX])
{
    int i, j, ii, jj;
    for(i=0;i<MAX;i+=BS)
    {
        for(j=0;j<MAX;j+=BS)
        {
            for(ii=i;ii<i+BS;ii++) // outer loop
            {
                for(jj=j;jj<j+BS;jj++) // Array B experiences one cache miss
                {                      // for every iteration of outer loop
                    a[ii][jj] = a[ii][jj] + b[jj][ii]; // Add the two arrays
                }
            }
        }
    }
}
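Example 4 relies on MAX being an exact multiple of the block size. When that does not hold, the edge blocks need the same kind of cleanup that strip-mining uses for leftover iterations. The sketch below is one possible way to handle it, under stated assumptions: the name add_transposed_blocked is hypothetical, the matrices are square, stored row-major in flat arrays of n * n ints, and do not overlap, and the block size is passed as a parameter instead of a macro.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Computes a[i][j] += b[j][i] for n x n row-major matrices,
   visiting the index space in bs x bs blocks. Edge blocks are
   clipped to n, so n need not be a multiple of bs. */
void add_transposed_blocked(int *a, const int *b, size_t n, size_t bs)
{
    assert(bs > 0);

    for (size_t i = 0; i < n; i += bs) {
        size_t i_end = min_size(i + bs, n);      /* clip the last row block */
        for (size_t j = 0; j < n; j += bs) {
            size_t j_end = min_size(j + bs, n);  /* clip the last column block */
            for (size_t ii = i; ii < i_end; ++ii)
                for (size_t jj = j; jj < j_end; ++jj)
                    a[ii * n + jj] += b[jj * n + ii];
        }
    }
}
```

Clipping each block with a minimum keeps a single loop nest; an alternative is to run full blocks first and handle the remaining rows and columns in separate cleanup loops, as in the strip-mined example.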
