The following sections explain the performance and accuracy considerations for the Fourier transforms, convolution, and correlation subroutines. For further details about performance and accuracy, see Chapter 2, Planning Your Program.
There are ESSL-specific rules that apply to the results of computations on the workstation processors using the ANSI/IEEE standards. For details, see What Data Type Standards Are Used by ESSL, and What Exceptions Should You Know About?.
The stride arguments, inc1h, inc1x, inc1y, inc2x, inc2y, inc3x, and inc3y, provide great flexibility in defining the input and output data arrays. The arrangement of data in storage, however, can have an effect upon cache performance. By using strides, you can have data scattered in storage. Best performance is obtained with data closely spaced in storage and with elements of the sequence in contiguous locations. The optimum values for inc1h, inc1x, and inc1y are 1.
In writing the calling program, you may find it convenient to declare X or Y as a two-dimensional array. For example, you can declare X in a DIMENSION statement as X(INC2X,M).
This section describes some ways to optimize performance in the Fourier transform subroutines.
Many of the Fourier transform, convolution, and correlation subroutines provide the facility for processing many sequences in one call. For short sequences, for example 1024 elements or less, this facility should be used as much as possible. This provides improved performance compared to processing only one sequence at a time.
If possible, you should use the same array for input and output. In addition, the requirements for the strides of the input and output arrays are explained in the Notes for each subroutine.
For improved performance, small values of inc1x and inc1y should be used, where applicable, preferably inc1x = 1 and inc1y = 1. A stride of 1 means the sequence elements are stored contiguously. Also, if possible, the sequences should be stored close to each other. For all the Fourier transform subroutines except _RCFT and _CRFT, you should use the STRIDE subroutine to determine the optimal stride(s) for your input or output data. Complete instructions on how to use STRIDE for each of these subroutines is included in STRIDE--Determine the Stride Value for Optimal Performance in Specified Fourier Transform Subroutines.
To obtain the best performance in the three-dimensional Fourier transform subroutines, you should use strides, inc2 and inc3, provided by the STRIDE subroutine and declare your three-dimensional data structure as a one-dimensional array. The three-dimensional Fourier transform subroutines assume that inc1 for the array is 1. Therefore, each element xijk for i = 0, 1, ..., n1-1, j = 0, 1, ..., n2-1, and k = 0, 1, ..., n3-1 of the three-dimensional data structure of dimensions n1 by n2 by n3 is stored in a one-dimensional array X(0:L) at location X(l), where l = i+inc2(j)+inc3(k). The minimum required value of L is calculated by inserting the maximum values for i, j, and k in the above equation, giving L = (n1-1)+inc2(n2-1)+inc3(n3-1). The minimum total size of array X is L+1. To ensure that this mapping is unique so no two elements xijk occupy the same array element, X(l), the subroutines have the following restriction: inc2 >= n1 and inc3 >= (inc2)(n2). This arrangement of array data in storage leaves some blank space between successive planes of the array X. By determining the best size for this space, specifying an optimum inc3 stride, the third dimension of the array does not create conflicts in the 3090 storage hierarchy.
If the inc3 stride value returned by the STRIDE subroutine turns out to be a multiple of inc2, the array X can be declared as a three-dimensional array as X(inc2,inc3/inc2,n3); otherwise, it can be declared as either a one-dimensional array, X(0:L), as described above, or a two-dimensional array X(0:inc3-1,0:n3-1), where xijk is stored in X(l,k) where l = i+(inc2)(j).
If you must multiply either the input or the output sequences by a common factor, you can avoid the multiplication by letting the scale argument contain the factor. The subroutines multiply the sine and cosine values by the scale factor during the initialization. Thus, scaling takes no time after the initialization of the Fourier transform calculations.
There are two levels of optimization for the fast Fourier transforms (FFTs) in the ESSL library. For sequences with a large power of 2 length, we provide efficient radix-2 and radix-8 transform implementations where cache use is optimized. The cache optimization includes ordering of operations to maximize stride-1 data access and prefetching cache lines.
Similar optimization techniques are used for sequence lengths which are not a power of 2 and mixed-radix FFT's are performed. Many short sequence FFT's have sequence size specific optimizations. Some of these optimizations were originally developed for a vector machine and have been adapted for cache based RISC machines (see references [1], [5], and [7])
The other optimization in the FFT routine is to treat multiple sequences as efficiently as possible. Techniques here include blocking sequences to fit into available CPU cache and transposing sequences to ensure stride-1 access. Whenever possible, the highest performance can be obtained when multiple sequences are transformed in a single call.
This section describes some ways to optimize performance in the convolution and correlation subroutines.
The subroutines SCON, SCOR, SACOR, SCOND, SCORD, SDCON, SDCOR, DDCON, and DDCOR compute convolutions, correlations, and autocorrelations using essentially the same methods. They make a decision, based on estimated timings, to use one of two methods:
Using this approach has the following advantages:
In general, using SCONF, SCORF, and SACORF provides the best performance, because the mixed-radix Fourier transform subroutines are used. However, if you can determine from your arguments that a direct method is preferred, you should use SCOND and SCORD instead. These give you better performance for the direct methods, and also give you additional capabilities.
In cases where there is doubt as to the best choice of a subroutine, perform timing experiments.
The subroutine SCORD can perform the functions of SCON and SACOR; that is, it can compute convolutions and autocorrelations. To compute a convolution, you must specify a negative stride for h (see Example 4 in SCORD). To compute the autocorrelation, you must specify the two input sequences to be the same (see Example 5 in SCORD).
The _DCON and _DCOR subroutines compute convolutions and correlations, respectively, by the direct method with decimated output. Setting the decimation interval id = 1 in SDCON and SDCOR provides the same function as SCOND and SCORD, respectively. Doing the same in DDCON and DDCOR provides long-precision versions of SCOND and SCORD, respectively, which are not otherwise available.
The direct methods used by the convolution and correlation subroutines use vector operations to accumulate sums of products. The products are computed and accumulated in long precision. As a result, higher accuracy can be obtained in the final results for some types of data. For example, if input data consists only of integers, and if no intermediate and final numbers become too large (larger than 224-1 for short-precision computations and larger than 256-1 for long-precision computations), the results are exact.
The Fourier methods used by the convolution and correlation subroutines compute Fourier transforms of input data that is multiplied element-by-element in short-precision arithmetic. The inverse Fourier transform is then computed. There are internally generated rounding errors in the Fourier transforms. It has been shown in references [ 96] and [ 85] that, in the case of white noise data, the relative root mean square (RMS) error of the Fourier transform is proportional to log2n with a very small proportionality factor. In general, with random, evenly distributed data, this is better than the RMS error of the direct method. However, one must keep in mind the fact that, while the Fourier method may yield a smaller root mean square error, there can be points with large relative errors. Thus, it can happen that some points, usually at the ends of the output sequence, can be obtained with greater relative accuracy with direct methods.
The convolution and correlation subroutines that use the Fourier methods determine a sequence length n, whose Fourier transform is computed using ESSL subroutines. In the simple case where iy0 = 0 for convolution or iy0 = -nh+1 for correlation, n is chosen as a value greater than or equal to the following, which is also acceptable to the Fourier tranform subroutines:
which is also acceptable to the Fourier subroutines.