This topic provides general guidelines for coding practices and techniques for:
IA-32 architecture supporting MMX(TM) technology, Streaming SIMD Extensions (SSE), and Streaming SIMD Extensions 2 (SSE2)
Itanium® architecture
This topic describes practices, tools, coding rules, and recommendations associated with architecture features that can improve performance on the IA-32 and Itanium processor families. For details about optimization for IA-32 processors, see the Intel® Architecture Optimization Reference Manual. For details about optimization for the Itanium processor family, see the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization.
Note
If a guideline refers to a particular architecture only, that architecture
is explicitly named. By default, a guideline applies to both the IA-32 and Itanium architectures.
Performance of compiler-generated code may vary from one compiler to another. The Intel® Fortran Compiler generates code that is highly optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler to optimize your Fortran program by following the guidelines described here.
To achieve optimum processor performance in your Fortran application, do the following:
avoid memory access stalls
ensure good floating-point performance
ensure good SIMD integer performance
use vectorization
The coding practices, rules, and recommendations described here will contribute to optimizing the performance on Intel architecture-based processors.
The Intel compiler lays out Fortran arrays in column-major order. For example, in a two-dimensional array, elements A(22, 34) and A(23, 34) are contiguous in memory. For best performance, code arrays so that inner loops access them in a contiguous manner. Consider the following examples.
The code in example 1 will likely have higher performance than the code in example 2.
Example 1
DO J = 1, N
DO I = 1, N
B(I,J) = A(I,J) + 1
END DO
END DO
In the code above, the inner loop over I accesses arrays A and B contiguously, which results in good performance.
Example 2
DO I = 1, N
DO J = 1, N
B(I,J) = A(I,J) + 1
END DO
END DO
In the code above, the inner loop over J accesses arrays A and B non-contiguously, which results in poor performance.
The compiler itself can transform the code so that inner loops access memory contiguously. To enable this, use advanced optimization options: -O3 for both the IA-32 and Itanium architectures, or -O3 combined with -ax{K|W|N|B|P} for IA-32 only.
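As a sketch of how these options are applied (the source file name here is hypothetical; the option spellings are those given above):

```shell
# Hypothetical file name, for illustration only.
# Both IA-32 and Itanium: high-level loop optimizations, including
# loop interchange where the compiler finds it profitable.
ifort -O3 matrix.f90
# IA-32 only: additionally generate a specialized SSE2 code path
# (W selects the Pentium 4 processor target, as one example).
ifort -O3 -axW matrix.f90
```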
Alignment is an important factor in ensuring good performance. Aligned memory accesses are faster than unaligned accesses. If you use interprocedural optimization across multiple files (the -ipo option), the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start at an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler. For example, consider the following COMMON statement:
COMMON /AREA1/ A(200), X, B(200)
If the compiler added padding to align A(1) on a 16-byte boundary, element B(1) would not fall on a 16-byte boundary. It is therefore better to split AREA1 as follows:
COMMON /AREA1/ A(200)
COMMON /AREA2/ X
COMMON /AREA3/ B(200)
The above code gives the compiler maximum flexibility in determining the padding required for both A and B.
To improve floating-point performance, observe these general rules:
Avoid exceeding representable ranges during computation, since handling these cases can impact performance. Use REAL variables in single-precision format unless the extra precision obtained through DOUBLE PRECISION or REAL*8 variables is required. Using variables with a larger precision format also increases memory size and bandwidth requirements.
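As a minimal sketch of the memory cost involved: REAL*8 doubles the storage and bandwidth needed per element compared with single-precision REAL. (The STORAGE_SIZE intrinsic used here is Fortran 2008, newer than the compilers this topic covers; it serves only to make the sizes visible.)

```fortran
! Sketch: single precision uses half the memory of REAL*8.
program precision_demo
  real   :: s          ! single precision: 4 bytes per element
  real*8 :: d          ! double precision: 8 bytes per element
  print *, storage_size(s)/8, 'bytes per REAL element'
  print *, storage_size(d)/8, 'bytes per REAL*8 element'
end program precision_demo
```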
For IA-32 only: Avoid computations that repeatedly switch among more than two rounding modes, which can lead to poor performance when the computation is done using non-SSE instructions. Hence avoid using FLOOR and TRUNC operations together when generating non-SSE code. The same applies to CEIL and TRUNC.
Another way to avoid the problem is to use the -x{K|W|N|B|P} options to do the computation using SSE instructions.
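A further workaround, sketched below, is to derive the floor from truncation alone, so that generated non-SSE code needs only one rounding mode. This rewrites FLOOR in terms of INT, which truncates toward zero; the program shown is illustrative, not the only possible formulation:

```fortran
! Sketch: compute FLOOR(x) using only truncation (INT),
! avoiding a second rounding mode in non-SSE code.
program floor_via_trunc
  real    :: x
  integer :: f
  x = -2.5
  f = int(x)                                   ! truncate toward zero: -2
  if (x < 0.0 .and. real(f) /= x) f = f - 1    ! adjust for negatives: -3
  print *, f                                   ! floor(-2.5) is -3
end program floor_via_trunc
```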
Reduce the impact of denormal exceptions for both architectures as described below.
Floating-point computations that underflow can produce denormal values, which have an adverse impact on performance.
For IA-32: Take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE), and Streaming SIMD Extensions 2 (SSE2) instructions. The -x{K|W|N|B|P} options enable the flush-to-zero (FTZ) mode in SSE and SSE2 instructions, whereby underflow results are automatically converted to zero, which improves application performance. In addition, the -xP option also enables the denormals-are-zero (DAZ) mode, whereby denormals are converted to zero on input, further improving performance. An application developer willing to trade pure IEEE-754 compliance for speed would benefit from these options. For more information on FTZ and DAZ, see Setting FTZ and DAZ Flags and "Floating-point Exceptions" in the Intel® Architecture Optimization Reference Manual.
For the Itanium architecture: enable flush-to-zero (FTZ) mode with the -ftz option, which is set by the -O3 option.
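As a hedged sketch of these options on the command line (the source file name is hypothetical; the options are those named above):

```shell
# Hypothetical file name, for illustration only.
# IA-32: generate SSE/SSE2 code with FTZ; -xP also enables DAZ.
ifort -xP app.f90
# Itanium: -O3 sets -ftz implicitly, or request it explicitly.
ifort -O3 app.f90
ifort -ftz app.f90
```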
Many applications gain significant performance from vectorization, which uses SSE and SSE2 SIMD instructions for the main computational loops. The Intel Compiler can turn on vectorization automatically (auto-vectorization), or you can request it with compiler directives.
See the Auto-vectorization (IA-32 Only) section for complete details.
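As a minimal sketch of a loop the vectorizer typically targets: the iterations are independent and the accesses are unit-stride, so under -x{K|W|N|B|P} the compiler can use SSE/SSE2 instructions for it. The !DEC$ VECTOR ALWAYS directive (an Intel Fortran directive) asks the compiler to vectorize even when its heuristics would otherwise decline:

```fortran
! Sketch: a unit-stride loop with independent iterations,
! a typical candidate for SSE/SSE2 vectorization.
program vec_demo
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n)
  integer :: i
  b = 1.0
  c = 2.0
!DEC$ VECTOR ALWAYS
  do i = 1, n
    a(i) = b(i) + c(i)     ! no cross-iteration dependence
  end do
  print *, a(1), a(n)      ! both elements are 3.0
end program vec_demo
```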
The Intel Fortran Compiler and the Intel® Threading Toolset provide capabilities that make developing multithreaded applications easy. See Parallel Programming with Intel Fortran. Multithreaded applications can show significant performance benefit on multiprocessor Intel symmetric multiprocessing (SMP) systems or on Intel processors with Hyper-Threading technology.