Understanding Floating-point Performance

Denormal Computations

A denormal number is one whose mantissa is nonzero but whose exponent field is zero in an IEEE* floating-point representation. The smallest normal single-precision floating-point number greater than zero is about 1.175494350822288e-38. Smaller numbers are possible, but they are denormal and require hardware or operating system intervention to handle, which can cost hundreds of clock cycles.
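The definition above can be checked directly on the bit pattern of a value. The following is a minimal sketch, assuming that float is the 32-bit IEEE single-precision format (1 sign bit, 8 exponent bits, 23 mantissa bits); the function name is_denormal_bits is hypothetical and used only for illustration.

```c
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when x is denormal: exponent field zero, mantissa nonzero.
   Assumes float is IEEE single precision (32 bits). */
int is_denormal_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret without aliasing issues */
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;
    return exponent == 0 && mantissa != 0;
}
```

For example, is_denormal_bits(FLT_MIN / 2.0f) returns 1, while is_denormal_bits(FLT_MIN) and is_denormal_bits(0.0f) return 0. The standard fpclassify() macro from math.h reports the same condition as FP_SUBNORMAL.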

In many cases, denormal numbers are evidence of an algorithm problem: a poor choice of algorithm causes excessive computation in the denormal range. There are several ways to handle denormal numbers. For example, you can translate to normal, which means multiplying by a large scalar, doing the remaining computations in the normal range, and then scaling back down to the denormal range. Do this whenever the small denormal values benefit the program design. In many cases, denormals that can be treated as zero may be flushed to zero.
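The translate-to-normal approach can be sketched as follows. This is an illustration only: the function name sum_of_squares_scaled and the scale factor of 2^64 are assumptions chosen for the example, and a suitable factor depends on the range of your data. Scaling by a power of two is exact as long as the scaled value stays in the normal range.

```c
#include <stddef.h>

/* Hypothetical scale factors: 2^64 up, 2^-64 down (powers of two scale exactly). */
#define SCALE_UP   0x1p64f
#define SCALE_DOWN 0x1p-64f

/* Sum of squares of very small inputs. Squaring such inputs directly would
   produce denormal intermediates; scaling first keeps the loop in the normal range. */
float sum_of_squares_scaled(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float xs = x[i] * SCALE_UP;   /* x * 2^64 */
        acc += xs * xs;               /* each term carries a factor of 2^128 */
    }
    /* Undo the 2^128 factor in two steps; only the final multiply can
       produce a denormal result. */
    return acc * SCALE_DOWN * SCALE_DOWN;
}
```

With four inputs of 1e-20f, all additions and multiplications inside the loop operate on normal values, and only the final scale-down yields a denormal result of about 4e-40.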

Denormals are computed in software on Itanium® processors. Hundreds of clock cycles are required, resulting in excessive kernel time. Attempt to understand why denormal results occur, and determine if they are justified. If you determine they are not justified, then use the following suggestions to handle the results:

Translate the problem to the normal range by scaling values.
Increase precision and range by using a wider data type.
Set flush-to-zero mode in the floating-point control register: -ftz (Linux*) or /Qftz (Windows*).
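As an illustration of the wider-data-type suggestion, the sketch below computes the same product in single and in double precision. The function names are hypothetical; the example assumes IEEE single- and double-precision formats for float and double.

```c
#include <math.h>

/* Product computed in single precision: small inputs can underflow
   into the denormal range. */
float product_float(float a, float b)
{
    return a * b;
}

/* Same product computed in double precision: the wider exponent range
   keeps the result normal. */
double product_double(float a, float b)
{
    return (double)a * (double)b;
}
```

For inputs of 1e-20f, the single-precision product (about 1e-40) is below the smallest normal float and is therefore denormal, while the double-precision product is an ordinary normal value.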

Denormal numbers always indicate a loss of precision, an underflow condition, and usually an error (or at least a less-than-desirable condition). On the Intel® Pentium® 4 processor and the Intel Itanium® processor, floating-point computations that generate denormal results can have those results flushed to zero, which improves performance.

The Intel compiler disables the FTZ and DAZ bits when you specify value-safe options, including the strict, precise, source, double, and extended models supported by the -fp-model (Linux*) or /fp (Windows*) option.

IA-32 compiler

The IA-32 compiler does not support the -ftz (Linux) or /Qftz (Windows) option; however, -xK or -xW (Linux) or /QxK or /QxW (Windows) sets Flush-to-Zero mode in the SSE control register, which is the preferred approach. The only other way to enable Flush-to-Zero mode on an Intel® Pentium® 4 processor is to program the SSE control register (MXCSR) manually, as illustrated in the following example:

Example
void SIMDFlushToZero (void)
{
    DWORD SIMDCtrl;
    _asm
    {
        STMXCSR SIMDCtrl
        mov eax, SIMDCtrl
        // flush-to-zero = bit 15
        // mask underflow = bit 11
        // denormals are zero = bit 6
        or eax, 08840h
        mov SIMDCtrl, eax
        LDMXCSR SIMDCtrl
    }
}

Note

Mac OS*: The -ftz option is not supported.

Refer to IA-32 Intel® Architecture Software Developer’s Manual Volume 1: Basic Architecture (http://www.intel.com/design/pentiumii/manuals/243190.htm) for more details about flush to zero or specific bit field settings.
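Where inline assembly is not available, the same read-modify-write of the control register can be sketched with the _mm_getcsr() and _mm_setcsr() intrinsics from xmmintrin.h. This is a minimal sketch, not the compiler's own mechanism: the function name enable_flush_to_zero, the HAVE_MXCSR macro, and the preprocessor test for SSE support are assumptions made for this example. The mask 0x8840 is the same value used in the assembly example (bits 15, 11, and 6).

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_MXCSR 1
#else
#define HAVE_MXCSR 0          /* no SSE control register on this target */
#endif

/* flush-to-zero = bit 15, mask underflow = bit 11, denormals are zero = bit 6 */
#define MXCSR_FTZ_UM_DAZ 0x8840u

/* Returns 1 if the bits are set after the call, 0 if the target has no MXCSR.
   Assumes the processor supports the denormals-are-zero bit; setting it on a
   processor that does not support it raises a fault. */
int enable_flush_to_zero(void)
{
#if HAVE_MXCSR
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ_UM_DAZ);
    return (_mm_getcsr() & MXCSR_FTZ_UM_DAZ) == MXCSR_FTZ_UM_DAZ;
#else
    return 0;
#endif
}
```

The control register is per thread, so a call such as enable_flush_to_zero() affects only the calling thread and would need to be repeated in each thread that performs the computation.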

Itanium® compiler

The Itanium® compiler supports the -ftz (Linux) or /Qftz (Windows) option used to flush denormal results to zero when the application is in the gradual underflow mode. Use this option if the denormal values are not critical to application behavior. The default status of the option is OFF. By default, the compiler lets results gradually underflow.

Use the -ftz (Linux) or /Qftz (Windows) option on the source file containing main(); the option turns on Flush-to-Zero (FTZ) mode for the process started by main(). The initial thread, and any threads subsequently created by that process, will operate in FTZ mode.

By default, the -O3 (Linux) or /O3 (Windows) option enables FTZ; in contrast, the -O2 (Linux) or /O2 (Windows) option disables FTZ. Alternatively, you can use -no-ftz (Linux) or /Qftz- (Windows) to disable flushing denormal results to zero.

For detailed optimization information related to microarchitectural optimization and cycle accounting, refer to Introduction to Microarchitectural Optimization for Itanium® 2 Processors Reference Manual, also known as the "Software Optimization book", document number 251464-001, located at http://www.intel.com/software/products/vtune/techtopic/software_optimization.pdf.

Inexact Floating Point Comparisons

Some floating-point applications exhibit extremely poor performance by not terminating. In many cases, the applications do not terminate because they make exact floating-point comparisons against a given value. The following example demonstrates the concept:

Example
if (foo() == 2.0)

Here, foo() may return a value as close to 2.0 as can be imagined without exactly matching 2.0. You can improve the performance of such code by using inexact (fuzzy) floating-point comparisons that test a value to within a certain tolerance, as shown below:

Example
epsilon = 1E-8;
if (fabs(foo() - 2.0) <= epsilon)
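The tolerance test can be wrapped in a small helper and used as a loop termination condition. The following is a minimal sketch; the names nearly_equal and steps_until_two, the step size of 0.1, and the iteration cap of 1000 are assumptions made for this example. Because 0.1 is not exactly representable in binary floating point, the accumulated sum is not guaranteed to compare equal to exactly 2.0, so an exact comparison could leave such a loop running.

```c
#include <math.h>
#include <stdbool.h>

/* True when value is within epsilon of target (absolute tolerance). */
bool nearly_equal(double value, double target, double epsilon)
{
    return fabs(value - target) <= epsilon;
}

/* Hypothetical iteration: add 0.1 repeatedly until the sum reaches 2.0
   to within 1e-8. The step cap guards against a loop that never converges. */
int steps_until_two(void)
{
    double x = 0.0;
    int steps = 0;
    while (!nearly_equal(x, 2.0, 1e-8) && steps < 1000) {
        x += 0.1;
        ++steps;
    }
    return steps;
}
```

An absolute tolerance such as 1E-8 suits values near 2.0; for values of very different magnitude, a tolerance scaled to the operands may be more appropriate.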