# Floating-Point Number

A floating-point number is a finite or infinite number that is representable in a floating-point format, i.e., a floating-point representation that is not a NaN.

In the IEEE 754-2008 standard, all floating-point numbers, including zeros and infinities, are signed.

IEEE 754-2008 allows for five "basic formats" for floating-point numbers, including three binary formats (32-, 64-, and 128-bit) and two decimal formats (64- and 128-bit); it also specifies several "recommended formats" for extending these basic formats to allow for even higher precision. All basic numerical formats are characterized by specifying a radix $b$, a precision $p$ (i.e., the number of digits in the significand), and an exponent range $e_{\min} \le e \le e_{\max}$ fixed by the given format. In general, the nonzero floating-point numbers have the form

$$x = (-1)^s \times b^e \times m \qquad (1)$$

where $s$ indicates the sign of the number, $e$ is its exponent, and $m$ is its significand. Note that the description in (1) is framed so that the significand $m = d_0.d_1 d_2 \cdots d_{p-1}$ is viewed in scientific form (with the period or radix point immediately following the first digit), though (1) may be re-expressed to view the significand as an integer instead (whereby $m$ is replaced by the integer $c = m \times b^{p-1}$ and the exponent $e$ by $q = e - (p - 1)$).
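As a minimal sketch of the form (1), assuming Python floats are IEEE 754 64-bit binary values (radix 2, precision 53, which holds on common platforms but is not guaranteed by the language), the following splits a nonzero finite float into a sign, an exponent, and a significand in scientific form, and then into the equivalent integer-significand form. The helper names `decompose` and `to_integer_form` are hypothetical.

```python
import math

P = 53  # precision of the 64-bit binary format (assumed for Python floats)

def decompose(x: float):
    """Return (s, e, m) with x == (-1)**s * 2**e * m and 1 <= m < 2.

    For subnormal x this still returns 1 <= m < 2, with e below emin,
    rather than the format's own subnormal encoding.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("x must be nonzero and finite")
    s = 0 if math.copysign(1.0, x) > 0 else 1
    frac, exp = math.frexp(abs(x))  # abs(x) == frac * 2**exp, 0.5 <= frac < 1
    m = frac * 2.0                  # move the radix point: 1 <= m < 2
    e = exp - 1
    return s, e, m

def to_integer_form(s: int, e: int, m: float):
    """Re-express (s, e, m) with an integer significand c and exponent q."""
    c = int(m * 2 ** (P - 1))  # exact, since m has at most P significant bits
    q = e - (P - 1)
    return s, q, c

s, e, m = decompose(-6.5)            # -6.5 == -(1.625 * 2**2)
_, q, c = to_integer_form(s, e, m)   # -6.5 == -(c * 2**q)
```

Both forms describe the same value; only the position of the radix point in the significand, and hence the exponent, differs.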

| Parameter | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
|---|---|---|---|---|---|
| digits of precision $p$ | 24 | 53 | 113 | 16 | 34 |
| $e_{\max}$ | +127 | +1023 | +16383 | +384 | +6144 |

The above table summarizes the characteristics of the five basic number formats. Note that $e_{\min} = 1 - e_{\max}$ by definition.
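The parameters of the 64-bit binary format can be checked from Python, as a sketch that assumes the interpreter's `float` is that format. Note that `sys.float_info` follows the C convention of a significand in $[0.5, 1)$, so its exponent limits are one greater than the IEEE 754 values of $e_{\max}$ and $e_{\min}$.

```python
import sys

info = sys.float_info

p = info.mant_dig        # digits in the significand (53 for 64-bit binary)
emax = info.max_exp - 1  # convert from the C convention to IEEE 754 emax
emin = info.min_exp - 1  # convert from the C convention to IEEE 754 emin

print("radix:", info.radix)
print("precision p:", p)
print("emax:", emax, "emin:", emin)
print("emin == 1 - emax:", emin == 1 - emax)
```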

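| Extended format associated with | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
|---|---|---|---|---|---|
| digits of precision $p \ge$ | 32 | 64 | 128 | 22 | 40 |
| $e_{\max} \ge$ | 1023 | 16383 | 65535 | 6144 | 24576 |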

As mentioned previously, IEEE 754 also provides a framework of recommended formats by which the five basic formats may be extended. The table above summarizes the parameters of these extended-format floating-point numbers. Note that all such formats, both basic and recommended, allow for $+0$ and $-0$, $+\infty$ and $-\infty$, and two NaNs (quiet and signaling).
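The signed zeros, signed infinities, and NaN can be observed directly in Python, again assuming its `float` is an IEEE 754 binary format. This is only an illustrative sketch: `math.nan` is a quiet NaN, and Python does not expose signaling NaNs through its standard arithmetic.

```python
import math

pos_zero, neg_zero = 0.0, -0.0
pos_inf, neg_inf = math.inf, -math.inf
nan = math.nan

# The two zeros compare equal, but their signs can be told apart.
zeros_equal = (pos_zero == neg_zero)
neg_zero_sign = math.copysign(1.0, neg_zero)   # carries the sign of -0.0
pos_zero_sign = math.copysign(1.0, pos_zero)

# The infinities are signed and bracket every finite number.
ordered = neg_inf < -1e308 < 1e308 < pos_inf

# A NaN is not a floating-point number; it compares unequal even to itself.
nan_self_equal = (nan == nan)

print(zeros_equal, neg_zero_sign, pos_zero_sign, ordered, nan_self_equal)
```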

In the literature, a distinction is made between normal and subnormal floating-point numbers. In particular, the smallest positive normal floating-point number is $b^{e_{\min}}$ and the largest is $b^{e_{\max}} \times (b - b^{1-p})$; on the other hand, non-zero floating-point numbers having magnitude less than $b^{e_{\min}}$ may exist and are called subnormal. Subnormal numbers are characterized by the fact that they always have fewer than $p$ significant digits; moreover, every finite floating-point number is an integral multiple of the smallest subnormal magnitude $b^{e_{\min}} \times b^{1-p}$ (IEEE Computer Society 2008).
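These bounds can be verified with exact rational arithmetic. The sketch below assumes Python's `float` is the 64-bit binary format ($b = 2$, $p = 53$, $e_{\max} = 1023$, $e_{\min} = -1022$) and that subnormals are not flushed to zero; the helper names `is_subnormal` and `quantum_count` are hypothetical.

```python
import sys
from fractions import Fraction

b, p = 2, 53
emax, emin = 1023, -1022

# Exact values of the three magnitudes described above.
smallest_normal = Fraction(b) ** emin
largest_normal = Fraction(b) ** emax * (b - Fraction(b) ** (1 - p))
smallest_subnormal = Fraction(b) ** emin * Fraction(b) ** (1 - p)  # 2**-1074

def is_subnormal(x: float) -> bool:
    """True if x is nonzero with magnitude below the smallest normal number."""
    return x != 0 and abs(Fraction(x)) < smallest_normal

def quantum_count(x: float) -> Fraction:
    """Return x as a multiple of the smallest subnormal magnitude (finite x only)."""
    return Fraction(x) / smallest_subnormal

print(Fraction(sys.float_info.min) == smallest_normal)
print(Fraction(sys.float_info.max) == largest_normal)
print(is_subnormal(1e-310), is_subnormal(1.0))
print(quantum_count(0.1).denominator == 1)  # integral multiple
```

`Fraction(x)` converts a finite float to its exact rational value, so none of the comparisons here are affected by rounding.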

See also: Arithmetic, Biased Exponent, Floating-Point Algebra, Floating-Point Arithmetic, Floating-Point Exponent, Floating-Point Normal Number, Floating-Point Preferred Exponent, Floating-Point Quantum, Floating-Point Representation, IEEE 754-2008, Interval Arithmetic, NaN, Quiet NaN, Signaling NaN, Significand, Subnormal Number

This entry contributed by Christopher Stover

## References

Goldberg, D. "What Every Computer Scientist Should Know About Floating-Point Arithmetic." ACM Comput. Surv. 23, 5-48, March 1991. http://docs.sun.com/source/806-3568/ncg_goldberg.html.

Hauser, J. R. "Handling Floating-Point Exceptions in Numeric Programs." ACM Trans. Program. Lang. Sys. 18, 139-174, 1996. http://www.jhauser.us/publications/HandlingFloatingPointExceptions.html.

IEEE Computer Society. "IEEE Standard for Floating-Point Arithmetic: IEEE Std 754-2008 (Revision of IEEE Std 754-1985)." 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4610935.

Severance, C. (Ed.). "IEEE 754: An Interview with William Kahan." Computer, 114-115, Mar. 1998.

Stevenson, D. "A Proposed Standard for Binary Floating-Point Arithmetic: Draft 8.0 of IEEE Task P754." IEEE Comput. 14, 51-62, 1981.

## Cite this as:

Stover, Christopher. "Floating-Point Number." From MathWorld--A Wolfram Web Resource, created by Eric W. Weisstein. https://mathworld.wolfram.com/Floating-PointNumber.html