Floating-Point Number

A floating-point number is a finite or infinite number that is representable in a floating-point format, i.e., a floating-point representation that is not a NaN.

In the IEEE 754-2008 standard, all floating-point numbers, including zeros and infinities, are signed.

IEEE 754-2008 allows for five "basic formats" for floating-point numbers, including three binary formats (32-, 64-, and 128-bit) and two decimal formats (64- and 128-bit); it also specifies several "recommended formats" for extending these basic formats to allow for even higher precision. All basic numerical formats are characterized by specifying a radix $b$, a precision $p$ (i.e., the number of digits in the significand), and an exponent range $e_{\min} \le e \le e_{\max}$ determined by the given format. In general, the nonzero floating-point numbers have the form

$$(-1)^s \times b^e \times m \tag{1}$$

where $s \in \{0, 1\}$ indicates the sign of the number, $e$ is its exponent, and $m$ is its significand. Note that the description in (1) is framed so that the significand $m = d_0.d_1 d_2 \cdots d_{p-1}$ is viewed in scientific form (with the period or radix point immediately following the first digit), though (1) may be re-expressed to view $m$ as an integer instead (whereby both $m$ and the exponent $e$ in (1) will change format accordingly).
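The decomposition in (1) can be illustrated for the 64-bit binary format with a short Python sketch. It writes the sign as `s`, the exponent as `e`, and the significand as `m`; the helper names `decompose` and `integer_view` are hypothetical, and the code assumes that Python's `float` is a 64-bit binary format with 53-digit precision, which holds on common platforms but is not guaranteed by the language.

```python
import math

def decompose(x: float):
    """Return (s, e, m) with x == (-1)**s * 2**e * m and 1 <= m < 2.

    Hypothetical helper for finite nonzero inputs. For subnormal inputs the
    returned e lies below the format's minimum exponent, because the
    significand is renormalized here.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("expected a finite nonzero float")
    s = 0 if math.copysign(1.0, x) > 0 else 1
    frac, exp = math.frexp(abs(x))   # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    return s, exp - 1, frac * 2.0    # rescale so that 1 <= m < 2

def integer_view(x: float, p: int = 53):
    """Return (s, q, c) with x == (-1)**s * 2**q * c and c an integer.

    Same value as decompose(x), but with the significand viewed as an
    integer; p = 53 is the assumed precision of the 64-bit binary format.
    """
    s, e, m = decompose(x)
    c = int(math.ldexp(m, p - 1))    # shift the radix point p - 1 places right
    return s, e - (p - 1), c

print(decompose(-6.5))  # (1, 2, 1.625)
```

Here $-6.5 = (-1)^1 \times 2^2 \times 1.625$; the integer view describes the same value with a larger significand and a correspondingly smaller exponent.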

	32-bit binary	64-bit binary	128-bit binary	64-bit decimal	128-bit decimal
precision $p$ (digits)	24	53	113	16	34
$e_{\max}$	+127	+1023	+16383	+384	+6144

The above table summarizes the characteristics of the five basic number formats. Note that $e_{\min} = 1 - e_{\max}$ by definition.
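As a rough cross-check of the basic-format parameters, the following Python sketch tabulates them and compares the 64-bit binary row with `sys.float_info`. The dictionary `BASIC_FORMATS` and the helper `emin` are hypothetical names introduced here; the minimum exponent is computed as one minus the maximum exponent, as IEEE 754-2008 defines it. The comparison assumes that Python's `float` is the 64-bit binary format, and it accounts for the C convention used by `sys.float_info`, whose `max_exp` and `min_exp` are each one greater than the corresponding IEEE exponent.

```python
import sys

# (radix b, precision p, emax) for the five basic formats, copied from the table.
BASIC_FORMATS = {
    "binary32": (2, 24, 127),
    "binary64": (2, 53, 1023),
    "binary128": (2, 113, 16383),
    "decimal64": (10, 16, 384),
    "decimal128": (10, 34, 6144),
}

def emin(emax: int) -> int:
    """Minimum exponent, defined as 1 - emax for every format."""
    return 1 - emax

for name, (b, p, emax) in BASIC_FORMATS.items():
    print(f"{name:10s} b={b:2d} p={p:3d} emin={emin(emax):6d} emax={emax:5d}")

# Cross-check against the platform float (assumed to be 64-bit binary).
fi = sys.float_info
b, p, emax = BASIC_FORMATS["binary64"]
assert (fi.radix, fi.mant_dig) == (b, p)
assert fi.max_exp - 1 == emax          # C convention is offset by one
assert fi.min_exp - 1 == emin(emax)
```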

	32-bit binary	64-bit binary	128-bit binary	64-bit decimal	128-bit decimal
precision $p$ (digits)	≥ 32	≥ 64	≥ 128	≥ 22	≥ 40
$e_{\max}$	≥ 1023	≥ 16383	≥ 65535	≥ 6144	≥ 24576

As mentioned previously, IEEE 754 also provides a framework of recommended formats by which the five basic formats may be extended. The table above summarizes the parameters of these extended-format floating-point numbers. Note that all such formats, both basic and recommended, allow for the signed zeros $+0$ and $-0$, the infinities $+\infty$ and $-\infty$, and two NaNs (quiet and signaling).
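The special values mentioned above can be observed directly in Python, again under the assumption that `float` is an IEEE 754 64-bit binary format. This is only an illustrative sketch: it shows a quiet NaN but not a signaling one, since the standard library offers no portable way to create the latter for `float`.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

# The two zeros compare equal, but the sign is still observable.
print(pos_zero == neg_zero)            # True
print(math.copysign(1.0, pos_zero))    # 1.0
print(math.copysign(1.0, neg_zero))    # -1.0

# Infinities are signed as well.
print(math.inf > 0, -math.inf < 0)     # True True
print(math.isinf(-math.inf))           # True

# A NaN is not a floating-point number in the sense used here: it is
# unordered and compares unequal even to itself.
nan = math.nan
print(nan == nan, math.isnan(nan))     # False True
```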

In the literature, a distinction is made between normal and subnormal floating-point numbers. In particular, the smallest positive normal floating-point number is $b^{e_{\min}}$ and the largest is $b^{e_{\max}} \times (b - b^{1-p})$; on the other hand, nonzero floating-point numbers having magnitude less than $b^{e_{\min}}$ may exist and are called subnormal. Subnormal numbers are characterized by the fact that they always have fewer than $p$ significant digits; moreover, every finite floating-point number is an integral multiple of the smallest subnormal magnitude

$$b^{e_{\min}} \times b^{1-p}$$

(IEEE Computer Society 2008).
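For the 64-bit binary format (radix 2, precision 53, maximum exponent 1023, minimum exponent -1022), the three magnitudes given by IEEE 754-2008 can be evaluated and compared with what Python reports. The sketch below assumes that `float` is this format; the helper names `is_subnormal` and `is_multiple_of_smallest_subnormal` are hypothetical. Exact rational arithmetic is used for the multiple check because an ordinary floating-point division by the smallest subnormal would overflow for most inputs.

```python
import math
import sys
from fractions import Fraction

b, p, emax = 2, 53, 1023      # 64-bit binary parameters
emin = 1 - emax

smallest_normal = math.ldexp(1.0, emin)                  # b**emin
largest_finite = (2.0 - math.ldexp(1.0, 1 - p)) * math.ldexp(1.0, emax)
#                 b**emax * (b - b**(1 - p)), evaluated without overflow
smallest_subnormal = math.ldexp(1.0, emin + 1 - p)       # b**emin * b**(1 - p)

assert smallest_normal == sys.float_info.min
assert largest_finite == sys.float_info.max
assert smallest_subnormal == 5e-324

def is_subnormal(x: float) -> bool:
    """True for nonzero values with magnitude below the smallest normal."""
    return 0.0 < abs(x) < smallest_normal

def is_multiple_of_smallest_subnormal(x: float) -> bool:
    """Check exactly that a finite x is an integer multiple of the smallest subnormal."""
    return (Fraction(x) / Fraction(smallest_subnormal)).denominator == 1

for x in (0.1, 1e-310, -3.0, sys.float_info.max):
    print(x, is_subnormal(x), is_multiple_of_smallest_subnormal(x))
```

Of the four sample values, only `1e-310` is subnormal, while all four are integer multiples of the smallest subnormal magnitude.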

References

Goldberg, D. "What Every Computer Scientist Should Know About Floating-Point Arithmetic." ACM Comput. Surv. 23, 5-48, March 1991. http://docs.sun.com/source/806-3568/ncg_goldberg.html.
Hauser, J. R. "Handling Floating-Point Exceptions in Numeric Programs." ACM Trans. Program. Lang. Sys. 18, 139-174, 1996. http://www.jhauser.us/publications/HandlingFloatingPointExceptions.html.
IEEE Computer Society. "IEEE Standard for Floating-Point Arithmetic: IEEE Std 754-2008 (Revision of IEEE Std 754-1985)." 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4610935.
Severance, C. (Ed.). "IEEE 754: An Interview with William Kahan." Computer, 114-115, Mar. 1998.
Stevenson, D. "A Proposed Standard for Binary Floating-Point Arithmetic: Draft 8.0 of IEEE Task P754." IEEE Comput. 14, 51-62, 1981.

Cite this as:

Stover, Christopher. "Floating-Point Number." From MathWorld--A Wolfram Resource, created by Eric W. Weisstein. https://mathworld.wolfram.com/Floating-PointNumber.html
