
Floating-Point Number


A floating-point number is a finite or infinite number that is representable in a floating-point format, i.e., a floating-point representation that is not a NaN.

In the IEEE 754-2008 standard, all floating-point numbers, including zeros and infinities, are signed.
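As a small illustration of signed zeros and infinities, the following Python sketch reads the sign of a few values. It assumes that Python's `float` is an IEEE 754 binary format (true on common platforms, but not guaranteed by the language), and the helper name `sign_bit` is hypothetical.

```python
import math

def sign_bit(x: float) -> int:
    """Return 1 if the sign of x is negative, else 0 (hypothetical helper).

    math.copysign transfers the sign of x onto 1.0, so it can tell
    -0.0 from +0.0 even though the two zeros compare equal.
    """
    return 1 if math.copysign(1.0, x) < 0 else 0

print(0.0 == -0.0)                              # True: the zeros compare equal
print(sign_bit(0.0), sign_bit(-0.0))            # but their signs differ
print(sign_bit(math.inf), sign_bit(-math.inf))  # infinities are signed as well
```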

IEEE 754-2008 defines five "basic formats" for floating-point numbers: three binary formats (32-, 64-, and 128-bit) and two decimal formats (64- and 128-bit). It also specifies several "recommended formats" that extend these basic formats to allow for even higher precision. Every basic format is characterized by a radix b in {2,10}, a precision p (i.e., the number of digits in the significand), and an exponent range emin, emin+1, ..., emax fixed by the given format. In general, the nonzero floating-point numbers have the form

 (-1)^s × b^e × m,    (1)

where s in {0,1} indicates the sign of the number, emin<=e<=emax is its exponent, and 0<=m<b is its significand. Note that the description in (1) is framed so that the significand m is viewed in scientific form (with the period or radix point immediately following the first digit), though (1) may be re-expressed to view m as an integer instead (in which case both m and the exponent e in (1) change accordingly).
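The form (1) can be made concrete for the 64-bit binary format. The sketch below is only an illustration: it assumes Python's `float` is stored in the usual binary64 encoding (1 sign bit, 11 exponent bits with bias 1023, 52 stored fraction bits), and the function names `decompose` and `recompose` are hypothetical. It returns m as an exact `Fraction` in scientific form.

```python
import struct
from fractions import Fraction

EMAX = 1023          # binary64 maximum exponent
EMIN = 1 - EMAX      # binary64 minimum exponent
FRAC_BITS = 52       # stored fraction bits, i.e. p - 1 with p = 53

def decompose(x: float):
    """Return (s, e, m) such that x == (-1)**s * 2**e * m exactly.

    Assumes x is finite; m is a Fraction with 0 <= m < 2.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63
    biased = (bits >> FRAC_BITS) & 0x7FF
    frac = bits & ((1 << FRAC_BITS) - 1)
    if biased == 0x7FF:
        raise ValueError("infinities and NaNs have no (s, e, m) form")
    if biased == 0:
        # Zero or subnormal: exponent is emin and the leading digit is 0.
        return s, EMIN, Fraction(frac, 1 << FRAC_BITS)
    # Normal number: the leading digit 1 is implicit in the encoding.
    return s, biased - EMAX, 1 + Fraction(frac, 1 << FRAC_BITS)

def recompose(s: int, e: int, m: Fraction) -> Fraction:
    """Evaluate (-1)^s * 2^e * m exactly."""
    return (-1) ** s * Fraction(2) ** e * m

s, e, m = decompose(-6.5)
print(s, e, m)   # sign, exponent, and significand of -6.5 = -(13/8) * 2**2
```

Because the arithmetic is done with exact rationals, `recompose(*decompose(x)) == Fraction(x)` should hold for every finite `x`, including values such as 0.1 that are not exactly representable in decimal notation.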

                       32-bit binary   64-bit binary   128-bit binary   64-bit decimal   128-bit decimal
precision p (digits)   24              53              113              16               34
emax                   +127            +1023           +16383           +384             +6144

The above table summarizes the characteristics of the five basic number formats. Note that emin=1-emax by definition.
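The 64-bit binary column can be checked against a running Python interpreter. This is a sketch under the assumption that the platform's `float` is binary64. Note that `sys.float_info` follows the C convention in which the significand lies in [1/b, 1), so its `max_exp` and `min_exp` are each one larger than emax and emin as defined here.

```python
import sys

info = sys.float_info

radix = info.radix          # b
p = info.mant_dig           # precision in radix-b digits
emax = info.max_exp - 1     # shift from the C convention to the one used here
emin = info.min_exp - 1

print(radix, p, emax, emin)
print(emin == 1 - emax)     # the relation emin = 1 - emax
```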

extension of           32-bit binary   64-bit binary   128-bit binary   64-bit decimal   128-bit decimal
precision p (digits)   >=32            >=64            >=128            >=22             >=40
emax                   >=1023          >=16383         >=65535          >=6144           >=24576

As mentioned previously, IEEE 754 also provides a framework of recommended formats by which the five basic formats may be extended. The table above summarizes the parameters of these extended-format floating-point numbers. Note that all such formats, both basic and recommended, allow for +infinity and -infinity, +0 and -0, and two kinds of NaN (quiet and signaling).

In the literature, a distinction is made between normal and subnormal floating-point numbers. In particular, the smallest positive normal floating-point number is b^(emin) and the largest finite floating-point number is b^(emax)×(b-b^(1-p)). Nonzero floating-point numbers having magnitude less than b^(emin) may also exist and are called subnormal. Subnormal numbers are characterized by the fact that they always have fewer than p significant digits; moreover, every finite floating-point number is an integral multiple of the smallest subnormal magnitude

 b^(emin)×b^(1-p)

(IEEE Computer Society 2008).
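These bounds can be evaluated exactly for the 64-bit binary format (b = 2, p = 53, emax = 1023). The following sketch again assumes that Python's `float` is binary64; the helper names `is_subnormal` and `quantum_multiple` are hypothetical.

```python
import sys
from fractions import Fraction

B, P, EMAX = 2, 53, 1023
EMIN = 1 - EMAX

# Exact rational values of the three quantities named in the text.
smallest_normal = Fraction(B) ** EMIN
largest_finite = Fraction(B) ** EMAX * (B - Fraction(B) ** (1 - P))
smallest_subnormal = Fraction(B) ** EMIN * Fraction(B) ** (1 - P)

def is_subnormal(x: float) -> bool:
    """True if x is nonzero with magnitude below b^emin (x must be finite)."""
    return x != 0.0 and abs(Fraction(x)) < smallest_normal

def quantum_multiple(x: float) -> int:
    """Return the integer k with x == k * smallest_subnormal (x must be finite)."""
    k = Fraction(x) / smallest_subnormal
    if k.denominator != 1:
        raise ValueError("not an integral multiple; float is not binary64?")
    return k.numerator

print(smallest_normal == Fraction(sys.float_info.min))   # True on binary64
print(largest_finite == Fraction(sys.float_info.max))    # True on binary64
print(is_subnormal(1e-310), is_subnormal(1.0))
print(quantum_multiple(5e-324))   # the literal 5e-324 rounds to 2**-1074
```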


See also

Arithmetic, Biased Exponent, Floating-Point Algebra, Floating-Point Arithmetic, Floating-Point Exponent, Floating-Point Normal Number, Floating-Point Preferred Exponent, Floating-Point Quantum, Floating-Point Representation, IEEE 754-2008, Interval Arithmetic, NaN, Quiet NaN, Signaling NaN, Significand, Subnormal Number

This entry contributed by Christopher Stover


References

Goldberg, D. "What Every Computer Scientist Should Know About Floating-Point Arithmetic." ACM Comput. Surv. 23, 5-48, March 1991. http://docs.sun.com/source/806-3568/ncg_goldberg.html.

Hauser, J. R. "Handling Floating-Point Exceptions in Numeric Programs." ACM Trans. Program. Lang. Sys. 18, 139-174, 1996. http://www.jhauser.us/publications/HandlingFloatingPointExceptions.html.

IEEE Computer Society. "IEEE Standard for Floating-Point Arithmetic: IEEE Std 754-2008 (Revision of IEEE Std 754-1985)." 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4610935.

Severance, C. (Ed.). "IEEE 754: An Interview with William Kahan." Computer, 114-115, Mar. 1998.

Stevenson, D. "A Proposed Standard for Binary Floating-Point Arithmetic: Draft 8.0 of IEEE Task P754." IEEE Comput. 14, 51-62, 1981.

Cite this as:

Stover, Christopher. "Floating-Point Number." From MathWorld--A Wolfram Web Resource, created by Eric W. Weisstein. https://mathworld.wolfram.com/Floating-PointNumber.html
