A floating-point number is a finite or infinite number that is representable in a floating-point format, i.e., a floating-point
representation that is not a NaN.

In the IEEE 754-2008 standard, all floating-point numbers, including zeros and infinities, are signed.

IEEE 754-2008 defines five "basic formats" for floating-point numbers: three binary formats (32-, 64-, and 128-bit) and two decimal formats (64- and 128-bit). It also specifies several "recommended formats" for extending these basic formats to allow for even higher precision. Every basic format is characterized by a radix $b$, a precision $p$ (i.e., the number of digits in the significand), and an exponent range $e_{\min} \le e \le e_{\max}$ specific to the given format. In general, the nonzero floating-point numbers have the form

$$(-1)^s \times b^e \times m \qquad (1)$$

where $s$ indicates the sign of the number, $e$ is its exponent, and $m$ is its significand. Note that the description in (1) is framed so that the significand $m$ is viewed in scientific form (with the radix point immediately following the first digit), though (1) may be re-expressed to view $m$ as an integer instead (whereby both $m$ and the exponent in (1) change accordingly).
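As a rough illustration of the form $(-1)^s \times b^e \times m$, the sketch below splits a Python `float` into a sign, an exponent, and a significand, and also shows the integer-significand view. It assumes that Python's `float` is the 64-bit binary format (radix 2, 53 significand digits), which holds on common platforms but is not guaranteed by the language; the helper names `decompose` and `integer_form` are hypothetical.

```python
import struct

P = 53          # assumed precision of the 64-bit binary format
E_MAX = 1023    # assumed emax of the 64-bit binary format
E_MIN = 1 - E_MAX

def decompose(x: float):
    """Return (s, e, m) with x == (-1)**s * 2**e * m for finite nonzero x.

    m is in scientific form: 1 <= m < 2 for normal numbers,
    0 < m < 1 for subnormal numbers (where e is pinned to E_MIN).
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                      # sign bit
    biased = (bits >> 52) & 0x7FF       # 11-bit biased exponent field
    frac = bits & ((1 << 52) - 1)       # 52 stored significand bits
    if biased == 0x7FF:
        raise ValueError("infinity or NaN has no (s, e, m) form")
    if biased == 0:
        if frac == 0:
            raise ValueError("zero is excluded from the nonzero form")
        return s, E_MIN, frac / 2**52   # subnormal: leading digit is 0
    return s, biased - E_MAX, 1 + frac / 2**52

def integer_form(x: float):
    """Return (s, q, M) with x == (-1)**s * 2**q * M and M an integer."""
    s, e, m = decompose(x)
    return s, e - (P - 1), int(m * 2**(P - 1))

s, e, m = decompose(-6.5)
print(s, e, m)              # sign, exponent, significand of -6.5
print(integer_form(-6.5))   # same value with an integer significand
```

Moving the radix point to the right of the last significand digit multiplies the significand by $b^{p-1}$, so the exponent drops by $p - 1$; that is the "change of format" the integer view requires.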

| Parameter | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
|---|---|---|---|---|---|
| $p$ (digits) | 24 | 53 | 113 | 16 | 34 |
| $e_{\max}$ | +127 | +1023 | +16383 | +384 | +6144 |

The above table summarizes the characteristics of the five basic number formats. Note that $e_{\min} = 1 - e_{\max}$ by definition.
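The table and the relation $e_{\min} = 1 - e_{\max}$ can be checked against a running interpreter. The sketch below is only illustrative: it assumes Python's `float` is the 64-bit binary format, and it relies on the fact that `sys.float_info` follows the C convention in which the significand lies in $[1/b, 1)$, so its `max_exp` and `min_exp` are each one larger than the $e_{\max}$ and $e_{\min}$ used here. The names `BASIC_FORMATS`, `emin_from_emax`, and `matches_binary64` are hypothetical.

```python
import sys

# (radix b, precision p, emax) for the five basic formats, copied from the table.
BASIC_FORMATS = {
    "32-bit binary": (2, 24, 127),
    "64-bit binary": (2, 53, 1023),
    "128-bit binary": (2, 113, 16383),
    "64-bit decimal": (10, 16, 384),
    "128-bit decimal": (10, 34, 6144),
}

def emin_from_emax(emax: int) -> int:
    """emin = 1 - emax, as stated for the basic formats."""
    return 1 - emax

def matches_binary64() -> bool:
    """Compare sys.float_info with the 64-bit binary row (assumes float is binary64)."""
    b, p, emax = BASIC_FORMATS["64-bit binary"]
    info = sys.float_info
    return (
        info.radix == b
        and info.mant_dig == p
        and info.max_exp - 1 == emax                    # C convention is off by one
        and info.min_exp - 1 == emin_from_emax(emax)
    )

for name, (b, p, emax) in BASIC_FORMATS.items():
    print(f"{name:>15}: b={b:2d}  p={p:3d}  emin={emin_from_emax(emax):6d}  emax={emax:5d}")
print("float matches 64-bit binary:", matches_binary64())
```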

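| Extended format associated with | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
|---|---|---|---|---|---|
| $p$ (digits) | $\ge 32$ | $\ge 64$ | $\ge 128$ | $\ge 22$ | $\ge 40$ |
| $e_{\max}$ | $\ge 1023$ | $\ge 16383$ | $\ge 65535$ | $\ge 6144$ | $\ge 24576$ |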

As mentioned previously, IEEE 754 also provides a framework of recommended formats by which the five basic formats may be extended. The table above summarizes the parameters of these extended-format floating-point numbers. Note that all such formats, both basic and recommended, allow for $+0$ and $-0$, $\pm\infty$, and two NaNs (quiet and signaling).
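The special values can be observed directly in most languages. The following minimal sketch uses Python's `float`, assumed here to be the 64-bit binary format; it shows that the two zeros compare equal yet carry different signs, that the infinities are signed, and that a NaN compares unequal to itself. Python does not offer a portable way to create a signaling NaN, so only a quiet NaN appears; `sign_bit` is a hypothetical helper.

```python
import math

def sign_bit(x: float) -> int:
    """Return 1 if the sign of x is negative, else 0 (works for zeros and infinities)."""
    return 1 if math.copysign(1.0, x) < 0 else 0

pos_zero, neg_zero = 0.0, -0.0
pos_inf, neg_inf = math.inf, -math.inf
quiet_nan = math.nan

print(pos_zero == neg_zero)                    # True: the zeros compare equal
print(sign_bit(pos_zero), sign_bit(neg_zero))  # 0 1: but their signs differ
print(sign_bit(pos_inf), sign_bit(neg_inf))    # 0 1: infinities are signed too
print(quiet_nan == quiet_nan)                  # False: a NaN is unordered
print(math.isnan(quiet_nan), math.isinf(neg_inf))
```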

In the literature, a distinction is made between normal and subnormal floating-point numbers. In particular, the smallest positive normal floating-point number is $b^{e_{\min}}$ and the largest is $b^{e_{\max}}\,(b - b^{1-p})$; on the other hand, nonzero floating-point numbers having magnitude less than $b^{e_{\min}}$ may exist and are called subnormal. Subnormal numbers are characterized by the fact that they always have fewer than $p$ significant digits; moreover, every finite floating-point number is an integral multiple of the smallest subnormal magnitude $b^{e_{\min}+1-p}$.
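These bounds can be verified with exact rational arithmetic. The sketch below takes the smallest normal number as $b^{e_{\min}}$, the largest finite number as $b^{e_{\max}}(b - b^{1-p})$, and the smallest subnormal magnitude as $b^{e_{\min}+1-p}$, and compares them with the values Python reports. It assumes `float` is the 64-bit binary format ($b = 2$, $p = 53$, $e_{\max} = 1023$); the helper names are hypothetical.

```python
import math
import sys
from fractions import Fraction

b, p, emax = 2, 53, 1023        # assumed 64-bit binary parameters
emin = 1 - emax

smallest_normal = Fraction(b) ** emin
largest_finite = Fraction(b) ** emax * (b - Fraction(b) ** (1 - p))
smallest_subnormal = Fraction(b) ** (emin + 1 - p)

def is_subnormal(x: float) -> bool:
    """True for nonzero finite x with magnitude below the smallest normal number."""
    return math.isfinite(x) and x != 0 and abs(Fraction(x)) < smallest_normal

def is_multiple_of_smallest_subnormal(x: float) -> bool:
    """True if finite x is an integer multiple of b**(emin + 1 - p)."""
    return (Fraction(x) / smallest_subnormal).denominator == 1

print(Fraction(sys.float_info.min) == smallest_normal)
print(Fraction(sys.float_info.max) == largest_finite)
print(Fraction(5e-324) == smallest_subnormal)
print(is_subnormal(1e-310), is_subnormal(1.0))
print(all(is_multiple_of_smallest_subnormal(x) for x in (0.1, -6.5, 1e-310, sys.float_info.max)))
```

`Fraction(x)` converts a float to the exact rational it represents, so the comparisons involve no rounding.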

Goldberg, D. "What Every Computer Scientist Should Know About Floating-Point Arithmetic." ACM Comput. Surv. 23, 5-48, March 1991. http://docs.sun.com/source/806-3568/ncg_goldberg.html.

Hauser, J. R. "Handling Floating-Point Exceptions in Numeric Programs." ACM Trans. Program. Lang. Sys. 18, 139-174, 1996. http://www.jhauser.us/publications/HandlingFloatingPointExceptions.html.

IEEE Computer Society. "IEEE Standard for Floating-Point Arithmetic: IEEE Std 754-2008 (Revision of IEEE Std 754-1985)." 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4610935.

Severance, C. (Ed.). "IEEE 754: An Interview with William Kahan." Computer, 114-115, Mar. 1998.

Stevenson, D. "A Proposed Standard for Binary Floating-Point Arithmetic: Draft 8.0 of IEEE Task P754." IEEE Comput. 14, 51-62, 1981.