# Floating-Point Arithmetic

Simply stated, floating-point arithmetic is arithmetic performed on floating-point representations by any number of automated devices.

Traditionally, this definition is phrased so as to apply only to arithmetic performed on floating-point representations of real numbers (i.e., to finite elements of the collection of floating-point numbers), though several additional types of floating-point data, including signed infinities and NaNs, are also commonly allowed as inputs to such operations.

Despite the succinctness of the definition, it is worth noting that the most widely adopted standards in computing consider nearly the entirety of floating-point theory under the heading "floating-point arithmetic." One reason for this breadth is that any floating-point representation can account for only a finite subset of the continuum of real numbers; this finiteness presents a variety of obstacles, chief among which is the fact that certain properties of real arithmetic (e.g., associativity of addition) sometimes fail to hold for floating-point numbers (IEEE Computer Society 2008). As a result, any comprehensive treatment of floating-point arithmetic and/or algebra must address numerous preliminaries, including representations of floating-point numbers and rounding, before ever discussing the actual operations themselves.
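The failure of associativity can be observed directly. The following is a minimal sketch, assuming Python floats are IEEE 754 binary64 values (true on essentially all current platforms); the operands are illustrative choices and do not come from the standard.

```python
# Illustrative operands; none of 0.1, 0.2, 0.3 is exactly representable in binary.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # a + b is rounded first, then the sum with c is rounded
right = a + (b + c)  # b + c is rounded first, then the sum with a is rounded

print(left == right)  # the two groupings round differently
print(left, right)
```

Each grouping performs two roundings, but on different intermediate values, which is why the two results need not agree.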

As of 2014, the most commonly implemented standard for floating-point arithmetic is the IEEE Standard 754-2008 for Floating-Point Arithmetic (written shorthand as IEEE 754-2008 and as IEEE 754 henceforth). This framework is a massive overhaul of its predecessor, IEEE 754-1985, and includes a built-in collection of guidelines specifying nearly every conceivable aspect of floating-point theory. In particular, IEEE 754 addresses the following aspects of floating-point theory in considerable detail:

1. Floating-point representations and formats.

2. Attributes of floating-point representations, including rounding of floating-point numbers.

3. Arithmetic and algebraic operations on floating-point representations.

4. Infinity, non-numbers (NaNs), signs, and exceptions.

A number of the above topics are discussed across multiple sections of the standard's documentation (IEEE Computer Society 2008).
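As a brief illustration of the fourth item, the sketch below shows how signed zeros, infinities, and NaNs behave in Python, assuming its floats follow IEEE 754 binary64. The variable names are chosen here for illustration only.

```python
import math

inf = float("inf")
nan = float("nan")

# Subtracting like-signed infinities is an invalid operation; the result is a NaN.
inf_minus_inf = inf - inf

# A NaN compares unequal to everything, including itself.
nan_equals_itself = (nan == nan)

# Negative zero compares equal to positive zero but still carries its sign.
neg_zero = -0.0
zeros_compare_equal = (neg_zero == 0.0)
sign_of_neg_zero = math.copysign(1.0, neg_zero)

print(math.isnan(inf_minus_inf), nan_equals_itself)
print(zeros_compare_equal, sign_of_neg_zero)
```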

The "required" arithmetical operations defined by IEEE 754 on floating-point representations are addition, subtraction, multiplication, division, square root, and fused multiply-add (a ternary operation defined by $\operatorname{fma}(x, y, z) = (x \times y) + z$ and computed with a single rounding); these are required in the sense that adherence to the framework requires these operations to be supported with correct rounding throughout. A number of other "recommended" operations are also provided within the framework, some of which are arithmetic in nature; these are recommended in the sense that support for them is not strictly required by the framework. Finally, note that the framework includes a collection of utility functions which may also be considered arithmetic, namely copy, negate, and abs, as well as a number of closely related functions defined for vector-valued input (IEEE Computer Society 2008, pp. 46-47).
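The effect of the single rounding in fused multiply-add can be sketched by emulation. The helper `fma_emulated` below is hypothetical (it is not a library function): it uses exact rational arithmetic from Python's `fractions` module and rounds once at the end. It handles finite inputs only, and the operands are illustrative.

```python
from fractions import Fraction

def fma_emulated(x: float, y: float, z: float) -> float:
    """Hypothetical helper: compute (x * y) + z exactly, then round once.

    Finite inputs only; Fraction cannot represent infinities or NaNs.
    """
    exact = Fraction(x) * Fraction(y) + Fraction(z)
    return float(exact)  # the only rounding step

# Illustrative operands: the exact product is 1 - 2**-60.
x = 1.0 + 2.0**-30
y = 1.0 - 2.0**-30
z = -1.0

unfused = x * y + z            # the product is rounded to 1.0 before z is added
fused = fma_emulated(x, y, z)  # the term -2**-60 survives the single rounding

print(unfused, fused)
```

With separate multiplication and addition, the product is rounded before the addition and the small term is lost; rounding once at the end preserves it.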

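| operation | function | domain | possible exceptions |
| --- | --- | --- | --- |
| exp | $e^x$ | $[-\infty, +\infty]$ | Overflow; Underflow |
| expm1 | $e^x - 1$ | $[-\infty, +\infty]$ | Overflow; Underflow |
| exp2 | $2^x$ | $[-\infty, +\infty]$ | Overflow; Underflow |
| exp2m1 | $2^x - 1$ | $[-\infty, +\infty]$ | Overflow; Underflow |
| exp10 | $10^x$ | $[-\infty, +\infty]$ | Overflow; Underflow |
| exp10m1 | $10^x - 1$ | $[-\infty, +\infty]$ | Overflow; Underflow |
| log | $\ln x$ | $[0, +\infty]$ | Divide By Zero (if $x = 0$); Invalid Operation (if $x < 0$) |
| log2 | $\log_2 x$ | $[0, +\infty]$ | Divide By Zero (if $x = 0$); Invalid Operation (if $x < 0$) |
| log10 | $\log_{10} x$ | $[0, +\infty]$ | Divide By Zero (if $x = 0$); Invalid Operation (if $x < 0$) |
| logp1 | $\ln(1 + x)$ | $[-1, +\infty]$ | Divide By Zero (if $x = -1$); Invalid Operation (if $x < -1$); Underflow |
| log2p1 | $\log_2(1 + x)$ | $[-1, +\infty]$ | Divide By Zero (if $x = -1$); Invalid Operation (if $x < -1$); Underflow |
| log10p1 | $\log_{10}(1 + x)$ | $[-1, +\infty]$ | Divide By Zero (if $x = -1$); Invalid Operation (if $x < -1$); Underflow |
| hypot | $\sqrt{x^2 + y^2}$ | $[-\infty, +\infty] \times [-\infty, +\infty]$ | Overflow; Underflow |
| rSqrt | $1/\sqrt{x}$ | $[0, +\infty]$ | Invalid Operation (if $x < 0$); Divide By Zero (if $x = \pm 0$) |
| compound | $(1 + x)^n$ | $[-1, +\infty] \times \mathbb{Z}$ | Invalid Operation (if $x < -1$) |
| rootn | $x^{1/n}$ | $[-\infty, +\infty] \times \mathbb{Z}$ | Invalid Operation (if $n = 0$ or $x < 0$ and $n$ even); Overflow/Underflow (if $n = -1$) |
| pown | $x^n$ | $[-\infty, +\infty] \times \mathbb{Z}$ | Several cases |
| pow | $x^y$ | $[-\infty, +\infty] \times [-\infty, +\infty]$ | Several cases |
| powr | $x^y$ | $[0, +\infty] \times [-\infty, +\infty]$ | Several cases |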

The above table summarizes the recommended arithmetic operations within IEEE 754. Note that the particulars of the exceptions labeled "Several cases" are addressed in detail in the IEEE 754 documentation (IEEE Computer Society 2008, pp. 43-45).
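How such exceptional cases surface depends on the programming environment. The sketch below probes Python's `math` module, which reports some of these conditions by raising Python exceptions rather than by setting IEEE 754 status flags; the mapping shown is Python's behavior and is not prescribed by the standard. The helper `classify` is a hypothetical name introduced here.

```python
import math

def classify(func, x):
    """Hypothetical helper: report how Python's math module reacts to func(x)."""
    try:
        return ("ok", func(x))
    except OverflowError:
        return ("overflow", None)
    except ValueError:
        return ("domain error", None)

overflow_case = classify(math.exp, 1000.0)     # result too large for binary64
underflow_case = classify(math.exp, -1000.0)   # result silently underflows
log_zero_case = classify(math.log, 0.0)        # Divide By Zero row of the table
log_negative_case = classify(math.log, -1.0)   # Invalid Operation row of the table

for case in (overflow_case, underflow_case, log_zero_case, log_negative_case):
    print(case)
```

Note that Python reports both the zero and the negative argument to `math.log` as a `ValueError`, so the distinction drawn in the table between Divide By Zero and Invalid Operation is not visible at this level.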

As noted above, even some of the basic required arithmetic operators can give results that differ from those of real arithmetic in light of floating-point representations and rounding. This stems from the fact that the "normal" arithmetic operations are assumed within IEEE 754 to have infinite precision, while the values of floating-point addition, subtraction, multiplication, and division, written symbolically as $\oplus$, $\ominus$, $\otimes$, and $\oslash$, respectively, are computed by performing the "normal" operations of $+$, $-$, $\times$, and $/$, respectively, on floating-point numbers written in terms of a common exponent and rounding the result to a fixed number of significant digits (by way of the so-called preferred exponent) afterward. As a result, loss of precision, overflow, and underflow can all occur during the arithmetic and/or rounding steps of the computation. For example, the result of adding and is exactly

 (1)

On the other hand, in a framework with radix 10 and 7-digit precision, the value returned by floating-point addition would be

 (2)

Similarly, given and , one has that

 (3)

using the 7-digit precision assumed above. However, one has that

 (4)

thus yielding a complete lack of precision. Note that in extreme cases like this, systems implementing IEEE 754 won't actually yield as a result: In particular, such a scenario will trigger an underflow warning. Details and caveats of the other arithmetic functions mentioned throughout can be found in the documentation (IEEE Computer Society 2008, §5 and §9).
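The rounding behavior described above can be reproduced with Python's `decimal` module, which allows a radix-10 context with a chosen precision. The following is a minimal sketch assuming 7-digit precision and round-half-even; the operands are illustrative values chosen here, not the ones used in equations (1)-(4).

```python
from decimal import Context, Decimal, Inexact, ROUND_HALF_EVEN

# A 7-digit radix-10 context, and a wide context standing in for exact arithmetic.
ctx7 = Context(prec=7, rounding=ROUND_HALF_EVEN)
exact_ctx = Context(prec=50)

# Illustrative operands (assumptions, not taken from the article).
a = Decimal("123456.7")
b = Decimal("101.7654")
tiny = Decimal("0.009876543")

exact_sum = exact_ctx.add(a, b)   # all digits of the true sum are kept
rounded_sum = ctx7.add(a, b)      # the true sum rounded to 7 significant digits
absorbed = ctx7.add(a, tiny)      # tiny lies below the last kept digit of a

print(exact_sum, rounded_sum, absorbed)
print(ctx7.flags[Inexact])        # the context records that rounding occurred
```

In the last addition the smaller operand contributes nothing to the 7-digit result, so the sum equals the larger operand unchanged.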

Some distinction is made between floating-point operations which are arithmetic in nature and those which are algebraic/trigonometric: Operations of the latter variety typically fall under the heading of floating-point algebra.

Arithmetic, Biased Exponent, Floating-Point Algebra, Floating-Point Exponent, Floating-Point Normal Number, Floating-Point Number, Floating-Point Preferred Exponent, Floating-Point Quantum, Floating-Point Representation, IEEE 754-2008, Interval Arithmetic, NaN, Quiet NaN, Signaling NaN, Significand, Subnormal Number

This entry contributed by Christopher Stover

## References

Goldberg, D. "What Every Computer Scientist Should Know About Floating-Point Arithmetic." ACM Comput. Surv. 23, 5-48, March 1991. http://docs.sun.com/source/806-3568/ncg_goldberg.html.

Hauser, J. R. "Handling Floating-Point Exceptions in Numeric Programs." ACM Trans. Program. Lang. Sys. 18, 139-174, 1996. http://www.jhauser.us/publications/HandlingFloatingPointExceptions.html.

IEEE Computer Society. "IEEE Standard for Floating-Point Arithmetic: IEEE Std 754-2008 (Revision of IEEE Std 754-1985)." 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4610935.

Severance, C. (Ed.). "IEEE 754: An Interview with William Kahan." Computer, 114-115, Mar. 1998.

Stevenson, D. "A Proposed Standard for Binary Floating-Point Arithmetic: Draft 8.0 of IEEE Task P754." IEEE Comput. 14, 51-62, 1981.

## Cite this as:

Stover, Christopher. "Floating-Point Arithmetic." From MathWorld--A Wolfram Web Resource, created by Eric W. Weisstein. https://mathworld.wolfram.com/Floating-PointArithmetic.html