Fixed Point Arithmetic
When we think of how to represent fractional numbers in code, we reach for `double` and `float`, and almost never reach for anything else. There are several alternatives, including the constructive real numbers used in calculators, and rational numbers. One alternative predates all of these, including floating point, and actually allows you to compute faster than you can with floating point numbers. That alternative is fixed point: a primitive way of representing non-integer numbers that offers none of the conveniences of `float`, but allows you to do those computations more quickly and efficiently. Fixed point is still used in some situations today, and it can be a potent tool in your arsenal as a programmer if you find yourself working with math at high speed.
Introduction to Fixed Point Numbers
Fixed point numbers are conceptually simple: they are like integers, except that the decimal point sits somewhere other than after the rightmost bit. Fixed point numbers are stored in exactly the same format as integers. Mathematically, a fixed point number has the following value:
$$ N = I \cdot 2^{-k} $$
where $I$ is the integer value of the number and $k$ is a constant number of bits that are to the right of the decimal point. This allows you to represent non-integers (positive $k$) and large numbers (negative $k$) with the same precision, dynamic range, and storage format as an integer.
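For example, here is a minimal sketch in C of moving between `double` and a fixed point value with a chosen $k$ (the helper names are illustrative, not from any particular library):

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical helpers: a 16-bit unsigned fixed point value with k fractional
// bits, for 0 <= k <= 15.
static uint16_t to_fixed(double x, int k)     { return (uint16_t)(x * (1 << k)); }
static double   from_fixed(uint16_t i, int k) { return (double)i / (1 << k); }

int main(void) {
    uint16_t a = to_fixed(1.5, 8);                           // 0x0180
    printf("%#06x -> %f\n", (unsigned)a, from_fixed(a, 8));  // 0x0180 -> 1.500000
    return 0;
}
```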
In comparison, floating point numbers are much more complicated. Here is how a double-precision floating point number relates to the bits that are stored:
$$ N = (-1)^S \cdot 2^{E - 1023} \cdot M $$
where $S$ is a sign bit, $E$ is an 11-bit exponent, and $M$ is the mantissa: a 53-bit value made up of the 52 stored fraction bits plus an implied leading 1 or 0 bit (depending on the value of the exponent), read with the binary point just after that leading bit. Fixed point is much simpler!
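If you want to see those fields for yourself, here is a small sketch (assuming an IEEE 754 `double`, and using `memcpy` to reinterpret its bits) that pulls out $S$, $E$, and the stored fraction:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    double x = 1.5;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          // reinterpret the double's 64 bits

    uint64_t S = bits >> 63;                 // sign bit
    uint64_t E = (bits >> 52) & 0x7FF;       // 11-bit biased exponent
    uint64_t F = bits & 0xFFFFFFFFFFFFFull;  // 52 stored fraction bits

    // For 1.5: S = 0, E = 1023 (unbiased exponent 0), F = 0x8000000000000,
    // so M = 1.1 in binary = 1.5.
    printf("S=%llu E=%llu F=%#llx\n",
           (unsigned long long)S, (unsigned long long)E, (unsigned long long)F);
    return 0;
}
```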
Examples of Fixed Point Numbers
Here are some examples of numbers represented in 16-bit unsigned fixed point:
| Number | Fractional Bits ($k$) | Hex | Hex with Decimal Point |
|---|---|---|---|
| $1.5$ | 1 | 0003 | 0001 $.$1 |
| $1.5$ | 4 | 0018 | 001 $.$8 |
| $1.5$ | 8 | 0180 | 01 $.$80 |
| $0.75$ | 2 | 0003 | 0000 $.$3 |
| $0.75$ | 8 | 00C0 | 00 $.$C0 |
| $0.75$ | 16 | C000 | $0.$C000 |
| $2^{-16}$ | 16 | 0001 | $0.$0001 |
You can also use fixed point numbers to imply a number of leading or trailing zeros by using large values of $k$ (to get leading zeros) or negative values of $k$ (to get trailing zeros):
| Number | Fractional Bits ($k$) | Hex | Hex with Decimal Point |
|---|---|---|---|
| $2^{-32}$ | 32 | 0001 | $0.0000$ 0001 |
| $2^{-16}$ | 32 | 0100 | $0.0000$ 0100 |
| $2^{16}$ | -16 | 0001 | 0001 $0000.0$ |
| $2^{16}$ | -8 | 0100 | 0100 $00.0$ |
Here are a few numbers that can’t be exactly represented in 16-bit unsigned fixed point with a given number of fractional bits:
| Number | Fractional Bits ($k$) | Problem |
|---|---|---|
| $1.25$ | 1 | Between 0002 $(1.0)$ and 0003 $(1.5)$ |
| $1.5$ | 16 | Overflow, greater than FFFF |
| $2^{-32}$ | 16 | Between 0000 $(0)$ and 0001 $(2^{-16})$ |
Fixed point numbers can also be signed using two’s complement, like signed integers. For example:
| Number | Fractional Bits ($k$) | Hex | Hex with Decimal Point |
|---|---|---|---|
| $1.5$ | 4 | 0018 | 001 $.$8 |
| $-1.5$ | 4 | FFE8 | FFE $.$8 |
| $1.75$ | 8 | 01C0 | 01 $.$C0 |
| $-1.75$ | 8 | FE40 | FE $.$40 |
| $0.75$ | 15 | 6000 | 0 $.$6000 |
| $-0.75$ | 15 | A000 | 1 $.$2000 |
Fixed Point Operations
Fixed point numbers can be used to compute every usual arithmetic operation efficiently, using no more than one integer operation and one constant bit shift. However, unlike floating point, fixed point has no dedicated hardware support on modern server CPUs, and no batteries are included!
Addition and Subtraction
Fixed point addition and subtraction are exactly like integer addition and subtraction if the numbers have the same number of fractional bits, but a shift is required to add and subtract numbers with different $k$. Here is an example with the same number of fractional bits (using subscripts to show the number of fractional bits):
| 01 $.$80 | Integer ADD | 01 $.$40 | Result: | 02 $.$C0 |
|---|---|---|---|---|
| $1.5_8$ | $+$ | $1.25_8$ | $=$ | $2.75_8$ |
Here is an example of addition without justification to align the decimal point:
| 01 $.$80 | Integer ADD | 001 $.$4 | Result: | 01 $.$94 / 019 $.$4 |
|---|---|---|---|---|
| $1.5_8$ | $+$ | $1.25_4$ | $\neq$ | $1.578125_8$ or $25.25_4$ |
The result of the addition is not correct without shifting to justify. The correct process looks like this:
| 01 $.$80 | Right shift 4 | 001 $.$8 | Integer ADD | 001 $.$4 | Result: | 002 $.$C |
|---|---|---|---|---|---|---|
| $1.5_8$ | Convert to $x_4$ | $1.5_4$ | $+$ | $1.25_4$ | $=$ | $2.75_4$ |
This means that adding two fixed point numbers is either just a simple integer addition or a constant shift and an add. That is much faster than a floating point addition. It also means that when the numbers have the same number of fractional bits, addition is associative, so operations can be re-ordered freely without affecting the result.
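Here is a sketch of both cases in C, using a hypothetical 16-bit unsigned format and writing the alignment shift out explicitly:

```c
#include <stdint.h>

// Same k on both sides: fixed point addition is just an integer ADD
// (overflow is ignored here for brevity).
static uint16_t fixadd_same_k(uint16_t a, uint16_t b) {
    return (uint16_t)(a + b);
}

// Different k: shift the operand with more fractional bits (ka >= kb) right
// to align the points, then add. The result has kb fractional bits.
static uint16_t fixadd_mixed(uint16_t a, int ka, uint16_t b, int kb) {
    return (uint16_t)((a >> (ka - kb)) + b);
}

// Matches the tables above:
//   fixadd_same_k(0x0180, 0x0140)      == 0x02C0   (1.5_8 + 1.25_8 = 2.75_8)
//   fixadd_mixed(0x0180, 8, 0x0014, 4) == 0x002C   (1.5_8 + 1.25_4 = 2.75_4)
```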
Subtraction is much the same as addition. The two operands need to be aligned to have the same number of fractional bits, but once that occurs, you can use an integer subtraction operation. Also, negative numbers work just fine:
| 01 $.$80 | Integer SUB | 02 $.$40 | Result: | FF $.$40 |
|---|---|---|---|---|
| $1.5_8$ | $-$ | $2.25_8$ | $=$ | $-0.75_8$ |
Pitfalls when Justifying Fixed Point Numbers
Shifting right to remove fractional bits can result in loss of precision:
| 01 $.$88 | Right shift 4 | 001 $.$8 |
|---|---|---|
| $1.53125_8$ | Inexact conversion to $x_4$ | $1.5_4$ |
Additionally, a left shift to add fractional bits can result in an overflow:
| 100 $.$8 | Left shift 4 | 00 $.$80 |
|---|---|---|
| $256.5_4$ | Overflow converting to $x_8$ | $0.5_8$ |
The loss of precision on a right shift is similar to what happens when adding floating point numbers of different magnitudes. The possibility of overflow, however, has no floating point counterpart. Using fixed point numbers requires careful analysis of the ranges and required precision of the values involved to avoid the risk of overflow.
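Both pitfalls are easy to reproduce; here is a quick sketch of the two conversions from the tables above:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Right shift loses precision: 1.53125 (k=8) becomes 1.5 (k=4).
    uint16_t a  = 0x0188;              // 1.53125 with k = 8
    uint16_t a4 = (uint16_t)(a >> 4);  // 0x0018 = 1.5 with k = 4; the trailing 8 is gone

    // Left shift overflows: 256.5 (k=4) does not fit in 16 bits with k = 8.
    uint16_t b  = 0x1008;              // 256.5 with k = 4
    uint16_t b8 = (uint16_t)(b << 4);  // 0x0080 = 0.5 with k = 8; the top bit fell off

    printf("%#06x -> %#06x, %#06x -> %#06x\n",
           (unsigned)a, (unsigned)a4, (unsigned)b, (unsigned)b8);
    return 0;
}
```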
Multiplication
Multiplication and division behave differently: when two fixed point numbers are multiplied, the integer components are multiplied as usual, but the result has a different $k$ value than the inputs. For multiplication, the result has $k = k_1 + k_2$, and is computed using an integer multiplication instruction. For example, if you multiply two numbers with 4 fractional bits each, the result has 8 fractional bits, and it must be shifted back if you want a result with 4 fractional bits.
This means that multiplying a fixed point number by an integer results in a fixed point number with the same number of fractional bits:
| 01 $.$80 | Integer MUL | 0005 | Result: | 07 $.$80 |
|---|---|---|---|---|
| $1.5_8$ | $\times$ | $5_{INT}$ | $=$ | $7.5_8$ |
The full result of an integer multiplication has twice as many bits as its inputs. ARM allows you to access the high part of the multiplication using the `SMULH` and `UMULH` instructions, and x86 provides a double-width result from the `MUL` and `MULX` instructions using a pair of result registers. Here is an example where the high bits of the multiplication carry the important information of the result:
| $0.$C000 | Integer MUL | $0.$C000 | Result: | $0.$9000 0000 |
|---|---|---|---|---|
| $0.75_{16}$ | $\times$ | $0.75_{16}$ | $=$ | $0.5625_{32}$ |
These instructions allow you to use register-width fixed point arithmetic. Without access to the high part of the multiplication result, you would be effectively limited to half-width fixed point (a 64-bit CPU would only be able to multiply 32-bit fixed point numbers).
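C has no portable operator for the high half of a product, but GCC and Clang will emit exactly these instructions if you widen through their 128-bit integer extension; a sketch:

```c
#include <stdint.h>

// Multiply two 64-bit unsigned fixed point numbers that both have k = 64
// (values in [0, 1)). The full product has k = 128, so its high 64 bits are
// the k = 64 result. GCC and Clang compile this to one widening multiply
// (UMULH on ARM, MUL or MULX on x86).
static uint64_t fixmul_q64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// The 64-bit analogue of the table above: 0.75 * 0.75 = 0.5625
//   fixmul_q64(0xC000000000000000, 0xC000000000000000) == 0x9000000000000000
```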
In this example, we can see that the information can be split between the high and low halves of the result:
| 01 $.$80 | Integer MUL | 01 $.$80 | Result: | 0002 $.$4000 |
|---|---|---|---|---|
| $1.5_{8}$ | $\times$ | $1.5_{8}$ | $=$ | $2.25_{16}$ |
In cases like this, recovering a number with 8 fractional bits would involve several assembly instructions to shift the registers into the appropriate place and combine them. This would definitely be slower than a floating point multiplication: floating point multiplication is actually relatively easy to compute! That said, some DSPs and embedded processors have a shifter after their multiplier to allow native fixed point multiplication without any register justification problems.
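For the 16-bit numbers used in these tables, the same idea is easier because a 32-bit intermediate holds the whole product; here is a sketch of a multiply that keeps $k$ fixed:

```c
#include <stdint.h>

// Multiply two 16-bit fixed point numbers that share k fractional bits and
// return a result that also has k fractional bits: form the full product in
// 32 bits (which has 2k fractional bits), then shift back down by k.
static uint16_t fixmul16(uint16_t a, uint16_t b, int k) {
    return (uint16_t)(((uint32_t)a * b) >> k);
}

// Example from the table above: 1.5_8 * 1.5_8 = 2.25_8
//   fixmul16(0x0180, 0x0180, 8) == 0x0240
```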
Division
For division, the result has $k = k_1 - k_2$ fractional bits when dividing $N_1$ by $N_2$. Again, dividing by an integer yields a result that has the same number of fractional bits as the input. This is largely similar to multiplication, and again you can use the integer division operation to divide fixed point numbers.
Three examples are below. First, dividing by an integer:
| 01 $.$80 | Integer DIV | 0003 | Result: | 00 $.$80 |
|---|---|---|---|---|
| $1.5_8$ | $\div$ | $3_{INT}$ | $=$ | $0.5_8$ |
Dividing by a number of the same precision:
| 01 $.$80 | Integer DIV | 00 $.$40 | Result: | 0006 |
|---|---|---|---|---|
| $1.5_8$ | $\div$ | $0.25_8$ | $=$ | $6_{INT}$ |
Dividing by a fixed point number with a different number of fractional bits:
| 01 $.$80 | Integer DIV | 000 $.$8 | Result: | 003 $.$0 |
|---|---|---|---|---|
| $1.5_8$ | $\div$ | $0.5_4$ | $=$ | $3_4$ |
Because integer division rounds toward zero, division of fixed point numbers can be dangerous, particularly when the divisor has more fractional bits than the dividend. Here is one case where an integer division will not produce a correct result:
| 01 $.$80 | Integer DIV | 0 $.$400 | Result: | 0000 $0.0$ |
|---|---|---|---|---|
| $1.5_8$ | $\div$ | $0.25_{12}$ | $\neq$ | $0_{-4}$ |
In this case, the result should have been 6 but the result register cannot represent any number that isn’t a multiple of 16. If the result would be a multiple of 16, you actually could get a correct division:
| 08 $.$00 | Integer DIV | 0 $.$400 | Result: | 0002 $0.0$ |
|---|---|---|---|---|
| $8_8$ | $\div$ | $0.25_{12}$ | $=$ | $32_{-4}$ |
If you find yourself working with fixed point numbers, take great care around divisions. Integer division has a nonlinearity due to rounding that can sneak into your code if you are not careful to ensure that the dividend has more fractional bits than the divisor.
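A common way to keep the dividend ahead of the divisor in precision is to widen it and pre-shift it before the integer division; here is a sketch for the 16-bit case:

```c
#include <stdint.h>

// Divide two 16-bit fixed point numbers that share k fractional bits and
// return a result with k fractional bits: widening the dividend and shifting
// it left by k gives it 2k fractional bits, so the quotient has 2k - k = k.
static uint16_t fixdiv16(uint16_t a, uint16_t b, int k) {
    return (uint16_t)(((uint32_t)a << k) / b);
}

// Example: 1.5_8 / 0.25_8 = 6.0_8
//   fixdiv16(0x0180, 0x0040, 8) == 0x0600
```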
Operation Chains
Successful use of fixed point arithmetic often depends on constructing chains of operations that fit together in sequence, and tracking $k$ and the dynamic range of the values in the operation chain to avoid overflows and unnecessary shift operations.
Extremely long operation chains can be found in hardware accelerators and DSP pipelines for RF and audio systems, where they compute functions like Fourier transforms, filters, and more complicated functions like atomic interactions.
Unlike a floating point exponent, $k$ is a property of a variable rather than a stored quantity, so it can take a lot of developer work to keep track of $k$ for each variable and to add shifts where needed. Mathematical computing software can also be used to convert high-level code into fixed point operation pipelines.
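As a toy illustration, here is a short chain with the bookkeeping for $k$ and width carried in comments, which is exactly the kind of tracking the developer (or the code generator) has to do:

```c
#include <stdint.h>

// Compute (a * b + c) / 2 where a and b have k = 8 and c has k = 16,
// returning the result with k = 8. Widths and k are tracked by hand.
static uint16_t chain(uint16_t a, uint16_t b, uint16_t c) {
    uint32_t ab   = (uint32_t)a * b;  // k = 8 + 8 = 16, needs up to 32 bits
    uint32_t sum  = ab + c;           // both operands have k = 16: plain add
    uint32_t half = sum >> 1;         // divide by 2; k stays at 16
    return (uint16_t)(half >> 8);     // shift back down to k = 8
}
```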
Fixed Point in the Wild
Thanks to the proliferation of floating point hardware, fixed point numbers have been relegated to a few niches, including embedded systems, signal processing, and trading systems. However, fixed point numbers can have broad utility if you are interested in computing things as quickly as possible. Fixed point computations use the integer side of a modern CPU, which allows them to take advantage of faster instructions and more available instruction ports.
Financial systems often use a form of decimal fixed point numbers to represent prices. Prices are often manipulated as dollars with two decimal places (integer number of cents) or dollars with four decimal places. Some bond markets use 8 fractional bits, with the minimum increment of price being 1/256th of a dollar. This allows exchanges to have fixed precision across the range of prices: a one-dollar stock has the same price precision as expensive stocks like Google or Berkshire Hathaway. Floating point prices do not have this property: the minimum increment of a floating point number depends on its magnitude. For example, the NASDAQ was recently caught using two base 10 decimal places with 32-bit prices.
Fixed point is also commonly used in hardware accelerators and custom hardware, since it has smaller silicon area than an equivalent floating point calculation and uses less power. Hardware accelerators also have the benefit of custom word sizes, and shifting numbers or adding zeros to a number is completely free in hardware as long as the shift is a fixed size (you can just connect the wires differently). You could argue that neural networks have rediscovered fixed point, since neural network inference now often uses 8- or 32-bit integer arithmetic.
Fixed point can often be found in places where resources are tightly constrained. For example, my division experiment used fixed point numbers to represent quantities between 0 and 1, using 16 and 32 bits behind the decimal place. Audio and DSP systems frequently use fixed point math for calculations. Many embedded DSPs don't even have floating point hardware because fixed point math is so effective in their domain. Interestingly, fixed point is the highest-precision way to represent decimals on a computer while keeping the benefit of reasonably fast math: you get 64 bits of precision compared to 53 for `double`.
Fixed Point in Compilers
When you add the following function to a C program, you will get an optimization that is enabled by the math of fixed point arithmetic:
```c
uint64_t div_three(uint64_t x) {
    return x / 3;
}
```
You would expect this to compile to a division. However, it can be computed a different way, by multiplying by fixed point $\frac{1}{3}$, and the compiler does exactly that! Its assembly output (with `-O3 -march=skylake`) is:
```asm
movabs rax, 0xAAAAAAAAAAAAAAAB
mov    rdx, rdi
mulx   rax, rax, rax
shr    rax
ret
```
The first line of the assembly stores fixed point $\left(\frac{2}{3}\right)_{64}$ (rounded up) in the `RAX` register. The third instruction is the multiplication (the second instruction, `mov rdx, rdi`, sets up an implied operand of the `mulx` instruction). The fourth instruction is a shift that divides by 2, so the result is equal to $\frac{x}{3}$.
The multiplication is by $\frac{2}{3}$ instead of $\frac{1}{3}$ so that the result in `RAX` has maximum precision. $\frac{1}{3}$ can't be exactly represented in fixed point, so the slightly roundabout method of multiplying by $\frac{2}{3}$ and dividing by 2 is required to avoid an error in the least significant bit when a large integer is passed in.
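You can reproduce the same arithmetic in C and check it against an ordinary division (this uses the GCC/Clang `unsigned __int128` extension to stand in for the high half that `mulx` produces; the helper name is mine):

```c
#include <stdint.h>
#include <stdio.h>

// Mirror what the compiler emits: multiply by fixed point (2/3) rounded up,
// keep the high 64 bits of the product, then shift right once more.
static uint64_t div_three_fixed(uint64_t x) {
    unsigned __int128 p = (unsigned __int128)x * 0xAAAAAAAAAAAAAAABull;
    return (uint64_t)(p >> 64) >> 1;   // total shift of 65 bits
}

int main(void) {
    uint64_t tests[] = { 0, 1, 2, 3, 4, 1000000007ull, UINT64_MAX };
    for (int i = 0; i < 7; i++)
        printf("%llu -> %llu (plain division gives %llu)\n",
               (unsigned long long)tests[i],
               (unsigned long long)div_three_fixed(tests[i]),
               (unsigned long long)(tests[i] / 3));
    return 0;
}
```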
Compiler optimizations like this give us the speed benefit of fixed point arithmetic every day. Using an integer division would be several times slower.
Conclusions
Fixed point numbers have some speed benefits on server-scale systems and allow you to do non-integer math on systems without floating point hardware. However, these benefits cost a lot of developer time and effort, so sticking with `float` and `double` is usually the way to go.

Fixed point has its niches, like prices for financial trading and audio processing, and it is a good tool when you want to compute as fast as possible and can give up some dynamic range. There are a few places where fixed point arithmetic can be an important tool, and in those places, it is invaluable.