🔢 Floating Point Arithmetic

IEEE 754 is the standard for representing and computing with real numbers in binary. It defines single precision (32-bit) and double precision (64-bit) formats.

IEEE 754 Single Precision Format

ℹ️ The Standard

IEEE 754 is the universal standard for floating-point computation. Single precision uses 32 bits: 1 sign bit, 8 exponent bits (biased by 127), and 23 mantissa bits (with implicit leading 1 for normalized numbers).

32-bit Floating Point Layout

Field                     Width     Bits
Sign (S)                  1 bit     bit 31
Exponent (E)              8 bits    bits 30-23
Mantissa / Fraction (M)   23 bits   bits 22-0

Value = (-1)^S × 2^(E-127) × 1.M  |  Bias = 127
Range: ±1.18×10⁻³⁸ to ±3.4×10³⁸  |  Precision: ~7 decimal digits
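
To make the layout concrete, here is a minimal sketch (the helper name unpack_fields is made up for this example) that pulls the three fields out of the raw 32-bit pattern with shifts and masks:

python

import struct

def unpack_fields(x: float):
    """Extract sign, biased exponent, and mantissa bits from a 32-bit float."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]   # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1        # bit 31
    exponent = (bits >> 23) & 0xFF       # bits 30-23 (biased by 127)
    mantissa = bits & 0x7FFFFF           # bits 22-0 (fraction; hidden 1 not stored)
    return sign, exponent, mantissa

print(unpack_fields(42.5))   # (0, 132, 2752512): (-1)^0 × 2^(132-127) × 1.0101010... = 42.5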

IEEE 754 Conversion Example: 42.5

Full 32-bit binary: 0 10000100 01010100000000000000000  (sign | exponent | mantissa)
Hex: 0x422A0000
Sign = 0, Stored exponent = 132, Actual exponent = 132 - 127 = 5
Normalized representation: (-1)^0 × 2^(5) × 1.01010100... = 42.5
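
Going the other way, a short sketch with Python's struct module decodes the hex pattern above back into the value it encodes:

python

import struct

# Reinterpret the 32-bit pattern 0x422A0000 as a single-precision float
value = struct.unpack('>f', (0x422A0000).to_bytes(4, 'big'))[0]
print(value)   # 42.5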

Special Values

Value        Hex          Sign   Exp   Mantissa     Meaning
+0           0x00000000   0      00    000...000    Zero
-0           0x80000000   1      00    000...000    Zero
+∞           0x7F800000   0      FF    000...000    Infinity
-∞           0xFF800000   1      FF    000...000    Infinity
NaN          0x7FC00000   0/1    FF    100...000    Not a Number
Max Normal   0x7F7FFFFF   0      FE    111...111    Normal
Min Normal   0x00800000   0      01    000...000    Normal
Denorm Min   0x00000001   0      00    000...001    Denormalized
⚠️ Special Case Rules

  • Zero (E=0, M=0): Signed zero exists (+0 and -0)
  • Infinity (E=255, M=0): Represents overflow or division by zero
  • NaN (E=255, M≠0): Result of invalid operations (0/0, ∞-∞)
  • Denormalized (E=0, M≠0): Gradual underflow; implicit leading 0 instead of 1
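
These special cases can be inspected directly from Python; the sketch below (using only the standard math and struct modules, with an illustrative helper bits_of) packs each value into single precision and prints its bit pattern:

python

import math
import struct

def bits_of(x: float) -> str:
    """Hex of the 32-bit single-precision pattern of x (illustrative helper)."""
    return '0x' + struct.pack('>f', x).hex().upper()

print(bits_of(0.0), bits_of(-0.0))              # 0x00000000 0x80000000 (signed zeros)
print(bits_of(math.inf), bits_of(-math.inf))    # 0x7F800000 0xFF800000
print(bits_of(float('nan')))                    # a NaN pattern, typically 0x7FC00000
print(math.isnan(math.inf - math.inf))          # True: ∞ - ∞ is an invalid operation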

Floating Point Addition

Adding two floating-point numbers requires aligning exponents, adding significands, then normalizing the result.

Addition Steps

1. Align: shift the significand of the number with the smaller exponent right until both exponents match
2. Add: add the significands (including the hidden leading 1)
3. Normalize: shift the result and adjust the exponent until it has the 1.M form
4. Round: round the result to fit the 23-bit mantissa
5. Check: check for exponent overflow or underflow
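
As a rough illustration of these five steps, here is a simplified sketch of a single-precision adder (it only handles positive, normalized inputs and truncates instead of rounding; the name fp_add_single is made up for this example):

python

import struct

def fp_add_single(a: float, b: float) -> float:
    """Simplified IEEE 754 single-precision addition (positive, normalized inputs only)."""
    def fields(x):
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        # biased exponent and significand with the hidden leading 1 restored
        return (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | (1 << 23)

    ea, ma = fields(a)
    eb, mb = fields(b)

    # Step 1: Align - shift the significand with the smaller exponent to the right
    if ea < eb:
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= (ea - eb)

    # Step 2: Add significands
    e, m = ea, ma + mb

    # Step 3: Normalize - restore the 1.M form, adjusting the exponent
    while m >= (1 << 24):
        m >>= 1          # Step 4 (rounding) is skipped: bits shifted out are truncated
        e += 1

    # Step 5: overflow/underflow checks are omitted in this sketch
    bits = (e << 23) | (m & 0x7FFFFF)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(fp_add_single(42.5, 3.75))   # 46.25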

Example: 42.5 + 3.75

42.5 = 1.010101 × 2⁵ → E=132(5+127), M=0101010...

3.75 = 1.111 × 2¹ → align: shift right 4 → 0.0001111 × 2⁵

Sum: 1.010101 + 0.0001111 = 1.0111001 × 2⁵

Result: 0 10000100 0111001000... = 46.25 (hex: 0x42390000)
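
For a quick sanity check, packing the native result with struct reproduces the bit pattern on the last line above:

python

import struct

print('0x' + struct.pack('>f', 42.5 + 3.75).hex().upper())   # 0x42390000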

Code Example

python

import struct

def float_to_ieee754(f: float) -> dict:
    """Convert a Python float to IEEE 754 single precision representation."""
    bits = struct.pack('>f', f)
    binary = ''.join(f'{b:08b}' for b in bits)
    hex_str = bits.hex().upper()
    
    sign = int(binary[0])
    exp = int(binary[1:9], 2)
    mant = binary[9:]
    
    if exp == 0 and int(mant, 2) == 0:
        desc = "Zero"
    elif exp == 255 and int(mant, 2) == 0:
        desc = "Infinity"
    elif exp == 255:
        desc = "NaN"
    elif exp == 0:
        desc = f"Denormalized: (-1)^{sign} × 2^{-126} × 0.{mant}"
    else:
        desc = f"Normal: (-1)^{sign} × 2^{exp-127} × 1.{mant}"
    
    return {
        "sign": sign, "exponent": exp, "mantissa": mant,
        "binary": binary, "hex": f"0x{hex_str}", "description": desc
    }

# Example
print(float_to_ieee754(-3.14))
# Output: sign=1, exponent=128, mantissa=100100011110...

Interview Questions

What is the IEEE 754 bias and why is it used?

The exponent field in IEEE 754 uses a bias of 127 (single precision) or 1023 (double precision) so that both negative and positive exponents can be stored as an unsigned value, with no separate exponent sign bit. Because the stored exponent grows with the true exponent, same-signed floating-point numbers can be compared by treating their bit patterns as integers. The actual exponent is the stored exponent minus the bias.
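
A small sketch of that comparison property (the helper as_uint32 is just for illustration): for positive floats, the raw bit patterns sort in the same order as the values.

python

import struct

def as_uint32(x: float) -> int:
    """Raw single-precision bit pattern of x as an unsigned integer (illustrative helper)."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

# Larger positive float -> larger bit pattern, so integer comparison works
print(as_uint32(1.5) < as_uint32(2.5) < as_uint32(1e20))   # True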

Explain the difference between normalized and denormalized numbers.

Normalized numbers have an implicit leading 1 before the binary point (1.M), while denormalized numbers have an implicit leading 0 (0.M). Denormals allow gradual underflow to zero, filling the gap between the smallest normal number and zero. They trade precision for the ability to represent extremely small values.
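
Gradual underflow can be observed by round-tripping small values through single precision; a brief sketch (printed digits may vary slightly with formatting):

python

import struct

def to_single(x: float) -> float:
    """Round-trip x through 32-bit single precision (illustrative helper)."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(to_single(1.18e-38))   # near the smallest normal single (~2^-126)
print(to_single(1e-40))      # still nonzero: stored as a denormalized number
print(to_single(1e-46))      # 0.0: below the smallest denormal (~1.4e-45), underflows to zero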

What causes floating point precision errors and how can they be mitigated?

Floating point errors occur because not all decimal fractions have exact binary representations (e.g., 0.1). Operations accumulate rounding errors. Mitigation: use double precision, avoid equality comparisons, use epsilon tolerances, consider decimal arithmetic (e.g., Python's Decimal module) for financial calculations.
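
The classic demonstration, as a short sketch using only the standard library:

python

from decimal import Decimal
import math

print(0.1 + 0.2)                        # 0.30000000000000004: 0.1 has no exact binary form
print(0.1 + 0.2 == 0.3)                 # False: avoid equality comparisons on floats
print(math.isclose(0.1 + 0.2, 0.3))     # True: compare with a tolerance instead
print(Decimal('0.1') + Decimal('0.2'))  # 0.3: exact decimal arithmetic, e.g. for money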