🚀 Pipeline Optimization
Advanced techniques to minimize pipeline stalls: branch prediction, superscalar execution, out-of-order processing, and register renaming.
Branch Prediction
The Branch Problem
📊 Static Prediction
Always predict taken / not-taken based on opcode.
~60-70% accuracy📈 Dynamic Prediction
Uses branch history table (BHT) to predict based on past behavior.
~90-95% accuracy🤖 Advanced Predictors
Hybrid predictors combine local + global history (e.g., Tournament predictor).
~96-98% accuracy2-Bit Saturating Counter
Predict taken if last branch was taken, else not taken.
Four states: Strongly Taken, Weakly Taken, Weakly Not Taken, Strongly Not Taken. Needs 2 mispredictions to flip.
More hysteresis, but diminishing returns. Rarely used beyond 2-3 bits.
Branch Target Buffer (BTB)
BTB
BTB Entry Fields
- PC of branch: tag for lookup
- Target address: predicted next PC
- Prediction bits: 2-bit saturating counter
- Type: conditional / unconditional
BTB Operation
- IF stage: Lookup PC in BTB
- BTB hit: Predict target address
- BTB miss: Use PC+4 (sequential)
- Resolution: Update BTB on mispredict
Superpipeline & Superscalar
🔬 Superpipeline
Divide pipeline into finer stages (e.g., 10-20 stages vs 5). Each stage does less work, allowing higher clock frequency.
- • More stages = higher clock frequency
- • More pipeline registers = higher latency
- • More stages = more hazards
- • Example: MIPS R4000 had 8 stages
⚡ Superscalar
Multiple instructions issued per cycle using parallel functional units (e.g., 2 ALUs, 1 FPU, 1 load/store).
- • Multiple pipelines in parallel
- • IPC > 1 (instructions per cycle)
- • Requires wide instruction fetch/decode
- • Example: Intel Core i7 (4-6 wide)
Superscalar Execution (2-issue)
Out-of-Order Execution
OoO Execution
Example: Hiding Load Latency
assembly
// In-order: stalls at LW LW R1, 0(R2) ; 3-cycle latency ADD R3, R4, R5 ; stalled! waiting for R1 SUB R6, R7, R8 ; also stalled // Out-of-order: SUB executes while LW waits LW R1, 0(R2) ; long-latency load starts SUB R6, R7, R8 ; executes immediately (no dep) ADD R3, R4, R5 ; also executes now // LW completes, ADD commits in order
Register Renaming
Renaming
Before Renaming
WAR hazard detected!
assembly
ADD R1, R2, R3 ; reads R2 SUB R2, R4, R5 ; writes R2 - WAR! // Name dependency: different architectural reg
After Renaming
WAR eliminated!
assembly
// Physical regs: P1→R1, P2→R2, P3→R2_new ADD P1, P2, P4 ; reads P2 SUB P3, P4, P5 ; writes P3 (not P2!) // No dependency - both execute in parallel
Speculative Execution
Speculation
assembly
// Speculative execution example BEQZ R1, target ; branch (resolves in 3 cycles) // Pre-branch - always executed ADD R2, R3, R4 // Speculatively executed (predicted taken) target: LW R5, 0(R6) ; speculative load SUB R7, R8, R9 ; speculative sub // Stored in ROB, not committed yet // If mispredicted: ROB flushed, pipeline redirected // Penalty: ~15-20 cycles on modern CPUs
Code Example: Performance Comparison
python
// Pipeline comparison: scalar vs superscalar
// n = 8 instructions, k = 5 stages
def pipeline_cycles(n, k, issue_width):
"""Calculate cycles for n instructions"""
return (n + k - 1) / issue_width
scalar = pipeline_cycles(8, 5, 1)
super2 = pipeline_cycles(8, 5, 2)
super4 = pipeline_cycles(8, 5, 4)
print(f"Scalar (1-wide): {scalar} cycles = CPI 1.0")
print(f"Superscalar (2-wide): {super2} cycles = CPI 0.5")
print(f"Superscalar (4-wide): {super4} cycles = CPI 0.25")
print(f"Speedup (4-wide): {scalar/super4:.0f}x")Interview Questions
How does a 2-bit saturating counter work for branch prediction?
A 2-bit saturating counter has four states: Strongly Not Taken (00), Weakly Not Taken (01), Weakly Taken (10), Strongly Taken (11). When a branch is taken, the counter increments (saturating at 11). When not taken, it decrements (saturating at 00). Two mispredictions are needed to flip the prediction, providing hysteresis against loop-termination branches.
What is the difference between superpipeline and superscalar?
Superpipeline divides each stage into finer substages, increasing clock frequency. Superscalar replicates functional units to issue multiple instructions per cycle. A superpipelined processor has a deeper pipeline (more stages). A superscalar processor has wider pipelines (more parallel units). Many modern CPUs use both.
How does register renaming eliminate WAR and WAW hazards?
Register renaming maps architectural registers to a larger pool of physical registers. Each instruction that writes to a register gets a new unique physical register. The register map table (RMT) tracks the latest mapping. Since no two in-flight instructions share the same physical destination register, false dependencies (WAR, WAW) are eliminated.
What happens during a branch misprediction in a superscalar processor?
The mispredicted branch and all subsequent speculatively executed instructions must be flushed from the pipeline. The reorder buffer (ROB) entries are cleared, the register map table is restored to the state before the branch, and the correct target address is fetched. Modern CPUs incur a 15-20 cycle penalty for mispredictions.