Architecture PipeliningPipeline Optimization

🚀 Pipeline Optimization

Advanced techniques to minimize pipeline stalls: branch prediction, superscalar execution, out-of-order processing, and register renaming.

Branch Prediction

⚠️

The Branch Problem

Branches (conditional jumps) introduce control hazards. The pipeline doesn't know which instruction to fetch next until the branch resolves in the EX stage. Without prediction, every branch causes a 2-cycle stall.

📊 Static Prediction

Always predict taken / not-taken based on opcode.

~60-70% accuracy

📈 Dynamic Prediction

Uses branch history table (BHT) to predict based on past behavior.

~90-95% accuracy

🤖 Advanced Predictors

Hybrid predictors combine local + global history (e.g., Tournament predictor).

~96-98% accuracy

2-Bit Saturating Counter

1-bit Saturating

2 states • ~80%

Predict taken if last branch was taken, else not taken.

2-bit Saturating

4 states • ~90%

Four states: Strongly Taken, Weakly Taken, Weakly Not Taken, Strongly Not Taken. Needs 2 mispredictions to flip.

3-bit Saturating

8 states • ~92%

More hysteresis, but diminishing returns. Rarely used beyond 2-3 bits.

Branch Target Buffer (BTB)

ℹ️

BTB

The BTB caches the target address of previously taken branches. When a branch is fetched, the BTB is looked up in parallel with instruction fetch. If found, the target PC is used for the next fetch — zero-cycle branch penalty on a hit.

BTB Entry Fields

PC of branch: tag for lookup
Target address: predicted next PC
Prediction bits: 2-bit saturating counter
Type: conditional / unconditional

BTB Operation

IF stage: Lookup PC in BTB
BTB hit: Predict target address
BTB miss: Use PC+4 (sequential)
Resolution: Update BTB on mispredict

Superpipeline & Superscalar

🔬 Superpipeline

Divide pipeline into finer stages (e.g., 10-20 stages vs 5). Each stage does less work, allowing higher clock frequency.

• More stages = higher clock frequency
• More pipeline registers = higher latency
• More stages = more hazards
• Example: MIPS R4000 had 8 stages

⚡ Superscalar

Multiple instructions issued per cycle using parallel functional units (e.g., 2 ALUs, 1 FPU, 1 load/store).

• Multiple pipelines in parallel
• IPC > 1 (instructions per cycle)
• Requires wide instruction fetch/decode
• Example: Intel Core i7 (4-6 wide)

Superscalar Execution (2-issue)

Cycle 1

ADD R1,R2,R3

SUB R4,R5,R6

Cycle 2

LW R7,0(R8)

AND R9,R10,R11

Cycle 3

OR R12,R13,R14

XOR R15,R16,R17

Out-of-Order Execution

💡

OoO Execution

In-order execution stalls when an instruction waits for data. Out-of-order execution allows later independent instructions to execute while earlier ones wait. Instructions are fetched in-order, executed out-of-order, and committed in-order.

🔄

Example: Hiding Load Latency

assembly

// In-order: stalls at LW
LW R1, 0(R2)      ; 3-cycle latency
ADD R3, R4, R5    ; stalled! waiting for R1
SUB R6, R7, R8    ; also stalled

// Out-of-order: SUB executes while LW waits
LW R1, 0(R2)      ; long-latency load starts
SUB R6, R7, R8    ; executes immediately (no dep)
ADD R3, R4, R5    ; also executes now
// LW completes, ADD commits in order

Register Renaming

ℹ️

Renaming

Register renaming eliminates false dependencies (WAR, WAW) by mapping architectural registers to a larger set of physical registers. Each instruction gets a new physical destination register, and the register map table (RMT) tracks the mapping.

Before Renaming

WAR hazard detected!

assembly

ADD R1, R2, R3  ; reads R2
SUB R2, R4, R5  ; writes R2 - WAR!
// Name dependency: different architectural reg

After Renaming

WAR eliminated!

assembly

// Physical regs: P1→R1, P2→R2, P3→R2_new
ADD P1, P2, P4    ; reads P2
SUB P3, P4, P5    ; writes P3 (not P2!)
// No dependency - both execute in parallel

Speculative Execution

⚠️

Speculation

Speculative execution executes instructions before the branch outcome is known. Results are held in a reorder buffer (ROB) and committed only when the branch resolves correctly. Mispredicted instructions are flushed — this is called "pipeline flush" or "branch misprediction penalty."

assembly

// Speculative execution example
BEQZ R1, target    ; branch (resolves in 3 cycles)
// Pre-branch - always executed
ADD R2, R3, R4

// Speculatively executed (predicted taken)
target:
  LW R5, 0(R6)     ; speculative load
  SUB R7, R8, R9   ; speculative sub
  // Stored in ROB, not committed yet

// If mispredicted: ROB flushed, pipeline redirected
// Penalty: ~15-20 cycles on modern CPUs

Code Example: Performance Comparison

python

// Pipeline comparison: scalar vs superscalar
// n = 8 instructions, k = 5 stages

def pipeline_cycles(n, k, issue_width):
    """Calculate cycles for n instructions"""
    return (n + k - 1) / issue_width

scalar = pipeline_cycles(8, 5, 1)
super2 = pipeline_cycles(8, 5, 2)
super4 = pipeline_cycles(8, 5, 4)

print(f"Scalar (1-wide):    {scalar} cycles = CPI 1.0")
print(f"Superscalar (2-wide): {super2} cycles = CPI 0.5")
print(f"Superscalar (4-wide): {super4} cycles = CPI 0.25")
print(f"Speedup (4-wide): {scalar/super4:.0f}x")

Interview Questions

How does a 2-bit saturating counter work for branch prediction?

A 2-bit saturating counter has four states: Strongly Not Taken (00), Weakly Not Taken (01), Weakly Taken (10), Strongly Taken (11). When a branch is taken, the counter increments (saturating at 11). When not taken, it decrements (saturating at 00). Two mispredictions are needed to flip the prediction, providing hysteresis against loop-termination branches.

What is the difference between superpipeline and superscalar?

Superpipeline divides each stage into finer substages, increasing clock frequency. Superscalar replicates functional units to issue multiple instructions per cycle. A superpipelined processor has a deeper pipeline (more stages). A superscalar processor has wider pipelines (more parallel units). Many modern CPUs use both.

How does register renaming eliminate WAR and WAW hazards?

Register renaming maps architectural registers to a larger pool of physical registers. Each instruction that writes to a register gets a new unique physical register. The register map table (RMT) tracks the latest mapping. Since no two in-flight instructions share the same physical destination register, false dependencies (WAR, WAW) are eliminated.

What happens during a branch misprediction in a superscalar processor?

The mispredicted branch and all subsequent speculatively executed instructions must be flushed from the pipeline. The reorder buffer (ROB) entries are cleared, the register map table is restored to the state before the branch, and the correct target address is fetched. Modern CPUs incur a 15-20 cycle penalty for mispredictions.