
đŸŽŦ SIMD & MIMD Architectures

Deep dive into Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) architectures, including vector processing, memory models, and cache coherence.

SIMD Architecture

ℹī¸

SIMD Processing

SIMD architecture uses a single control unit that broadcasts instructions to multiple processing elements (PEs). Each PE has its own register set and executes the same instruction on different data. This is lockstep parallelism.

SIMD Architecture

[Diagram: one control unit broadcasts a single instruction to PE 0–PE 3; each PE applies it to its own data element, computing Ri = A[i] + B[i] in lockstep.]

Array Processors

Array processors and pipelined vector processors are the classic SIMD machines. Examples: the Cray-1 (1976, a vector processor), Thinking Machines CM-2 (an array processor), and modern GPU compute units. They excel at regular, data-parallel computations.

assembly

; Vector processor pseudo-code (Cray-1 style)

VLOAD  V1, A      ; Load vector A into V1 (64 elements)
VLOAD  V2, B      ; Load vector B into V2
VADD   V3, V1, V2 ; V3 = V1 + V2 (64 adds from one instruction)
VSTORE V3, C      ; Store result to C

; Chaining: pipeline vector operations
VMUL   V4, V3, V5 ; Multiply the sum with another vector
                  ; results flow directly between vector units

MIMD Architecture

💡

MIMD Processing

MIMD is the most flexible parallel architecture. Each processor fetches its own instructions and operates on its own data. Processors communicate through shared memory or message passing.

Shared Memory MIMD

All processors share a single address space. Communication is implicit via shared variables. Synchronization uses locks, semaphores, and barriers.

c

// Shared memory MIMD with POSIX threads
#include <pthread.h>

#define N 1000
int counter = 0;                                   // shared work index
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    while (1) {
        pthread_mutex_lock(&mutex);
        if (counter >= N) {
            pthread_mutex_unlock(&mutex);
            break;
        }
        int work = counter++;                      // claim the next item
        pthread_mutex_unlock(&mutex);
        process(work);  // application-defined; each CPU works independently
    }
    return NULL;
}
// All CPUs run the same code but on different data

Memory Models in MIMD

UMA (Uniform Memory Access)

All processors share the same memory with equal access latency. Bus-based SMP systems.

Advantages

  • ✓ Simple programming model
  • ✓ Straightforward cache coherence
  • ✓ Low, uniform latency

Limitations

  • ✗ Limited scalability (bus contention)
  • ✗ Practical only for a few processors (2-8)

Example: dual-core desktop CPU

Cache Coherence & MESI Protocol

⚠ī¸

Cache Coherence Problem

In shared-memory MIMD systems, multiple processors cache the same memory location. When one processor writes to its cache, the others must see the update. Without coherence, stale data is read.

MESI Protocol State Machine

[Diagram: four states — M (Modified), E (Exclusive), S (Shared), I (Invalid) — with transitions labeled Private Read, Snoop, Invalidate, and Writeback.]

assembly

// MESI protocol transitions
// Core 0 reads address X:   I → E (Exclusive)
// Core 1 reads same X:      Core 0: E → S (Shared)
//                            Core 1: I → S (Shared)
// Core 0 writes to X:       Core 0: S → M (Modified)
//                            Core 1: S → I (Invalidate)
// Core 1 reads X again:     Core 0: M → S (Writeback + Share)
//                            Core 1: I → S (Shared)

Code Example: MIMD with OpenMP

c

// OpenMP: Shared memory MIMD on CPU
#include <omp.h>
#define N 1000000

double sum_array(double *arr) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += arr[i];  // Each thread processes part
    }
    return sum;
}

// 4 cores → 4 threads → up to ~4x speedup (memory bandwidth permitting)
// Each thread has its own instruction stream and its own slice of the data
// SPMD style: same program, independently executing streams — MIMD

SIMD vs MIMD: Comparison

| Property        | SIMD                     | MIMD                         |
|-----------------|--------------------------|------------------------------|
| Control         | Single (one instruction) | Multiple (per processor)     |
| Data            | Multiple elements        | Multiple streams             |
| Synchronization | Inherent (lockstep)      | Explicit (locks/barriers)    |
| Programming     | Data-parallel languages  | Any model (shared/message)   |
| Hardware cost   | Lower (shared CU)        | Higher (per-processor logic) |
| Scalability     | Limited by vector width  | Hundreds-thousands of cores  |
| Best for        | Regular data parallelism | Irregular/task parallelism   |
| Examples        | GPU, vector processors   | Multi-core CPU, clusters     |

Interview Questions

What is the difference between SIMD and MIMD architectures?

SIMD uses a single control unit to broadcast one instruction to multiple processing elements, each operating on different data in lockstep. MIMD allows each processor to fetch and execute its own instruction stream on its own data. SIMD is simpler and efficient for data-parallel workloads; MIMD is more flexible for general-purpose parallel computing.

Explain UMA, NUMA, and CC-NUMA memory models.

UMA: all processors access shared memory with uniform latency. Simple but limited scalability. NUMA: each processor has local memory with fast access; remote access is slower. Scales better but programming is harder. CC-NUMA: NUMA with hardware cache coherence, transparent to software — combines scalability of NUMA with programmability of UMA.

What is the MESI cache coherence protocol?

MESI has four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, may be shared), Invalid (not in cache). When a processor writes, it invalidates other copies. State transitions are managed by a snooping cache controller watching the bus.

How do array processors differ from conventional SIMD?

Array processors are dedicated SIMD machines with many simple processing elements (PEs) controlled by one control unit. Conventional SIMD (SSE/AVX) extends a general-purpose CPU with vector registers and instructions. Array processors scale to hundreds of PEs; CPU SIMD is limited by vector register width (128-512 bits).