SIMD & MIMD Architectures
Deep dive into Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) architectures, including vector processing, memory models, and cache coherence.
SIMD Architecture
SIMD Processing
Array Processors
Array processors and vector processors are closely related SIMD machines: an array processor uses many processing elements operating in lockstep, while a vector processor (such as the Cray-1, 1976) pipelines operations over vector registers. Other examples include the Thinking Machines CM-2 and modern GPU compute units. Both excel at regular, data-parallel computations.
```assembly
; Vector processor pseudo-code (Cray-1 style vector operations)
VLOAD  V1, A       ; Load vector A into V1 (64 elements)
VLOAD  V2, B       ; Load vector B into V2
VADD   V3, V1, V2  ; V3 = V1 + V2 (64 adds in parallel)
VSTORE V3, C       ; Store result to C

; Chaining: pipeline vector operations
VMUL   V4, V3, V5  ; Multiply result with another vector
                   ; Results flow directly between vector units
```
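The same computation can be sketched in C as a plain element-wise loop, which modern compilers auto-vectorize into SIMD instructions when optimization is enabled (the function name `vadd` is ours for illustration):

```c
#include <stddef.h>

/* Element-wise vector add: the C analogue of the VADD sequence above.
 * With optimization enabled (e.g. gcc -O2), compilers typically
 * auto-vectorize this loop into packed SIMD add instructions. */
void vadd(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];  /* n independent adds: data-parallel */
}
```

Each iteration is independent, which is exactly the property SIMD hardware exploits.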
MIMD Architecture
MIMD Processing
Shared Memory MIMD
All processors share a single address space. Communication is implicit via shared variables. Synchronization uses locks, semaphores, and barriers.
```c
// Shared-memory MIMD work sharing (POSIX threads):
// all threads run the same code, claiming work items
// from a shared counter protected by a mutex.
#include <pthread.h>

#define N 1000
void process(int work);                // application-specific work

int counter = 0;                       // shared work index
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mutex);
        if (counter >= N) {
            pthread_mutex_unlock(&mutex);
            break;
        }
        int work = counter++;          // claim the next work item
        pthread_mutex_unlock(&mutex);
        process(work);                 // each CPU works independently
    }
    return NULL;
}
// All CPUs run the same code but on different data
```
Memory Models in MIMD
UMA (Uniform Memory Access)
All processors share the same memory with equal access latency, as in bus-based SMP systems.
Advantages
- ✓ Simple programming model
- ✓ Cache coherence is easy
- ✓ Low latency
Limitations
- ✗ Limited scalability (bus contention)
- ✗ Few processors (typically 2-8)
Cache Coherence & MESI Protocol
Cache Coherence Problem
When multiple cores cache the same memory line, a write by one core can leave stale copies in the others' caches. Coherence protocols such as MESI keep all cached copies consistent.
MESI Protocol State Machine
```assembly
// MESI protocol transitions
// Core 0 reads address X:  Core 0: I → E (Exclusive)
// Core 1 reads same X:     Core 0: E → S (Shared)
//                          Core 1: I → S (Shared)
// Core 0 writes to X:      Core 0: S → M (Modified)
//                          Core 1: S → I (Invalidate)
// Core 1 reads X again:    Core 0: M → S (Writeback + Share)
//                          Core 1: I → S (Shared)
```
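These transitions can be sketched as a tiny state function for one cache line in one core's cache. This is an illustrative simplification, not a real coherence controller: the names `mesi_next`, the event enum, and the `others_have_copy` flag are our own.

```c
/* Minimal MESI sketch for one cache line in one core's cache.
 * Events are what this core does (LOCAL_*) or what it snoops
 * from another core on the bus (REMOTE_*). */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy) {
    switch (e) {
    case LOCAL_READ:   /* a miss fills E (no sharers) or S (sharers exist) */
        return (s == INVALID) ? (others_have_copy ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:  /* gain ownership; other copies get invalidated */
        return MODIFIED;
    case REMOTE_READ:  /* M writes back first, then both caches share */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE: /* another core's write invalidates our copy */
        return INVALID;
    }
    return s;
}
```

Walking the example above through this function reproduces each listed transition: I → E on the first read, E → S when the other core reads, S → M on the local write, and M → S (after writeback) when the other core reads again.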
Code Example: MIMD with OpenMP
```c
// OpenMP: shared-memory MIMD on a multi-core CPU
#include <omp.h>

#define N 1000000

double sum_array(double *arr) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += arr[i];   // each thread processes part of the array
    }
    return sum;
}
// 4 cores → 4 threads → ~4x speedup
// Each thread: different instructions, different data
// True MIMD parallelism
```
SIMD vs MIMD: Comparison
| Property | SIMD | MIMD |
|---|---|---|
| Control | Single (one instruction) | Multiple (per processor) |
| Data | Multiple elements | Multiple streams |
| Synchronization | Inherent (lockstep) | Explicit (locks/barriers) |
| Programming | Data-parallel languages | Any model (shared/message) |
| Hardware cost | Lower (shared CU) | Higher (per-processor logic) |
| Scalability | Limited by vector width | Hundreds-thousands of cores |
| Best for | Regular data parallelism | Irregular/task parallelism |
| Examples | GPU, vector processors | Multi-core CPU, clusters |
Interview Questions
What is the difference between SIMD and MIMD architectures?
SIMD uses a single control unit to broadcast one instruction to multiple processing elements, each operating on different data in lockstep. MIMD allows each processor to fetch and execute its own instruction stream on its own data. SIMD is simpler and efficient for data-parallel workloads; MIMD is more flexible for general-purpose parallel computing.
Explain UMA, NUMA, and CC-NUMA memory models.
UMA: all processors access shared memory with uniform latency. Simple but limited scalability. NUMA: each processor has local memory with fast access; remote access is slower. Scales better but programming is harder. CC-NUMA: NUMA with hardware cache coherence that is transparent to software; it combines the scalability of NUMA with the programmability of UMA.
What is the MESI cache coherence protocol?
MESI has four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, may be shared), Invalid (not in cache). When a processor writes, it invalidates other copies. State transitions are managed by a snooping cache controller watching the bus.
How do array processors differ from conventional SIMD?
Array processors are dedicated SIMD machines with many simple processing elements (PEs) controlled by one control unit. Conventional SIMD (SSE/AVX) extends a general-purpose CPU with vector registers and instructions. Array processors scale to hundreds of PEs; CPU SIMD is limited by vector register width (128-512 bits).
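CPU SIMD of the SSE/AVX kind can be sketched with GCC/Clang vector extensions, where a single C addition compiles to one packed SIMD instruction. This is compiler-specific, not standard C; the `v4sf` typedef follows the naming used in the GCC manual:

```c
/* Four floats packed into one 128-bit vector register (SSE width).
 * The + below compiles to a single packed SIMD add (e.g. addps on x86). */
typedef float v4sf __attribute__((vector_size(16)));

v4sf add4(v4sf a, v4sf b) {
    return a + b;  /* one instruction, four element-wise adds */
}
```

This illustrates the interview answer directly: the parallelism is bounded by the vector register width (four floats in 128 bits), whereas an array processor would scale the same operation across hundreds of processing elements.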