
⚙️ Multiprocessors

Systems with multiple processors working together. Explore tight vs loose coupling, SMP, NUMA, Amdahl's Law, and parallel performance analysis.

Tightly vs Loosely Coupled Systems

Tightly coupled systems share a single memory that all processors access through a common bus or crossbar (SMP, NUMA): communication is fast, but scalability is limited. Loosely coupled systems give each processor private memory and communicate by message passing (clusters): they scale well, but communication is slower and must be programmed explicitly.

Symmetric Multiprocessing (SMP)

ℹ️

SMP Architecture

In SMP, two or more identical processors share a single memory space via a shared bus. Each processor has equal access to all memory (UMA). All processors run the same OS and can execute any process.

SMP System

[Diagram: four CPUs (CPU 0-3), each with a private L1 + L2 cache, connected through a shared bus / crossbar to shared main memory.]

Non-Uniform Memory Access (NUMA)

💡

NUMA

In NUMA, each processor has its own local memory. Accessing local memory is fast; accessing remote memory (another processor's memory) is slower. NUMA improves scalability over SMP by eliminating bus contention.

Local Access

Memory attached to the same processor node. Fast (e.g., 100ns latency). No interconnect overhead.

Remote Access

Memory attached to another processor node. Slower (e.g., 200-300ns). Traverses the interconnect.

c

// NUMA-aware programming (Linux)
// libnuma: bind memory to specific nodes

#include <numa.h>

void numa_example(void) {
    if (numa_available() < 0)
        return;  // kernel or libnuma NUMA support missing

    // Allocate 1 MiB on the node this thread is running on - fast access
    void *mem = numa_alloc_local(1024 * 1024);

    // Bind this thread to CPU 0 so later local allocations stay local
    struct bitmask *mask = numa_allocate_cpumask();
    numa_bitmask_setbit(mask, 0);     // CPU 0
    numa_sched_setaffinity(0, mask);  // pid 0 = calling thread

    numa_free_cpumask(mask);
    numa_free(mem, 1024 * 1024);
}

Amdahl's Law

ℹ️

Amdahl's Law

Amdahl's Law states that the maximum speedup of a parallel system is limited by the fraction of the program that must be executed serially. Speedup = 1 / (P + (1-P)/N) where P is the serial fraction and N is the number of processors.
Speedup = 1 / (P + (1-P) / N)

P = serial fraction, N = number of processors

💡

Key Insight

Even with infinite processors, speedup is limited to 1/P. If 10% of a program is serial, the maximum speedup is 10x — no matter how many cores you add. This is why optimizing the serial portion is critical.
| Serial % | 2 cores | 4 cores | 8 cores | 16 cores | ∞ cores |
|----------|---------|---------|---------|----------|---------|
| 1%       | 1.98x   | 3.88x   | 7.48x   | 13.91x   | 100.0x  |
| 5%       | 1.90x   | 3.48x   | 5.93x   | 9.14x    | 20.0x   |
| 10%      | 1.82x   | 3.08x   | 4.71x   | 6.40x    | 10.0x   |
| 25%      | 1.60x   | 2.29x   | 2.91x   | 3.37x    | 4.0x    |
| 50%      | 1.33x   | 1.60x   | 1.78x   | 1.88x    | 2.0x    |

Parallel Speedup & Efficiency

ℹ️

Key Metrics

Speedup = T(1) / T(N) — how much faster with N processors. Efficiency = Speedup / N — how well processors are utilized. Linear speedup (speedup = N) means perfect scaling. Sub-linear speedup (speedup < N) indicates overhead or serial bottlenecks.

Linear Speedup

Perfect scaling. Every processor adds full value.
S = N, efficiency 100%

Sub-linear

Overhead from communication, sync, or serial sections.
S < N, efficiency < 100%

Super-linear

Cache effects or memory access improvements cause >N speedup.
S > N, efficiency > 100%

Code Example: OpenMP Parallelism

c

// OpenMP: shared-memory multiprocessing
#include <omp.h>
#include <stdio.h>

int main() {
    int n = 1000000;
    double sum = 0.0;
    double *arr = malloc(n * sizeof(double));
    
    // Initialize
    for (int i = 0; i < n; i++) arr[i] = 1.0;
    
    // Parallel reduction with 4 threads
    #pragma omp parallel for num_threads(4) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += arr[i];  // Each thread handles 250K elements
    }
    
    printf("Sum = %f\n", sum);
    printf("Threads: %d, Speedup: ~%dx\n",
           omp_get_max_threads(), omp_get_max_threads());
    return 0;
}

Multiprocessor Challenges

🔄

Cache Coherence

Multiple caches may hold stale copies of shared data. Solution: snooping protocols (MESI) or directory-based protocols.

📋

Memory Consistency

Order in which memory operations appear to different processors. Models: Sequential Consistency, Release Consistency.

🔒

Synchronization

Locks, semaphores, barriers needed to coordinate access. Can become a bottleneck at scale.

⚖️

Load Balancing

Uneven work distribution leaves some processors idle. Dynamic scheduling helps but adds overhead.

Interview Questions

What is the difference between tightly coupled and loosely coupled multiprocessors?

Tightly coupled systems share a single memory accessed equally by all processors via a bus/crossbar (SMP). Fast communication but limited scalability. Loosely coupled systems have private memory per processor, communicating via message passing (clusters). High scalability but slower communication and explicit programming required.

Explain Amdahl's Law and its implications.

Amdahl's Law: Speedup = 1/(P + (1-P)/N) where P is the serial fraction. As N → ∞, max speedup → 1/P. Even 10% serial code limits speedup to 10x regardless of cores. The implication: identify and optimize the serial bottleneck before adding more processors.

What is SMP and how does it differ from NUMA?

SMP (Symmetric Multiprocessing): all processors share a single memory with uniform access latency via a shared bus. NUMA (Non-Uniform Memory Access): each processor has local memory (fast) and can access remote memory (slower). SMP is simpler but doesn't scale beyond 8-16 processors. NUMA scales to hundreds.

What factors cause sub-linear speedup in parallel systems?

Sub-linear speedup is caused by: (1) Serial sections — Amdahl's Law limits apply. (2) Communication overhead — processors spend time sending/receiving data. (3) Synchronization overhead — locks and barriers add waiting time. (4) Load imbalance — some processors finish early and idle. (5) Resource contention — bus/memory conflicts increase with more processors.