
🚀 Parallel Processing

Breaking problems into smaller tasks that execute simultaneously. Explore types of parallelism, Amdahl's and Gustafson's Laws, shared vs distributed memory, and parallel programming models.

What is Parallel Processing?

ℹ️ Core Concept

Parallel processing is the simultaneous use of multiple compute resources to solve a computational problem. Instead of doing one thing at a time, the problem is broken into smaller independent parts that execute concurrently, dramatically reducing total execution time.

Why Parallel?

Clock speeds have plateaued (~4 GHz). Transistor density still grows, but power and heat limits prevent faster single cores. Parallelism is the primary path to continued performance gains.

Where Used?

Scientific computing, AI/ML training, graphics rendering, databases, web servers, video encoding: virtually all performance-critical computing.

Key Challenge

Not all code can be parallelized. Dependencies, synchronization, communication overhead, and load imbalance limit speedup.
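
As a small illustration of such a dependency, the hypothetical sketch below computes a running prefix sum: each iteration needs the result of the previous one, so the naive loop cannot simply be split across cores (parallel scan algorithms exist, but they require restructuring the computation).

c

// Hypothetical sketch: a loop-carried dependency that blocks naive parallelization.
#include <stdio.h>

#define M 8

int main(void) {
    double data[M] = {1, 2, 3, 4, 5, 6, 7, 8};
    double prefix[M];

    prefix[0] = data[0];
    for (int i = 1; i < M; i++)
        prefix[i] = prefix[i - 1] + data[i];  // depends on the previous iteration

    for (int i = 0; i < M; i++) printf("%.0f ", prefix[i]);  // 1 3 6 10 15 21 28 36
    printf("\n");
    return 0;
}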

Decompose → Assign → Execute → Combine → Output

Types of Parallelism

⚡ Instruction-Level (ILP)

Multiple instructions executed simultaneously within a single core. Achieved via pipelining, superscalar execution, and out-of-order execution.

Modern CPUs can execute 4-6 instructions per cycle
📊 Data-Level (DLP)

Same operation performed on multiple data elements simultaneously. SIMD instructions (SSE, AVX) and vector processors.

AVX-512: 16 single-precision ops per instruction
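
For a concrete illustration of DLP, the hypothetical sketch below uses AVX intrinsics to add eight single-precision floats with a single vector instruction; it assumes an x86-64 CPU with AVX support and a file name (avx_add.c) chosen for illustration.

c

// Hypothetical DLP sketch: one AVX instruction adds 8 floats at once.
// Assumes AVX support; compile with: gcc -mavx avx_add.c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     // load 8 single-precision values
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // 8 additions in one instruction
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);  // prints "9" eight times
    printf("\n");
    return 0;
}
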
🧵 Thread-Level (TLP)

Multiple threads or processes running concurrently on multiple cores or processors. Includes hyperthreading and multicore.

16-core CPU running 32 threads via SMT
💡 Flynn's Taxonomy

Parallel architectures are classified by Flynn's Taxonomy: SISD (single instruction, single data: the classic uniprocessor), SIMD (vector processors, GPUs), MISD (rare; used in some fault-tolerant systems), MIMD (multicore CPUs, clusters: the most common parallel architecture).

Amdahl's Law & Gustafson's Law

Amdahl's Law (Fixed Problem Size)

S = 1 / (P + (1 - P)/N), where P is the serial fraction and N is the number of processors.

Speedup is bounded by the serial fraction: even with infinitely many processors, S cannot exceed 1/P, so adding cores gives diminishing returns.

Gustafson's Law (Scaled Speedup)

S = N - (N - 1) × P

Larger problems can use more processors. If the problem size grows with N so that the serial fraction P stays constant, speedup scales almost linearly with the number of processors.

Worked example: with N = 8 cores and a serial fraction of 10% (parallel portion 90%), Amdahl's Law gives a speedup of 1/(0.1 + 0.9/8) ≈ 4.71×, Gustafson's Law gives 8 - 7 × 0.1 = 7.30×, and Amdahl's limit with infinitely many cores is 1/0.1 = 10×.
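
The hypothetical sketch below evaluates both laws directly (with P as the serial fraction and N as the core count, as above) and reproduces the numbers from the worked example.

c

// Hypothetical sketch: evaluating Amdahl's and Gustafson's Laws.
#include <stdio.h>

// P = serial fraction (0..1), N = number of processors
double amdahl(double P, int N)    { return 1.0 / (P + (1.0 - P) / N); }
double gustafson(double P, int N) { return N - (N - 1) * P; }

int main(void) {
    double P = 0.10;  // 10% serial, 90% parallel
    int N = 8;        // 8 cores

    printf("Amdahl speedup:        %.2fx\n", amdahl(P, N));     // ~4.71x
    printf("Gustafson speedup:     %.2fx\n", gustafson(P, N));  // 7.30x
    printf("Amdahl limit (N->inf): %.1fx\n", 1.0 / P);          // 10.0x
    return 0;
}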

Shared Memory vs Distributed Memory

[Figure: Shared vs Distributed Memory Architecture. Left (shared memory, UMA/SMP): CPU0-CPU3 connected through a shared bus to a single shared memory. Right (distributed memory): CPU0-CPU3 each paired with a local memory (Mem0-Mem3), connected through an interconnect network and communicating by message passing.]

Parallel Programming Models

Code Example: Parallel Sum

Comparing sequential and parallel summation of an array. The parallel version splits the work across multiple threads and combines results.

c

// Parallel sum using OpenMP (shared memory)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000000  // 100 million elements

int main() {
    double *data = (double*)malloc(N * sizeof(double));
    double sum = 0.0;
    
    // Initialize with test data
    for (int i = 0; i < N; i++) data[i] = 1.0;
    
    // Sequential sum (baseline)
    double t0 = omp_get_wtime();
    for (int i = 0; i < N; i++) sum += data[i];
    double t1 = omp_get_wtime();
    double seq_time = t1 - t0;  // keep the sequential time for the speedup calculation
    printf("Sequential: sum=%.0f, time=%.3fs\n", sum, seq_time);
    
    // Parallel sum with reduction
    sum = 0.0;
    t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(8) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += data[i];  // Each thread handles N/8 elements
    }
    t1 = omp_get_wtime();
    printf("Parallel (8 threads): sum=%.0f, time=%.3fs\n", sum, t1 - t0);
    printf("Speedup: %.2fx\n", (t1 - t0) > 0 ? (t1 - t0) / (t1 - t0) : 0);
    
    free(data);
    return 0;
}
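
To try this (assuming GCC with OpenMP support and a file saved as parallel_sum.c), compile with gcc -O2 -fopenmp parallel_sum.c -o parallel_sum and run ./parallel_sum. The measured speedup is typically well below 8× because a simple sum is memory-bandwidth-bound rather than compute-bound.
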
💡 MPI Parallel Sum

In distributed memory systems, each MPI process computes a partial sum on its local data, then uses MPI_Reduce to combine all partial sums into the final result on the root process.

c

// Parallel sum using MPI (distributed memory)
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    
    int chunk = N / size;  // assumes N is evenly divisible by the number of processes
    double *local = (double*)malloc(chunk * sizeof(double));
    for (int i = 0; i < chunk; i++) local[i] = 1.0;
    
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < chunk; i++) local_sum += local[i];
    
    // Reduce all partial sums to root (rank 0)
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    
    if (rank == 0)
        printf("Global sum = %.0f (computed by %d processes)\n", global_sum, size);
    
    free(local);
    MPI_Finalize();
    return 0;
}
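
To run this (assuming an MPI implementation such as Open MPI or MPICH, and a file named mpi_sum.c for illustration), compile with mpicc mpi_sum.c -o mpi_sum and launch with mpirun -np 4 ./mpi_sum; the process count is illustrative.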

Interview Questions

What is the difference between Amdahl's Law and Gustafson's Law?

Amdahl's Law assumes a fixed problem size and predicts diminishing returns as processors increase: S = 1/(P + (1-P)/N), where P is the serial fraction and N is the number of processors. Gustafson's Law assumes the problem size scales with the available processors: S = N - (N-1)P. Gustafson is more optimistic: it reflects real-world usage where larger systems tackle larger problems, keeping the serial fraction proportionally smaller.

Explain instruction-level, data-level, and thread-level parallelism.

ILP (Instruction-Level): executing multiple instructions per cycle within one core via pipelining, superscalar, out-of-order execution. DLP (Data-Level): applying one operation to multiple data elements simultaneously (SIMD/vector instructions). TLP (Thread-Level): running multiple threads/processes on multiple cores, including SMT/hyperthreading. Modern CPUs exploit all three levels simultaneously.

What are the trade-offs between shared memory and distributed memory systems?

Shared memory: easier to program (global address space), fast implicit communication, but limited scalability (bus contention, cache coherence overhead). Distributed memory: highly scalable (each node has private memory), no coherence issues, but requires explicit message passing (MPI), higher latency, and manual data partitioning. Modern supercomputers use hybrid approaches with both.
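
As a rough sketch of the hybrid approach mentioned above (hypothetical, assuming both an MPI implementation and OpenMP are available), each MPI rank can run OpenMP threads over its local data:

c

// Hypothetical hybrid sketch: MPI between nodes, OpenMP threads within each node.
// Compile with something like: mpicc -fopenmp hybrid_sum.c -o hybrid_sum
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Threads share this rank's address space; ranks exchange data by messages.
    double local_sum = 0.0, global_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Hybrid global sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}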

How does the parallel sum algorithm work and what limits its speedup?

The parallel sum splits the array across threads or processes. Each computes a partial sum, then a reduction combines all partial sums. Speedup is limited by: (1) the serial portion, since the final reduction is inherently serial (O(log P) with a tree reduction); (2) load imbalance, if the data distribution is uneven; (3) memory bandwidth, since all threads compete for the memory system; (4) false sharing, where threads writing to data on the same cache line cause cache-invalidation overhead.
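
As a small illustration of the tree reduction mentioned above (a hypothetical sketch, assuming the per-thread partial sums have already been gathered into one array and that the count is a power of two):

c

// Hypothetical sketch: combining P partial sums in O(log P) tree-reduction steps.
#include <stdio.h>

#define P 8  // number of partial sums; a power of two for simplicity

int main(void) {
    double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};  // e.g., one partial sum per thread

    // Each pass halves the number of active values: log2(P) passes in total.
    for (int stride = P / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            partial[i] += partial[i + stride];

    printf("Total = %.0f\n", partial[0]);  // 36
    return 0;
}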