Parallel Processing
Breaking problems into smaller tasks that execute simultaneously. Explore types of parallelism, Amdahl's and Gustafson's Laws, shared vs distributed memory, and parallel programming models.
What is Parallel Processing?
Core Concept
Why Parallel?
Clock speeds plateaued (~4 GHz). Transistor density still grows but power limits prevent faster single cores. Parallelism is the only path to continued performance gains.
Where Used?
Scientific computing, AI/ML training, graphics rendering, databases, web servers, video encoding: virtually all performance-critical computing.
Key Challenge
Not all code can be parallelized. Dependencies, synchronization, communication overhead, and load imbalance limit speedup.
Types of Parallelism
Instruction-Level (ILP)
Multiple instructions executed simultaneously within a single core. Achieved via pipelining, superscalar execution, and out-of-order execution.
Data-Level (DLP)
Same operation performed on multiple data elements simultaneously. SIMD instructions (SSE, AVX) and vector processors.
Thread-Level (TLP)
Multiple threads or processes running concurrently on multiple cores or processors. Includes hyperthreading and multicore.
Flynn's Taxonomy
Amdahl's Law & Gustafson's Law
Amdahl's Law (Fixed-Size)
Speedup bounded by serial fraction. Adding cores gives diminishing returns.
Gustafson's Law (Scaled-Speed)
Larger problems can use more processors. Speedup scales with problem size.
Shared Memory vs Distributed Memory
Shared vs Distributed Memory Architecture
Parallel Programming Models
Code Example: Parallel Sum
Comparing sequential and parallel summation of an array. The parallel version splits the work across multiple threads and combines results.
```c
// Parallel sum using OpenMP (shared memory)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000000  // 100 million elements

int main() {
    double *data = (double*)malloc(N * sizeof(double));
    double sum = 0.0;

    // Initialize with test data
    for (int i = 0; i < N; i++) data[i] = 1.0;

    // Sequential sum (baseline)
    double t0 = omp_get_wtime();
    for (int i = 0; i < N; i++) sum += data[i];
    double t1 = omp_get_wtime();
    double t_seq = t1 - t0;
    printf("Sequential: sum=%.0f, time=%.3fs\n", sum, t_seq);

    // Parallel sum with reduction
    sum = 0.0;
    t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(8) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += data[i];  // Each thread handles N/8 elements
    }
    t1 = omp_get_wtime();
    double t_par = t1 - t0;
    printf("Parallel (8 threads): sum=%.0f, time=%.3fs\n", sum, t_par);
    printf("Speedup: %.2fx\n", t_par > 0 ? t_seq / t_par : 0.0);

    free(data);
    return 0;
}
```
MPI Parallel Sum
```c
// Parallel sum using MPI (distributed memory)
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;  // assumes size divides N evenly
    double *local = (double*)malloc(chunk * sizeof(double));
    for (int i = 0; i < chunk; i++) local[i] = 1.0;

    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < chunk; i++) local_sum += local[i];

    // Reduce all partial sums to root (rank 0)
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Global sum = %.0f (computed by %d processes)\n", global_sum, size);

    free(local);
    MPI_Finalize();
    return 0;
}
```
Interview Questions
What is the difference between Amdahl's Law and Gustafson's Law?
Amdahl's Law assumes a fixed problem size and predicts diminishing returns as processors are added: S = 1/(P + (1-P)/N), where P is the serial fraction and N is the number of processors. Gustafson's Law assumes the problem size scales with the available processors: S = N - (N-1)P. Gustafson is more optimistic: it reflects real-world usage, where larger systems tackle larger problems and the serial fraction stays proportionally smaller.
Explain instruction-level, data-level, and thread-level parallelism.
ILP (Instruction-Level): executing multiple instructions per cycle within one core via pipelining, superscalar, out-of-order execution. DLP (Data-Level): applying one operation to multiple data elements simultaneously (SIMD/vector instructions). TLP (Thread-Level): running multiple threads/processes on multiple cores, including SMT/hyperthreading. Modern CPUs exploit all three levels simultaneously.
What are the trade-offs between shared memory and distributed memory systems?
Shared memory: easier to program (global address space), fast implicit communication, but limited scalability (bus contention, cache coherence overhead). Distributed memory: highly scalable (each node has private memory), no coherence issues, but requires explicit message passing (MPI), higher latency, and manual data partitioning. Modern supercomputers use hybrid approaches with both.
How does the parallel sum algorithm work and what limits its speedup?
The parallel sum splits the array across threads/processes. Each computes a partial sum, then a reduction combines all partial sums. Speedup is limited by: (1) the serial portion: the final reduction is inherently serial (O(log P) with tree reduction); (2) load imbalance: if the data distribution is uneven, some workers sit idle; (3) memory bandwidth: all threads compete for the same memory system; (4) false sharing: threads writing to the same cache line cause invalidation overhead.