🎮 GPU Basics
Understand GPU architecture: CUDA cores, streaming multiprocessors, SIMT execution, memory hierarchy, and how GPUs achieve massive parallelism.
GPU vs CPU Architecture
Latency vs Throughput
CPU (Latency-Optimized)
- Few powerful cores (2-16)
- Large caches (L1-L3)
- Branch prediction + out-of-order execution
- ~1-4 threads per core
- Focus: single-thread performance
GPU (Throughput-Optimized)
- Thousands of simple cores
- Small caches per SM
- SIMT execution model
- 32 threads per warp, many warps resident per SM
- Focus: parallel throughput
CPU vs GPU Die Area Allocation
CUDA Cores & Streaming Multiprocessors
GPU Building Blocks
Streaming Multiprocessor (SM) Internals
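The building blocks named above (SMs, warps, shared memory) can be inspected on a real device through the CUDA runtime API. A minimal sketch, assuming a CUDA-capable GPU and compilation with nvcc; reported values vary by GPU generation:

```c
// Minimal sketch: query SM and warp parameters via the CUDA runtime API.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; error checking omitted

    printf("GPU: %s\n", prop.name);
    printf("Streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size: %d threads\n", prop.warpSize);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```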
SIMT Execution Model
SIMT vs SIMD
Warp Execution
Threads are grouped into warps (32 threads on NVIDIA GPUs). A warp executes one instruction at a time: all 32 threads execute the same instruction on different data. If threads within a warp diverge (take different branches), the paths are serialized: the warp executes each branch in turn while the threads on the other branch are masked off, which lowers utilization.
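A minimal sketch of divergence in practice: in the hypothetical kernel below, even- and odd-numbered threads of the same warp take different branches, so the warp runs both paths back to back with half the lanes masked each time. The kernel name and the even/odd split are illustrative only.

```c
// Hypothetical kernel illustrating warp divergence.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) {                // even threads take one path...
            data[i] = data[i] * 2.0f;
        } else {                         // ...odd threads take the other;
            data[i] = data[i] + 1.0f;    // the warp serializes both paths
        }
    }
}
```

Divergence disappears when the branch condition is uniform across a warp, for example branching on (i / warpSize) % 2 instead of (i % 2), so that all 32 threads of a warp choose the same path.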
Warp of 32 threads executing same instruction
SIMT Execution Flow
GPU Memory Hierarchy
Global Memory
Memory Access Patterns
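A common way to illustrate access patterns is a pair of copy kernels: one where consecutive threads read consecutive addresses (coalesced, merged by the hardware into a few wide transactions) and one where consecutive threads read addresses a stride apart (each warp touches many cache lines and wastes bandwidth). This is a sketch; the kernel names and the stride parameter are illustrative.

```c
// Coalesced: thread i reads element i, so a warp touches 32 consecutive
// floats and the loads merge into a few wide memory transactions.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so the 32 threads of a warp
// hit many different cache lines and bandwidth utilization drops sharply.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```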
Parallel Processing on GPU
Massive Parallelism
```c
// CUDA kernel concept: vector addition
// Runs on the GPU with thousands of threads
__global__ void vecAdd(float *A, float *B, float *C, int N) {
    // Each thread handles one element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];   // Thousands of adds in parallel
    }
}

// Launch 256 threads per block, enough blocks to cover all N elements
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Speedup: 50-200x over CPU for data-parallel code
```
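The launch above assumes d_A, d_B, and d_C already point to GPU global memory. A minimal host-side sketch of the surrounding setup (allocate, copy in, launch, copy out, free) could look like the following; runVecAdd is a hypothetical wrapper and error checking is omitted for brevity.

```c
// Host-side sketch around the vecAdd kernel above (error checking omitted).
#include <cuda_runtime.h>

void runVecAdd(const float *h_A, const float *h_B, float *h_C, int N) {
    size_t bytes = N * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc(&d_A, bytes);                               // GPU global memory
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```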
Code Example: Matrix Multiplication

```c
// GPU-accelerated matrix multiply (concept)
// C = A × B, each thread computes one element
__global__ void matMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {            // guard against out-of-range threads
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// 2D grid of 2D blocks: at least N×N threads total
dim3 blockSize(16, 16);
dim3 gridSize((N + 15) / 16, (N + 15) / 16);
matMul<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);
// With shared memory tiling: 10-50x faster than CPU
```
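The tiling comment above is worth unpacking. In one common scheme, each thread block stages a TILE×TILE sub-block of A and B in shared memory, synchronizes, and then reuses those values for TILE multiply-adds per thread before going back to global memory. The sketch below assumes N is a multiple of TILE; the kernel name and tile size are illustrative.

```c
#define TILE 16

// Shared-memory tiled matrix multiply (sketch; assumes N % TILE == 0).
__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A in fast per-SM memory
    __shared__ float Bs[TILE][TILE];   // tile of B in fast per-SM memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++) // reuse the shared tiles TILE times
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tiles
    }
    C[row * N + col] = sum;
}

// Launch: dim3 blockSize(TILE, TILE); dim3 gridSize(N / TILE, N / TILE);
```

Each value loaded from global memory is reused TILE times out of shared memory, cutting global-memory traffic by roughly a factor of TILE, which is where the speedup comes from.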
GPU vs CPU: When to Use Each

| Characteristic | CPU | GPU |
|---|---|---|
| Cores | 4-16 | 1000-10000+ |
| Clock speed | 3-5 GHz | 1-2 GHz |
| Cache | Large (MB) | Small (KB/SM) |
| Memory BW | ~50 GB/s | ~900 GB/s |
| Branch handling | Excellent | Poor (divergence) |
| Best for | Sequential, irregular | Parallel, regular |
| Programmability | Easy | Complex (CUDA/OpenCL) |
| Power per core | ~15 W | <0.1 W |
Interview Questions
What is the difference between CPU and GPU architecture?
CPUs are latency-optimized with few powerful cores, large caches, and complex control logic (branch prediction, out-of-order execution). GPUs are throughput-optimized with thousands of simple cores, small caches, and rely on massive thread-level parallelism to hide latency. CPUs excel at sequential tasks; GPUs excel at data-parallel workloads.
What is SIMT and how does it differ from SIMD?
SIMT (Single Instruction, Multiple Threads) is NVIDIA's execution model. Like SIMD, one instruction is executed by multiple threads. Unlike SIMD, each thread has its own program counter, registers, and can branch independently. SIMD operates on packed data elements in lockstep; SIMT operates on independent threads grouped into warps.
What is a warp and what happens on branch divergence?
A warp is a group of 32 threads that execute together on an SM. All threads in a warp execute the same instruction. When threads take different branches (divergence), all paths are serialized — one path executes while others are masked. This reduces utilization. Minimizing divergence is key to GPU performance.
Explain the GPU memory hierarchy and how it affects performance.
GPU memory hierarchy: global (high capacity, high latency, ~900 GB/s), shared (low latency, per-SM, manually managed), registers (fastest, per-thread), constant (cached, read-only), L1/L2 cache. Key optimization: keeping data in shared memory (30-cycle latency) instead of global memory (400+ cycles). Coalesced global memory access maximizes bandwidth utilization.