
🎮 GPU Basics

Understand GPU architecture: CUDA cores, streaming multiprocessors, SIMT execution, memory hierarchy, and how GPUs achieve massive parallelism.

GPU vs CPU Architecture

ℹ️ Latency vs Throughput

CPUs are designed for low-latency sequential execution with large caches and complex control logic. GPUs are designed for high-throughput parallel execution with many simple cores, small caches, and massive thread-level parallelism.

CPU (Latency-Optimized)

  • Few powerful cores (2-16)
  • Large caches (L1-L3)
  • Branch prediction + out-of-order execution
  • ~1-4 hardware threads per core
  • Focus: single-thread performance

GPU (Throughput-Optimized)

  • Thousands of simple cores
  • Small caches per SM
  • SIMT execution model
  • 32 threads per warp, thousands of threads in flight
  • Focus: parallel throughput

CPU vs GPU Die Area Allocation

[Diagram: the CPU die devotes most of its area to control logic and cache with a few ALUs; the GPU die devotes most of its area to ALUs with minimal control logic and cache.]

CUDA Cores & Streaming Multiprocessors

💡 GPU Building Blocks

A GPU consists of multiple Streaming Multiprocessors (SMs), each containing many CUDA cores (simple ALUs). Each SM has its own shared memory, register file, warp scheduler, and instruction cache.

Streaming Multiprocessor (SM) Internals

[Diagram: an SM containing 16 CUDA cores, 48 KB of shared memory, a warp scheduler, and a 64K-entry register file.]

SIMT Execution Model

ℹ️ SIMT vs SIMD

SIMT (Single Instruction, Multiple Threads) is NVIDIA's execution model. Like SIMD, one instruction is executed by multiple threads. Unlike SIMD, threads are independent — each has its own program counter, registers, and can branch independently.

Warp Execution

Threads are grouped into warps (32 threads on NVIDIA GPUs). A warp executes one instruction at a time. All 32 threads execute the same instruction on different data. If threads diverge (different branches), both paths are serialized.

[Diagram: a warp of 32 threads (T0-T31) executing the same instruction in lockstep — the SIMT execution flow.]

GPU Memory Hierarchy

Global Memory

  • Size: 4-24 GB
  • Bandwidth: ~900 GB/s
  • Latency: ~400-800 cycles

Shared Memory (per SM)

  • Low latency (~30 cycles), manually managed scratchpad

Registers

  • Fastest, private to each thread
⚠️ Memory Access Patterns

GPU performance is maximized when threads in a warp access coalesced memory — consecutive threads accessing consecutive addresses. Non-coalesced (strided/random) access wastes bandwidth and reduces throughput significantly.

Parallel Processing on GPU

💡 Massive Parallelism

A modern GPU can run thousands of threads simultaneously. Occupancy = active warps / max warps per SM. Higher occupancy hides memory latency by switching to ready warps while others wait.


// CUDA kernel concept — vector addition
// Runs on GPU with thousands of threads

__global__ void vecAdd(float *A, float *B, float *C, int N) {
    // Each thread handles one element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];  // Thousands of adds in parallel
    }
}

// Launch 256 threads per block, enough blocks for all N
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Typical speedup: 50-200x over CPU for data-parallel code (workload-dependent)

Code Example: Matrix Multiplication


// GPU-accelerated matrix multiply (concept)
// C = A × B, each thread computes one element

__global__ void matMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;  // guard: N may not be a multiple of 16

    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

// Grid of 2D blocks: at least one thread per element of the N×N output
dim3 blockSize(16, 16);
dim3 gridSize((N + 15)/16, (N + 15)/16);
matMul<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);  // device pointers

// With shared memory tiling: 10-50x faster than CPU

GPU vs CPU: When to Use Each

| Characteristic   | CPU                   | GPU                    |
|------------------|-----------------------|------------------------|
| Cores            | 4-16                  | 1,000-10,000+          |
| Clock speed      | 3-5 GHz               | 1-2 GHz                |
| Cache            | Large (MB)            | Small (KB per SM)      |
| Memory bandwidth | ~50 GB/s              | ~900 GB/s              |
| Branch handling  | Excellent             | Poor (divergence)      |
| Best for         | Sequential, irregular | Parallel, regular      |
| Programmability  | Easy                  | Complex (CUDA/OpenCL)  |
| Power efficiency | ~15 W/core            | ~5 W/core              |

Interview Questions

What is the difference between CPU and GPU architecture?

CPUs are latency-optimized with few powerful cores, large caches, and complex control logic (branch prediction, OoO). GPUs are throughput-optimized with thousands of simple cores, small caches, and rely on massive thread-level parallelism to hide latency. CPUs excel at sequential tasks; GPUs excel at data-parallel workloads.

What is SIMT and how does it differ from SIMD?

SIMT (Single Instruction, Multiple Threads) is NVIDIA's execution model. Like SIMD, one instruction is executed by multiple threads. Unlike SIMD, each thread has its own program counter, registers, and can branch independently. SIMD operates on packed data elements in lockstep; SIMT operates on independent threads grouped into warps.

What is a warp and what happens on branch divergence?

A warp is a group of 32 threads that execute together on an SM. All threads in a warp execute the same instruction. When threads take different branches (divergence), all paths are serialized — one path executes while others are masked. This reduces utilization. Minimizing divergence is key to GPU performance.

Explain the GPU memory hierarchy and how it affects performance.

GPU memory hierarchy: global (high capacity, high latency, ~900 GB/s), shared (low latency, per-SM, manually managed), registers (fastest, per-thread), constant (cached, read-only), L1/L2 cache. Key optimization: keeping data in shared memory (30-cycle latency) instead of global memory (400+ cycles). Coalesced global memory access maximizes bandwidth utilization.