🎮 GPU Basics
Understand GPU architecture: CUDA cores, streaming multiprocessors, SIMT execution, memory hierarchy, and how GPUs achieve massive parallelism.
GPU vs CPU Architecture
Latency vs Throughput
CPU (Latency-Optimized)
- Few powerful cores (2-16)
- Large caches (L1-L3)
- Branch prediction + out-of-order execution
- ~1-4 threads per core
- Focus: single-thread performance
GPU (Throughput-Optimized)
- Thousands of simple cores
- Small caches per SM
- SIMT execution model
- 32 threads per warp, many warps resident per SM
- Focus: parallel throughput
CPU vs GPU Die Area Allocation
CUDA Cores & Streaming Multiprocessors
GPU Building Blocks
Streaming Multiprocessor (SM) Internals
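The building blocks named above (SMs, warps, shared memory) can be inspected on a real device through the CUDA runtime API. A minimal sketch, assuming a CUDA-capable GPU and compilation with nvcc; reported values vary by GPU generation:

```c
// Minimal sketch: query SM and warp parameters via the CUDA runtime API.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; error checking omitted

    printf("GPU: %s\n", prop.name);
    printf("Streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size: %d threads\n", prop.warpSize);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```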
SIMT Execution Model
SIMT vs SIMD
Warp Execution
Threads are grouped into warps (32 threads on NVIDIA GPUs). A warp executes one instruction at a time: all 32 threads execute the same instruction on different data. If threads within a warp diverge (take different branches), the paths are serialized: the warp executes each branch in turn while the threads on the other branch are masked off, which lowers utilization.
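A minimal sketch of divergence in practice: in the hypothetical kernel below, even- and odd-numbered threads of the same warp take different branches, so the warp runs both paths back to back with half the lanes masked each time. The kernel name and the even/odd split are illustrative only.

```c
// Hypothetical kernel illustrating warp divergence.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) {                // even threads take one path...
            data[i] = data[i] * 2.0f;
        } else {                         // ...odd threads take the other;
            data[i] = data[i] + 1.0f;    // the warp serializes both paths
        }
    }
}
```

Divergence disappears when the branch condition is uniform across a warp, for example branching on (i / warpSize) % 2 instead of (i % 2), so that all 32 threads of a warp choose the same path.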
Warp of 32 threads executing same instruction
SIMT Execution Flow
GPU Memory Hierarchy
Global Memory
Memory Access Patterns
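A common way to illustrate access patterns is a pair of copy kernels: one where consecutive threads read consecutive addresses (coalesced, merged by the hardware into a few wide transactions) and one where consecutive threads read addresses a stride apart (each warp touches many cache lines and wastes bandwidth). This is a sketch; the kernel names and the stride parameter are illustrative.

```c
// Coalesced: thread i reads element i, so a warp touches 32 consecutive
// floats and the loads merge into a few wide memory transactions.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so the 32 threads of a warp
// hit many different cache lines and bandwidth utilization drops sharply.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```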
Parallel Processing on GPU
Massive Parallelism
```c
// CUDA kernel concept: vector addition
// Runs on the GPU with thousands of threads
__global__ void vecAdd(float *A, float *B, float *C, int N) {
    // Each thread handles one element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];   // Thousands of adds in parallel
    }
}

// Launch 256 threads per block, enough blocks to cover all N elements
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Speedup: 50-200x over CPU for data-parallel code
```
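The launch above assumes d_A, d_B, and d_C already point to GPU global memory. A minimal host-side sketch of the surrounding setup (allocate, copy in, launch, copy out, free) could look like the following; runVecAdd is a hypothetical wrapper and error checking is omitted for brevity.

```c
// Host-side sketch around the vecAdd kernel above (error checking omitted).
#include <cuda_runtime.h>

void runVecAdd(const float *h_A, const float *h_B, float *h_C, int N) {
    size_t bytes = N * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc(&d_A, bytes);                               // GPU global memory
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```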
Code Example: Matrix Multiplication

```c
// GPU-accelerated matrix multiply (concept)
// C = A × B, each thread computes one element
__global__ void matMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {            // guard against out-of-range threads
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// 2D grid of 2D blocks: at least N×N threads total
dim3 blockSize(16, 16);
dim3 gridSize((N + 15) / 16, (N + 15) / 16);
matMul<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);
// With shared memory tiling: 10-50x faster than CPU
```
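The tiling comment above is worth unpacking. In one common scheme, each thread block stages a TILE×TILE sub-block of A and B in shared memory, synchronizes, and then reuses those values for TILE multiply-adds per thread before going back to global memory. The sketch below assumes N is a multiple of TILE; the kernel name and tile size are illustrative.

```c
#define TILE 16

// Shared-memory tiled matrix multiply (sketch; assumes N % TILE == 0).
__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A in fast per-SM memory
    __shared__ float Bs[TILE][TILE];   // tile of B in fast per-SM memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++) // reuse the shared tiles TILE times
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tiles
    }
    C[row * N + col] = sum;
}

// Launch: dim3 blockSize(TILE, TILE); dim3 gridSize(N / TILE, N / TILE);
```

Each value loaded from global memory is reused TILE times out of shared memory, cutting global-memory traffic by roughly a factor of TILE, which is where the speedup comes from.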
GPU vs CPU: When to Use Each

| Characteristic | CPU | GPU |
|---|---|---|
| Cores | 4-16 | 1000-10000+ |
| Clock speed | 3-5 GHz | 1-2 GHz |
| Cache | Large (MB) | Small (KB/SM) |
| Memory BW | ~50 GB/s | ~900 GB/s |
| Branch handling | Excellent | Poor (divergence) |
| Best for | Sequential, irregular | Parallel, regular |
| Programmability | Easy | Complex (CUDA/OpenCL) |
| Power per core | ~15 W | <0.1 W |
Interview Questions
What is the difference between CPU and GPU architecture?
CPUs are latency-optimized with few powerful cores, large caches, and complex control logic (branch prediction, out-of-order execution). GPUs are throughput-optimized with thousands of simple cores, small caches, and rely on massive thread-level parallelism to hide latency. CPUs excel at sequential tasks; GPUs excel at data-parallel workloads.
What is SIMT and how does it differ from SIMD?
SIMT (Single Instruction, Multiple Threads) is NVIDIA's execution model. Like SIMD, one instruction is executed by multiple threads. Unlike SIMD, each thread has its own program counter, registers, and can branch independently. SIMD operates on packed data elements in lockstep; SIMT operates on independent threads grouped into warps.
What is a warp and what happens on branch divergence?
A warp is a group of 32 threads that execute together on an SM. All threads in a warp execute the same instruction. When threads take different branches (divergence), all paths are serialized — one path executes while others are masked. This reduces utilization. Minimizing divergence is key to GPU performance.
Explain the GPU memory hierarchy and how it affects performance.
GPU memory hierarchy: global (high capacity, high latency, ~900 GB/s), shared (low latency, per-SM, manually managed), registers (fastest, per-thread), constant (cached, read-only), L1/L2 cache. Key optimization: keeping data in shared memory (30-cycle latency) instead of global memory (400+ cycles). Coalesced global memory access maximizes bandwidth utilization.