LLMs & GPT
Large language models, fine-tuning, RAG, and prompt engineering.
What Are LLMs?
Large Language Models (LLMs) are neural networks with billions of parameters trained on massive text corpora. They predict the next token in a sequence, enabling them to generate coherent text, answer questions, translate, summarize, and reason.
- GPT-4 (OpenAI): ~1.8T params (rumored; OpenAI has not disclosed the size)
- Claude (Anthropic): ~500B params (estimate; not officially disclosed)
- LLaMA 3 (Meta): 405B params (largest variant)
- Gemini (Google DeepMind): multimodal
GPT Architecture
GPT uses a decoder-only transformer architecture. It is autoregressive: it generates one token at a time, conditioning each new token on all previous tokens via masked self-attention.
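The autoregressive loop described above can be sketched with a toy stand-in model. Here `toy_model` is a hypothetical fake scorer (not a real LLM) used only to show the shape of the loop: score the full context, pick a token, append, repeat.

```python
import numpy as np

def toy_model(tokens):
    # Stand-in for a real LLM: returns fake logits over a tiny vocabulary.
    # A real model would run the token sequence through transformer blocks.
    rng = np.random.default_rng(sum(tokens))  # deterministic per context
    return rng.normal(size=8)                 # pretend vocab size is 8

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)            # condition on ALL previous tokens
        next_token = int(np.argmax(logits))   # greedy decoding: most likely token
        tokens.append(next_token)             # feed it back as new context
    return tokens

print(generate([1, 2, 3]))  # prompt plus 5 generated tokens
```

Real decoders replace `argmax` with temperature sampling, top-k, or nucleus sampling to trade determinism for diversity.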
# GPT-style architecture (conceptual pseudocode)
# Decoder-only transformer block
class GPTBlock:
    def __init__(self, d_model=768, num_heads=12):
        self.attention = MultiHeadAttention(num_heads, d_model)
        self.norm1 = LayerNorm(d_model)
        self.ffn = Sequential([
            Dense(d_model * 4, activation="gelu"),
            Dense(d_model),
        ])
        self.norm2 = LayerNorm(d_model)

    def __call__(self, x, mask):
        # Masked self-attention (can't look ahead)
        attn_out = self.attention(x, x, x, mask)
        x = self.norm1(x + attn_out)  # residual + norm
        # Position-wise feed-forward
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)   # residual + norm
        return x

# Key differences from BERT:
# - Decoder-only (vs encoder-only)
# - Causal masking (vs bidirectional attention)
# - Autoregressive generation (vs masked-span prediction)

Fine-Tuning & RAG
Fine-tuning adapts a pre-trained LLM to a specific task using labeled data. RAG (Retrieval-Augmented Generation) combines LLMs with external knowledge retrieval to ground responses in factual data without retraining.
Fine-Tuning
- Train on task-specific labeled data
- Full fine-tuning: update all parameters
- LoRA: train small low-rank adapter matrices (parameter-efficient)
- QLoRA: quantized base model + LoRA, fits on consumer GPUs
- Use when: a task-specific style or format is needed
RAG (Retrieval-Augmented Generation)
- Retrieve relevant docs from a vector DB
- Inject them into the LLM context as extra knowledge
- No training required; data stays external
- Use when: factual accuracy and freshness are needed
- Substantially reduces hallucinations
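The retrieval steps above can be sketched end to end. This is a minimal toy pipeline: the corpus is hypothetical, and `embed` is a bag-of-words stand-in for a real embedding model (a production system would use a trained sentence embedder and a proper vector database).

```python
import numpy as np

# Hypothetical mini-corpus standing in for a document store
docs = [
    "LoRA trains small adapter matrices instead of all weights.",
    "RAG retrieves documents and injects them into the prompt.",
    "Transformers use self-attention over token sequences.",
]

VOCAB = sorted({w for d in docs for w in d.lower().split()})

def embed(text):
    # Toy bag-of-words embedder, normalized so dot product = cosine similarity
    vec = np.array([text.lower().split().count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

doc_vecs = np.stack([embed(d) for d in docs])   # the "vector database"

def retrieve(query, k=2):
    scores = doc_vecs @ embed(query)            # 1. embed query, score every doc
    top = np.argsort(scores)[::-1][:k]          # 2. take the top-k chunks
    return [docs[i] for i in top]

def build_prompt(query):
    context = "\n".join(retrieve(query))        # 3. inject retrieved context
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# 4. this prompt would then be sent to the LLM for generation
print(build_prompt("How does RAG retrieve documents?"))
```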
# Fine-tuning with LoRA (using Hugging Face PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
print(f"Trainable params: {model.num_parameters(only_trainable=True):,}")
# RAG pipeline (conceptual)
# 1. Embed query → search vector database
# 2. Retrieve top-k relevant chunks
# 3. Build prompt: retrieved context + query
# 4. Generate response with LLM

Prompt Engineering Basics
Prompt engineering is the practice of crafting inputs to get desired outputs from LLMs. Key techniques improve reliability, accuracy, and control.
Zero-Shot (no examples needed)
    Classify this review: 'Great product!'
    Sentiment:

Few-Shot (provide 2-3 examples)
    Positive: 'Love it'
    Negative: 'Terrible'
    Review: 'Okay' →

Chain-of-Thought (reason step-by-step)
    Q: 24 × 37 = ?
    Let's think step by step...
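The few-shot pattern above is usually assembled programmatically. A minimal sketch, with a hypothetical `few_shot_prompt` helper and the same sentiment examples:

```python
def few_shot_prompt(examples, query, instruction="Classify the sentiment."):
    # Build a few-shot prompt: instruction, labeled examples, then the query
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text!r}\nSentiment: {label}")
    lines.append(f"Review: {query!r}\nSentiment:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("Love it", "Positive"), ("Terrible", "Negative")],
    "Okay",
)
print(prompt)
```

Appending a line like "Let's think step by step." to the prompt turns the same scaffold into a chain-of-thought prompt.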
Interview Questions
Q: What is the difference between a base LLM and an instruction-tuned LLM?
Base LLMs are pre-trained on raw text to predict the next token. Instruction-tuned models are further fine-tuned on instruction-response pairs (often with RLHF) to follow user instructions and produce helpful, safe outputs.
Q: What is RAG and when would you use it over fine-tuning?
RAG retrieves external knowledge and injects it into the LLM context. Use RAG when you need up-to-date information, want to reduce hallucinations, or must access proprietary data without retraining. Fine-tuning is better for task-specific style or behavior.
Q: Explain the transformer attention mechanism in GPT.
GPT uses masked (causal) self-attention: each token can only attend to itself and prior tokens. The attention scores determine how much each previous token contributes to the current token's representation. This enables autoregressive generation.
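This masked attention can be sketched as a single untrained head in NumPy (learned query/key/value projections are omitted; the input vectors are used directly):

```python
import numpy as np

def causal_self_attention(x):
    # x: (seq_len, d) token representations; single head, no learned weights
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)              # scaled dot-product scores
    mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    scores[mask] = -np.inf                     # causal mask: hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                         # weighted sum of "values"

x = np.random.default_rng(0).normal(size=(4, 8))
out = causal_self_attention(x)
```

Because of the mask, the first token attends only to itself, so its output equals its input; later tokens mix in everything before them, which is exactly what autoregressive generation requires.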
Q: What are some prompt engineering best practices?
Be specific and clear, provide examples (few-shot), use chain-of-thought for reasoning, split complex tasks into steps, specify the output format (JSON, bullet points), and iterate by refining prompts based on the responses you get.
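Several of these practices can be combined in one prompt. A hypothetical example that is specific, fixes the output format, and includes the input inline:

```python
prompt = """Extract the product and the sentiment from the review below.
Respond ONLY with JSON in this exact format:
{"product": "<string>", "sentiment": "positive" | "negative" | "neutral"}

Review: "The headphones sound great but shipping was slow."
JSON:"""

print(prompt)
```

Constraining the output to a machine-readable format like JSON makes the response easy to validate and parse downstream.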