
LLMs & GPT

Large language models, fine-tuning, RAG, and prompt engineering.

What Are LLMs?

Large Language Models (LLMs) are neural networks with billions of parameters trained on massive text corpora. They predict the next token in a sequence, enabling them to generate coherent text, answer questions, translate, summarize, and reason.
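
To make "predict the next token" concrete, here is a minimal sketch using the Hugging Face transformers pipeline; the small open GPT-2 model and the installed library are assumptions made purely for illustration.

# Next-token prediction in action (sketch; assumes the transformers package is installed)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # small open model, for illustration only
out = generator("Large language models are", max_new_tokens=20)
print(out[0]["generated_text"])                          # the prompt, continued one predicted token at a time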

Popular models include:

  • GPT-4 (OpenAI): ~1.8T parameters (unofficial estimate)
  • Claude (Anthropic): ~500B parameters (unofficial estimate)
  • LLaMA 3 (Meta): up to 405B parameters
  • Gemini (Google DeepMind): natively multimodal

GPT Architecture

GPT uses a decoder-only transformer architecture. It is autoregressive: it generates one token at a time, conditioning each new token on all previous tokens via masked self-attention.

# GPT-style architecture (PyTorch sketch)
# Decoder-only transformer block
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask):
        # Masked self-attention (a token can't look ahead at future tokens)
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)   # residual + norm

        # Position-wise feed-forward network
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)    # residual + norm
        return x

# Key differences from BERT:
# - Decoder-only (vs encoder-only)
# - Causal masking (vs bidirectional attention)
# - Autoregressive generation (vs masked-token prediction)
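
A quick check that the block runs (shapes only). The causal mask here is the standard upper-triangular boolean mask; how you drive the sketch above is an assumption for illustration.

# Example: run one block on random embeddings with a causal mask
seq_len, d_model = 16, 768
x = torch.randn(2, seq_len, d_model)   # (batch, seq, dim)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out = GPTBlock(d_model)(x, causal_mask)
print(out.shape)                       # torch.Size([2, 16, 768])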

Fine-Tuning & RAG

Fine-tuning adapts a pre-trained LLM to a specific task using labeled data. RAG (Retrieval-Augmented Generation) combines LLMs with external knowledge retrieval to ground responses in factual data without retraining.

Fine-Tuning

  • Train on task-specific labeled data
  • Full fine-tuning: update all parameters
  • LoRA: train small adapter matrices (parameter-efficient)
  • QLoRA: quantize the base model + LoRA, fits on consumer GPUs (see the sketch after the LoRA example below)
  • Use when: a task-specific style or format is needed

RAG (Retrieval-Augmented Generation)

  • Retrieve relevant docs from a vector DB
  • Inject them into the LLM context as extra knowledge
  • No training required; the data stays external
  • Use when: factual accuracy & freshness are needed
  • Can substantially reduce hallucinations by grounding answers in retrieved sources

# Fine-tuning with LoRA (using Hugging Face PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # adapters are a tiny fraction of the base model

# RAG pipeline (conceptual)
# 1. Embed query → search vector database
# 2. Retrieve top-k relevant chunks
# 3. Build prompt: context + query
# 4. Generate response with LLM
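
The retrieval step in a bit more detail, as a minimal sketch: it assumes the sentence-transformers package and uses an in-memory list of documents instead of a real vector database; a production pipeline would swap in FAISS, Chroma, or similar.

# Minimal RAG retrieval sketch (in-memory, no real vector DB)
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "LoRA trains small adapter matrices on top of a frozen base model.",
    "RAG injects retrieved documents into the prompt at inference time.",
    "Transformers use self-attention to mix information across tokens.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                        # cosine similarity (vectors are unit-normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

query = "How does LoRA work?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# 'prompt' is then sent to the LLM of your choice (step 4 above)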

Prompt Engineering Basics

Prompt engineering is the practice of crafting inputs to get desired outputs from LLMs. Key techniques improve reliability, accuracy, and control.

Zero-Shot (no examples needed)

Classify this review: 'Great product!' Sentiment:

Few-Shot (provide two or three examples)

Positive: 'Love it' Negative: 'Terrible' Review: 'Okay' →

Chain-of-Thought (ask the model to reason step by step)

Q: 24 × 37 = ? Let's think step by step...
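
To show how these look in code, here is a small sketch that builds the three prompt styles as plain strings and sends one of them with the OpenAI Python client; the client, model name, and API key are assumptions, and any chat-completion API would work the same way.

# Prompt styles as plain strings (any chat LLM client can send them)
zero_shot = "Classify this review: 'Great product!'\nSentiment:"

few_shot = (
    "Classify the sentiment of each review.\n"
    "Review: 'Love it' -> Positive\n"
    "Review: 'Terrible' -> Negative\n"
    "Review: 'Okay' -> "
)

chain_of_thought = "Q: What is 24 x 37?\nLet's think step by step."

# Example call (assumes the openai package and an API key are configured)
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",                  # assumed model name
    messages=[{"role": "user", "content": chain_of_thought}],
)
print(resp.choices[0].message.content)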

Interview Questions

Q: What is the difference between a base LLM and an instruction-tuned LLM?

Base LLMs are pre-trained on raw text to predict the next token. Instruction-tuned models are further fine-tuned on instruction-response pairs (often with RLHF) to follow user instructions and produce helpful, safe outputs.

Q: What is RAG and when would you use it over fine-tuning?

RAG retrieves external knowledge and injects it into the LLM context. Use RAG when you need up-to-date information, want to reduce hallucinations, or need access to proprietary data without retraining. Fine-tuning is better for teaching a task-specific style or behavior.

Q: Explain the transformer attention mechanism in GPT.

GPT uses masked (causal) self-attention: each token can only attend to itself and prior tokens. The attention scores determine how much each previous token contributes to the current token's representation. This enables autoregressive generation.
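
A tiny numeric illustration of causal masking (a PyTorch sketch, not GPT's actual implementation): scores to future positions are set to negative infinity before the softmax, so their attention weights become zero.

# Causal self-attention scores for a 4-token sequence
import torch
import torch.nn.functional as F

seq_len, d = 4, 8
q, k = torch.randn(seq_len, d), torch.randn(seq_len, d)

scores = q @ k.T / d ** 0.5                                  # scaled dot-product scores
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))           # block attention to future tokens
weights = F.softmax(scores, dim=-1)
print(weights)   # upper triangle is 0: token i only attends to tokens <= i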

Q: What are some prompt engineering best practices?

Be specific and clear, provide examples (few-shot), use chain-of-thought for reasoning tasks, break complex tasks into steps, specify the output format (JSON, bullet points), and iterate: refine the prompt based on the responses you get.