Natural Language Processing

Text preprocessing, embeddings, and modern NLP with transformers.

Text Preprocessing

Raw text is messy. Preprocessing cleans and normalizes it before it reaches a model: lowercasing, stripping HTML tags and punctuation, removing stop words, and stemming or lemmatizing.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK data
# (newer NLTK versions may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The cats are running quickly! <br> They're playing."

# 1. Lowercase
text = text.lower()

# 2. Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)

# 3. Remove punctuation (\w keeps letters, digits, and underscores)
text = re.sub(r'[^\w\s]', '', text)

# 4. Tokenize
tokens = nltk.word_tokenize(text)
print(f"Tokens: {tokens}")

# 5. Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

# 6. Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(f"Stems: {stems}")

# 7. Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(f"Lemmas: {lemmas}")

Tokenization & Embeddings

Tokenization splits text into tokens (words, subwords, or characters). Embeddings map tokens to dense vector representations that capture semantic meaning. Word2Vec, GloVe, and BERT embeddings are common.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Word-level tokenization
texts = ["I love NLP", "Deep learning is amazing", "I love transformers"]
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print(f"Sequences: {sequences}")
print(f"Word index: {tokenizer.word_index}")

# Padding to equal length
padded = pad_sequences(sequences, maxlen=5, padding="post")
print(f"Padded: {padded}")

# Simple embedding layer
from tensorflow.keras.layers import Embedding
embedding = Embedding(input_dim=10000, output_dim=128)
# Learns a dense 128-dim vector per token; the input_length
# argument is deprecated in Keras 3 and can be omitted

# Pre-trained embeddings (GloVe): load glove.6B.100d.txt into an
# embedding matrix aligned with word_index; see the sketch below.
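
A minimal sketch of that GloVe step, assuming the glove.6B.100d.txt file from the Stanford GloVe release sits in the working directory (each line holds a word followed by its 100 floats):

import numpy as np
from tensorflow.keras.initializers import Constant

embedding_dim = 100
embedding_matrix = np.zeros((10000, embedding_dim))

with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        index = tokenizer.word_index.get(word)
        if index is not None and index < 10000:
            embedding_matrix[index] = vector

# Initialize the layer with the pre-trained vectors and freeze them
embedding = Embedding(input_dim=10000, output_dim=embedding_dim,
                      embeddings_initializer=Constant(embedding_matrix),
                      trainable=False)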

Transformers & BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer model that understands context from both left and right. It set new benchmarks on 11 NLP tasks and powers modern NLP pipelines.

from transformers import pipeline

# Sentiment analysis with pre-trained transformer
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

result = classifier("I absolutely loved this movie!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

result = classifier("The service was terrible and slow.")
print(result)
# [{'label': 'NEGATIVE', 'score': 0.9987}]

# Feature extraction (embeddings)
extractor = pipeline("feature-extraction", model="bert-base-uncased")
features = extractor("Transformers are powerful.")
print(f"Shape: {len(features[0])} tokens x {len(features[0][0])} dims")

Build: Sentiment Classifier

A complete text-classification pipeline using TF-IDF features and logistic regression: lightweight, fast to train, and interpretable.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

texts = [
    "This product is amazing!", "Worst purchase ever.",
    "Really happy with this", "Not worth the money",
    "Excellent quality", "Terrible customer service",
    "I love it", "Very disappointed",
    "Highly recommend", "Do not buy this"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("clf", LogisticRegression())
])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# Predict new text
new = ["This is fantastic!"]
pred = pipeline.predict(new)
print(f"Prediction: {'positive' if pred[0] else 'negative'}")

Interview Questions

Q: What is the difference between stemming and lemmatization?

Stemming crudely chops word endings with fixed rules (e.g., 'running' → 'run', but also 'studies' → 'studi'). Lemmatization uses a vocabulary and morphological analysis to return the dictionary form, or lemma (e.g., 'better' → 'good' when tagged as an adjective). Lemmatization is more accurate but slower.
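
A quick contrast with NLTK (the lemmatizer needs a part-of-speech hint to resolve forms like 'better'):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # studi (not a real word)
print(stemmer.stem("better"))                   # better (no dictionary lookup)
print(lemmatizer.lemmatize("studies"))          # study
print(lemmatizer.lemmatize("better", pos="a"))  # good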

Q: What are word embeddings and why are they used?

Word embeddings are dense vector representations of words where semantically similar words are close in vector space (e.g., king - man + woman ≈ queen). They capture meaning better than sparse bag-of-words representations.
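
The analogy can be checked with gensim's pre-trained vectors; a sketch assuming gensim is installed (the downloader fetches the 'glove-wiki-gigaword-100' vectors on first use):

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")  # ~128 MB download on first run

# king - man + woman ≈ queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.77)]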

Q: How does a transformer differ from an RNN/LSTM?

Transformers process all tokens in parallel using self-attention, avoiding the sequential bottleneck of RNNs. This enables much faster training and better handling of long-range dependencies, which is why transformers dominate modern NLP.
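
A minimal NumPy sketch of scaled dot-product self-attention, the core operation that replaces recurrence; every row of the output is a weighted mix of all token vectors, computed in one matrix product rather than token by token:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project tokens to queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # all-pairs similarity
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # mix values by attention

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)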

Q: What is BERT and how is it fine-tuned?

BERT is a bidirectional transformer pre-trained on masked language modeling and next-sentence prediction. Fine-tuning adds a task-specific head (e.g., classification) and trains on labeled data while updating all or some parameters.
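
A hedged sketch of a single fine-tuning step with the Hugging Face transformers library and PyTorch; the model name, learning rate, and toy batch are illustrative:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds an untrained classification head
)

batch = tokenizer(["I love it", "Very disappointed"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # forward pass also computes the loss
outputs.loss.backward()                  # backprop through head and encoder
optimizer.step()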