Machine Learning
Supervised and unsupervised learning, train/test evaluation with cross-validation, and an interactive KNN demo.
Supervised Learning
Models learn from labeled data (input → output pairs). Goal: predict labels for new data.
Regression:
Predict continuous values (price, temperature)
Classification:
Predict discrete classes (spam/not spam, cat/dog)
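As a concrete contrast, here is a minimal sketch of both tasks using scikit-learn's bundled datasets; the specific models (LinearRegression, LogisticRegression) are illustrative choices, not requirements.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes, load_breast_cancer
# Regression: predict a continuous disease-progression score
X_r, y_r = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:1]))  # a continuous value
# Classification: predict a discrete class (malignant vs. benign)
X_c, y_c = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_c, y_c)
print(clf.predict(X_c[:1]))  # a class label (0 or 1)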
Unsupervised Learning
Models find patterns in unlabeled data. No ground-truth outputs provided.
Clustering:
Group similar data points (K-Means, DBSCAN)
Dimensionality Reduction:
Compress data while preserving structure (PCA, t-SNE)
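A minimal sketch of both ideas on the iris features, deliberately ignoring the labels; K-Means with 3 clusters and a 2-component PCA are illustrative settings.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
X = load_iris().data  # labels ignored: this is unsupervised
# Clustering: group points into 3 clusters without labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])  # cluster assignment per sample
# Dimensionality reduction: compress 4 features down to 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape, pca.explained_variance_ratio_)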
Train/Test Split & Cross-Validation
Always split data into training and test sets to evaluate generalization. Cross-validation (k-fold) provides more robust performance estimates by rotating the validation fold.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
data = load_iris()
X, y = data.data, data.target
# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducibility
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"Test accuracy: {acc:.3f}")
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")Interactive KNN Classifier
In the interactive demo, clicking anywhere on the chart classifies that point with KNN: the algorithm finds the k nearest training points and predicts the majority class among them. Adding more training points shifts the decision boundary.
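The demo's logic can be sketched from scratch in a few lines; the toy coordinates below are made up for illustration.
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, point, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - point, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
# Toy training set: two classes in 2D (made-up coordinates)
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))    # -> 0
print(knn_predict(X_train, y_train, np.array([6.5, 6])))  # -> 1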
Interview Questions
Q: What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data (input-output pairs) to predict outputs for new inputs. Unsupervised learning finds patterns in unlabeled data without predefined outputs.
Q: What is overfitting and how do you prevent it?
Overfitting occurs when a model learns training data noise instead of the underlying pattern. Prevention: more data, feature reduction, regularization (L1/L2), cross-validation, early stopping, simpler models.
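As one concrete illustration, here is a sketch of the regularization technique: an L2 penalty (Ridge) constrains the weights of an over-parameterized polynomial model. The alpha values are arbitrary illustrations.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
# Degree-2 polynomial features invite overfitting;
# a larger L2 penalty (alpha) shrinks the weights
for alpha in (0.01, 1.0, 100.0):
    model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"alpha={alpha}: mean CV R^2 = {scores.mean():.3f}")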
Q: Explain the bias-variance tradeoff.
Bias is error from underfitting (model too simple); variance is error from overfitting (model too complex). Expected error decomposes into bias² + variance + irreducible noise, so increasing model complexity reduces bias but increases variance. The goal is to minimize total error.
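A quick way to see the tradeoff is to sweep model complexity and watch cross-validated performance; the tree depths below are illustrative choices.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Shallow tree -> high bias (underfits); very deep tree -> high variance
for depth in (1, 3, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")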
Q: What is k-fold cross-validation?
Data is split into k equal folds. The model trains on k-1 folds and validates on the remaining fold, rotating k times. The final score is the average across all folds—more reliable than a single train/test split.
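The rotation can be made explicit with KFold instead of the cross_val_score shortcut used above; this sketch mirrors the earlier iris example.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
print(f"Fold scores: {scores}, mean: {np.mean(scores):.3f}")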