Machine Learning
Supervised and unsupervised learning, train/test evaluation with cross-validation, and an interactive KNN demo.
Supervised Learning
Models learn from labeled data (input → output pairs). Goal: predict labels for new data.
Regression:
Predict continuous values (price, temperature)
Classification:
Predict discrete classes (spam/not spam, cat/dog)
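As a concrete contrast, here is a minimal sketch of both tasks using scikit-learn's bundled datasets; the specific models (LinearRegression, LogisticRegression) are illustrative choices, not requirements.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes, load_breast_cancer
# Regression: predict a continuous disease-progression score
X_r, y_r = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:1]))  # a continuous value
# Classification: predict a discrete class (malignant vs. benign)
X_c, y_c = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_c, y_c)
print(clf.predict(X_c[:1]))  # a class label (0 or 1)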
Unsupervised Learning
Models find patterns in unlabeled data. No ground-truth outputs provided.
Clustering:
Group similar data points (K-Means, DBSCAN)
Dimensionality Reduction:
Compress data while preserving structure (PCA, t-SNE)
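A minimal sketch of both ideas on the iris features, deliberately ignoring the labels; K-Means with 3 clusters and a 2-component PCA are illustrative settings.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
X = load_iris().data  # labels ignored: this is unsupervised
# Clustering: group points into 3 clusters without labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])  # cluster assignment per sample
# Dimensionality reduction: compress 4 features down to 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape, pca.explained_variance_ratio_)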
Train/Test Split & Cross-Validation
Always split data into training and test sets to evaluate generalization. Cross-validation (k-fold) provides more robust performance estimates by rotating the validation fold.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
data = load_iris()
X, y = data.data, data.target
# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducibility
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"Test accuracy: {acc:.3f}")
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")Interactive KNN Classifier
In the interactive demo, clicking anywhere on the chart classifies that point with KNN: the algorithm finds the k nearest training points and predicts the majority class among them. Adding more training points shifts the decision boundary.
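The demo's logic can be sketched from scratch in a few lines; the toy coordinates below are made up for illustration.
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, point, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - point, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
# Toy training set: two classes in 2D (made-up coordinates)
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))    # -> 0
print(knn_predict(X_train, y_train, np.array([6.5, 6])))  # -> 1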
Interview Questions
Q: What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data (input-output pairs) to predict outputs for new inputs. Unsupervised learning finds patterns in unlabeled data without predefined outputs.
Q: What is overfitting and how do you prevent it?
Overfitting occurs when a model learns training data noise instead of the underlying pattern. Prevention: more data, feature reduction, regularization (L1/L2), cross-validation, early stopping, simpler models.
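As one concrete illustration, here is a sketch of the regularization technique: an L2 penalty (Ridge) constrains the weights of an over-parameterized polynomial model. The alpha values are arbitrary illustrations.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
# Degree-2 polynomial features invite overfitting;
# a larger L2 penalty (alpha) shrinks the weights
for alpha in (0.01, 1.0, 100.0):
    model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"alpha={alpha}: mean CV R^2 = {scores.mean():.3f}")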
Q: Explain the bias-variance tradeoff.
Bias is error from underfitting (model too simple); variance is error from overfitting (model too complex). Expected error decomposes into bias² + variance + irreducible noise, so increasing model complexity reduces bias but increases variance. The goal is to minimize total error.
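A quick way to see the tradeoff is to sweep model complexity and watch cross-validated performance; the tree depths below are illustrative choices.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Shallow tree -> high bias (underfits); very deep tree -> high variance
for depth in (1, 3, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")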
Q: What is k-fold cross-validation?
Data is split into k equal folds. The model trains on k-1 folds and validates on the remaining fold, rotating k times. The final score is the average across all folds—more reliable than a single train/test split.
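The rotation can be made explicit with KFold instead of the cross_val_score shortcut used above; this sketch mirrors the earlier iris example.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
print(f"Fold scores: {scores}, mean: {np.mean(scores):.3f}")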