Statistics for Data Science
Foundational statistical concepts for data analysis and ML.
Descriptive Statistics
Descriptive statistics summarize data through measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation, IQR).
import numpy as np
from scipy import stats
data = [12, 15, 14, 10, 18, 20, 14, 16, 13, 15]
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True)
variance = np.var(data, ddof=1) # sample variance
std = np.std(data, ddof=1)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"Mean: {mean:.2f}")
print(f"Median: {median}")
print(f"Mode: {mode.mode[0]}")
print(f"Std: {std:.2f}")
print(f"IQR: {iqr:.2f}")
# Skewness & Kurtosis
print(f"Skew: {stats.skew(data):.2f}")
print(f"Kurtosis: {stats.kurtosis(data):.2f}")
Probability Distributions
Distributions describe how probabilities are assigned to outcomes. Key distributions: Normal (bell curve), Binomial (yes/no trials), Poisson (event counts), and Uniform.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Normal distribution
mu, sigma = 0, 1
norm_dist = stats.norm(mu, sigma)
x = np.linspace(-4, 4, 100)
pdf = norm_dist.pdf(x) # probability density
cdf = norm_dist.cdf(x) # cumulative prob
# Binomial distribution
n, p = 10, 0.5
binom_dist = stats.binom(n, p)
prob_6 = binom_dist.pmf(6) # P(X = 6)
# Poisson distribution
lam = 3
poisson_dist = stats.poisson(lam)
prob_5 = poisson_dist.pmf(5) # P(X = 5 events)
# Percentiles (z-score)
z = norm_dist.ppf(0.975) # ≈ 1.96, two-sided critical value for a 95% CI
print(f"95% CI z-score: {z:.3f}")
# Random sampling
samples = norm_dist.rvs(1000, random_state=42)
print(f"Sample mean: {samples.mean():.3f}")
Hypothesis Testing
Hypothesis testing determines whether observed effects are statistically significant. Common tests: t-test (compare means), chi-squared (categorical), ANOVA (multiple groups).
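ANOVA is listed above but not demonstrated in the snippet below; a minimal one-way ANOVA sketch with three simulated groups (group names and parameters here are illustrative assumptions), using scipy's f_oneway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# three groups; group_c has a shifted mean
group_a = rng.normal(100, 10, 30)
group_b = rng.normal(100, 10, 30)
group_c = rng.normal(110, 10, 30)

# One-way ANOVA: null hypothesis = all group means are equal
f_stat, p_val = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA F = {f_stat:.3f}, p = {p_val:.4f}")
```

A small p-value suggests at least one group mean differs; a follow-up post-hoc test (e.g. pairwise t-tests with correction) is needed to say which.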
from scipy import stats
import numpy as np
# One-sample t-test: is mean different from 100?
data = [98, 102, 97, 105, 99, 101, 96, 103]
t_stat, p_val = stats.ttest_1samp(data, 100)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")
# p < 0.05 → reject null hypothesis
# Two-sample t-test
rng = np.random.default_rng(42)  # seeded for reproducibility
group_a = rng.normal(100, 10, 30)
group_b = rng.normal(105, 10, 30)
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"Two-sample t = {t_stat:.3f}, p = {p_val:.4f}")
# Chi-squared test
observed = [[10, 20], [15, 25]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi2 = {chi2:.3f}, p = {p:.4f}")
# p-value interpretation (α = chosen significance level):
# p < 0.05 → statistically significant at α = 0.05
# p < 0.01 → significant even at the stricter α = 0.01
# p ≥ 0.05 → not significant at α = 0.05
Correlation vs Causation
Correlation measures the strength of a linear relationship between variables (Pearson's r, range -1 to 1). Causation implies one variable directly affects another. Correlation does NOT imply causation.
import numpy as np
from scipy import stats
# Pearson correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])
r, p_val = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_val:.4f}")
# Spearman rank correlation (non-parametric)
rho, p = stats.spearmanr(x, y)
print(f"Spearman rho = {rho:.3f}")
# Common pitfalls:
# - Spurious correlation: ice cream sales ← temp → drownings
# - Confounding variable: Z causes both X and Y
# - Reversed causation: does X cause Y or Y cause X?
# - Selection bias: cherry-picking data points
Correlation: two variables move together.
Causation: one variable directly causes the other.
Pitfall: confusing the two.
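A quick simulation of the confounding pitfall listed above, assuming only numpy: a hidden variable z drives both x and y, producing a strong correlation with no direct causal link between them.

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=1000)                  # confounder (e.g. temperature)
x = z + rng.normal(scale=0.5, size=1000)   # "ice cream sales"
y = z + rng.normal(scale=0.5, size=1000)   # "drownings"

# x never enters y's equation (and vice versa), yet they correlate strongly
r = np.corrcoef(x, y)[0, 1]
print(f"Correlation despite no causal link: r = {r:.2f}")
```

With these variances the theoretical correlation is 1 / 1.25 = 0.8, entirely attributable to z.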
Interview Questions
Q: What is the difference between descriptive and inferential statistics?
A: Descriptive statistics summarize data (mean, median, std). Inferential statistics draw conclusions about populations from samples using hypothesis tests, confidence intervals, and regression.
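To make the contrast concrete, a hedged sketch of one inferential step: a 95% t confidence interval for the population mean from a small sample (data values reused from the t-test example above).

```python
import numpy as np
from scipy import stats

data = [98, 102, 97, 105, 99, 101, 96, 103]
mean = np.mean(data)      # descriptive: summarizes this sample
sem = stats.sem(data)     # standard error of the mean

# inferential: interval likely to contain the *population* mean
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```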
Q: Explain p-value and significance level.
A: A p-value is the probability of observing the data (or more extreme) assuming the null hypothesis is true. A significance level α (typically 0.05) is the threshold: p < α means reject the null.
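To unpack that definition, the two-sided p-value can be recomputed by hand from the t distribution's survival function; a sketch assuming the one-sample setup used earlier in these notes.

```python
import numpy as np
from scipy import stats

data = [98, 102, 97, 105, 99, 101, 96, 103]
t_stat, p_val = stats.ttest_1samp(data, 100)

# p-value = P(|T| >= |t_stat|) under the null, where T ~ t(df = n - 1)
p_manual = 2 * stats.t.sf(abs(t_stat), df=len(data) - 1)
print(f"scipy p = {p_val:.6f}, manual p = {p_manual:.6f}")
```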
Q: What is the Central Limit Theorem?
A: CLT states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population's distribution. This justifies many parametric tests.
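A small simulation of the CLT, assuming only numpy: means of samples drawn from a strongly right-skewed exponential population still cluster around the population mean with the spread the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
# population: exponential with mean 1 and std 1 (strongly right-skewed)
samples = rng.exponential(scale=1.0, size=(10_000, 50))
means = samples.mean(axis=1)  # 10,000 sample means, each from n = 50 draws

# sampling distribution: centered at 1, spread ≈ sigma / sqrt(n) = 1 / sqrt(50)
print(f"mean of means: {means.mean():.3f}")
print(f"std of means:  {means.std():.3f} (theory: {1 / np.sqrt(50):.3f})")
```

Plotting a histogram of `means` would show the familiar bell shape despite the skewed population.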
Q: Why does correlation not imply causation?
A: Confounding variables, reversed causation, and spurious correlations can produce high correlation without a causal link. Establishing causation requires controlled experiments (RCTs) or causal inference methods.