Python for Data Science
Core Python concepts every data scientist must know.
Data Types & Structures
Python provides built-in data types that form the foundation of data manipulation. Lists hold ordered, mutable sequences; dicts map keys to values for fast lookups; and sets store unique elements with average-case O(1) membership tests.
# Lists, Dicts, and Sets
numbers = [1, 2, 3, 4, 5]
person = {"name": "Alice", "age": 30}
unique = {1, 2, 3, 3, 2} # {1, 2, 3}
# List operations
numbers.append(6)
squared = [x**2 for x in numbers if x > 2]
print(squared) # [9, 16, 25, 36]
# Dict comprehension
squares = {x: x**2 for x in range(5)}
print(squares) # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
# Set operations
A, B = {1, 2, 3}, {3, 4, 5}
print(A & B) # {3} intersection
print(A | B) # {1, 2, 3, 4, 5} union
List Comprehensions & Generators
List comprehensions provide concise syntax for transforming sequences. Generators use lazy evaluation: they yield items one at a time, which makes them memory-efficient for large datasets.
# List comprehension vs loop
squares = [x**2 for x in range(10)]
# Same as: squares = []
# for x in range(10): squares.append(x**2)
# Generator expression (lazy)
gen = (x**2 for x in range(10**6))
print(next(gen)) # 0 — no memory blowup
# Generator function
def fibonacci(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b
fib = list(fibonacci(100))
print(fib) # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
NumPy & Pandas Overview
NumPy delivers fast array operations and linear algebra. Pandas adds DataFrame objects for tabular data analysis. Together they form the backbone of the Python data science stack.
import numpy as np
import pandas as pd
# NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # (2, 3)
print(arr.mean()) # 3.5
print(arr.sum(axis=0)) # [5 7 9]
# Pandas DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78]
})
print(df.describe())
print(df[df["score"] > 80])
Data Science Libraries
NumPy: Numerical computing, arrays, linear algebra
Pandas: DataFrames, data cleaning, transformation
Matplotlib: Static, animated, interactive visualizations
Seaborn: Statistical data visualization (built on Matplotlib)
Scikit-learn: ML algorithms: regression, classification, clustering
TensorFlow / PyTorch: Deep learning frameworks for neural networks
SciPy: Scientific computing, optimization, signal processing
NLTK / spaCy: Natural language processing toolkits
Interview Questions
Q: What is the difference between a list and a tuple?
Lists are mutable (can be changed after creation), while tuples are immutable. Lists are written with square brackets [], tuples with parentheses (). A tuple is hashable, and so usable as a dictionary key, as long as every element it contains is hashable.
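The answer above can be checked with a short sketch. The variable names and the coordinate data are made up for illustration.

```python
# Lists can be modified in place; tuples cannot.
scores = [85, 92, 78]
scores[0] = 90           # allowed: lists are mutable

point = (3, 4)
try:
    point[0] = 10        # tuples do not support item assignment
except TypeError as exc:
    tuple_error = str(exc)

# A tuple of hashable elements can serve as a dictionary key.
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[point])  # 5.0

# A list cannot be a key, and neither can a tuple that contains a list.
try:
    bad = {[3, 4]: 5.0}
except TypeError:
    list_is_hashable = False
else:
    list_is_hashable = True

try:
    hash((1, [2, 3]))
except TypeError:
    nested_is_hashable = False
else:
    nested_is_hashable = True
```

Both flags end up False, which is why only tuples built from hashable elements are safe to use as dictionary keys.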
Q: Explain list comprehension with an example.
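A list comprehension is a concise way to build a list: `[x**2 for x in range(10) if x % 2 == 0]` produces the squares of the even numbers in 0–9. It is usually faster than an equivalent for-loop that calls `append`, mainly because it avoids looking up and calling `append` on every iteration.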
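As a quick check, the comprehension from the answer can be placed next to the explicit loop it replaces. The names `even_squares` and `loop_result` are illustrative only.

```python
# Comprehension with a filter: squares of the even numbers in 0..9.
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Equivalent explicit loop, shown for comparison.
loop_result = []
for x in range(10):
    if x % 2 == 0:
        loop_result.append(x**2)

print(even_squares)                 # [0, 4, 16, 36, 64]
print(even_squares == loop_result)  # True
```

To compare speed on a particular machine, both versions can be timed with the standard-library `timeit` module; the exact ratio depends on the Python version and the work done per element.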
Q: What is a generator and when would you use one?
A generator yields items lazily using the `yield` keyword, producing one value at a time without storing the entire sequence in memory. Use generators for large datasets or infinite sequences to avoid memory blowup.
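The sketch below illustrates both uses named in the answer: saving memory and representing an infinite sequence. The helper `naturals` is hypothetical, and `sys.getsizeof` reports only the size of the container object itself, so it should be read as a rough indication rather than a full memory measurement.

```python
import sys
from itertools import islice

# A list materializes every element; a generator object stays small
# because it only keeps the state needed to produce the next value.
as_list = [x**2 for x in range(100_000)]
as_gen = (x**2 for x in range(100_000))
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))  # True

def naturals():
    """Yield 0, 1, 2, ... without end (hypothetical helper)."""
    n = 0
    while True:
        yield n
        n += 1

# islice takes a finite prefix of the infinite sequence.
first_five = list(islice(naturals(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```

Calling `list(naturals())` directly would never finish, so an infinite generator always needs a consumer that stops, such as `islice` or a loop with a `break`.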
Q: How do NumPy arrays differ from Python lists?
NumPy arrays are homogeneous (all elements same type), stored contiguously in memory, and support vectorized operations. They are significantly faster for numerical operations than Python lists, which store pointers to objects.
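A small sketch of two of the differences described above, vectorized arithmetic and a single element type. It assumes NumPy is installed, and the sample values are arbitrary.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# Vectorized: the multiplication is applied to every element of the array.
doubled_arr = arr * 2

# The list equivalent needs an explicit loop or comprehension;
# values * 2 repeats the list instead of doubling its elements.
doubled_list = [v * 2 for v in values]
repeated = values * 2

print(doubled_arr.tolist())  # [2, 4, 6, 8]
print(repeated)              # [1, 2, 3, 4, 1, 2, 3, 4]

# Homogeneous: mixing ints and a float upcasts the whole array to float.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)           # float64
```

The `*` operator meaning two different things for lists and arrays is a common source of bugs when code switches between the two types.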