Python for Data Science

Core Python concepts every data scientist must know.

Data Types & Structures

Python's built-in data types are the foundation of data manipulation. Lists hold ordered, mutable sequences; dicts map keys to values with average O(1) lookups; sets store unique hashable elements and also offer average O(1) membership tests.

# Lists, Dicts, and Sets
numbers = [1, 2, 3, 4, 5]
person = {"name": "Alice", "age": 30}
unique = {1, 2, 3, 3, 2}  # {1, 2, 3}

# List operations
numbers.append(6)
squared = [x**2 for x in numbers if x > 2]
print(squared)  # [9, 16, 25, 36]

# Dict comprehension
squares = {x: x**2 for x in range(5)}
print(squares)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

# Set operations
A, B = {1, 2, 3}, {3, 4, 5}
print(A & B)  # {3} intersection
print(A | B)  # {1, 2, 3, 4, 5} union
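
As a small illustration of how these three types work together, the sketch below builds a frequency table with a dict and removes duplicates with a set. The `words` list and the variable names are invented for the example and do not come from a real dataset.

```python
# Hypothetical sample data, made up for this sketch
words = ["spam", "eggs", "spam", "ham", "eggs", "spam"]

# Dict as a frequency table: get() returns 0 for keys not seen yet
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
print(counts)  # {'spam': 3, 'eggs': 2, 'ham': 1}

# A set drops duplicates; sorted() gives a predictable order for display
distinct = sorted(set(words))
print(distinct)  # ['eggs', 'ham', 'spam']

# Membership test on a set uses hashing instead of scanning every item
print("ham" in set(words))  # True
```

The standard library's `collections.Counter` does the same counting in one call; the manual loop is shown only to make the dict mechanics visible.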

List Comprehensions & Generators

List comprehensions provide concise syntax for transforming sequences. Generators use lazy evaluation: they yield items one at a time, which keeps memory use low for large datasets.

# List comprehension vs loop
squares = [x**2 for x in range(10)]
# Same as: squares = []
#          for x in range(10): squares.append(x**2)

# Generator expression (lazy)
gen = (x**2 for x in range(10**6))
print(next(gen))  # 0 (only the first value has been computed)

# Generator function
def fibonacci(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

fib = list(fibonacci(100))
print(fib)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
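
One common use of generators with large inputs is processing data in fixed-size batches. The following is a minimal sketch of that idea, assuming a helper named `batches` (a hypothetical name, not a standard-library function) built on `itertools.islice`.

```python
from itertools import islice

def batches(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # source exhausted
        yield batch

# Lazy source: values are produced only when a batch asks for them
readings = (x * 0.5 for x in range(7))
for batch in batches(readings, 3):
    print(batch)
# [0.0, 0.5, 1.0]
# [1.5, 2.0, 2.5]
# [3.0]
```

Only one batch is held in memory at a time, so the same pattern works when the source is a large file or a database cursor.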

NumPy & Pandas Overview

NumPy delivers fast array operations and linear algebra. Pandas adds DataFrame objects for tabular data analysis. Together they form the backbone of the Python data science stack.

import numpy as np
import pandas as pd

# NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)     # (2, 3)
print(arr.mean())    # 3.5
print(arr.sum(axis=0))  # [5 7 9] (column sums)

# Pandas DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78]
})
print(df.describe())
print(df[df["score"] > 80])
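
To show what "fast array operations" and "tabular data analysis" look like in practice, here is a short sketch of vectorized arithmetic, boolean masking, and a group-by aggregation. The `prices` array and the `team` column are invented for illustration.

```python
import numpy as np
import pandas as pd

# Vectorized arithmetic: the multiplication applies to every element
prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.9
print(discounted)

# Boolean mask: keep only the elements that satisfy a condition
print(prices[prices > 15])  # [20. 30.]

# Hypothetical scores table, aggregated per team
df = pd.DataFrame({
    "team": ["a", "b", "a", "b"],
    "score": [85, 92, 78, 88],
})
team_means = df.groupby("team")["score"].mean()
print(team_means["a"])  # 81.5
print(team_means["b"])  # 90.0
```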

Data Science Libraries

NumPy: Numerical computing, arrays, linear algebra
Pandas: DataFrames, data cleaning, transformation
Matplotlib: Static, animated, interactive visualizations
Seaborn: Statistical data visualization (built on Matplotlib)
Scikit-learn: ML algorithms: regression, classification, clustering
TensorFlow / PyTorch: Deep learning frameworks for neural networks
SciPy: Scientific computing, optimization, signal processing
NLTK / spaCy: Natural language processing toolkit

Interview Questions

Q: What is the difference between a list and a tuple?

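Lists are mutable (can be changed after creation) while tuples are immutable. Lists are written with square brackets [], tuples with parentheses (). A tuple is hashable, and therefore usable as a dictionary key or set element, as long as every item it contains is hashable; a list never is.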
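
A short sketch of these differences, using made-up `point` and `coords` values:

```python
point = (3, 4)
coords = [3, 4]

coords.append(5)       # lists can grow in place
try:
    point[0] = 10      # tuples reject item assignment
except TypeError as exc:
    print("tuple:", exc)

# A tuple of hashable items works as a dict key
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[point])  # 5.0

# A list, or a tuple that contains a list, is not hashable
for candidate in ([3, 4], (3, [4])):
    try:
        hash(candidate)
    except TypeError:
        print("unhashable:", candidate)
```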

Q: Explain list comprehension with an example.

A list comprehension is a concise way to build a list: `[x**2 for x in range(10) if x % 2 == 0]` produces the squares of the even numbers below 10, `[0, 4, 16, 36, 64]`. It is usually somewhat faster than the equivalent for-loop because it avoids a repeated `append` method call.
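
The speed claim is easy to check on your own machine. The sketch below compares the two forms with `timeit`; the function names are hypothetical, and the timings depend on hardware and Python version, so no expected numbers are shown.

```python
import timeit

def with_loop(n):
    result = []
    for x in range(n):
        if x % 2 == 0:
            result.append(x**2)
    return result

def with_comprehension(n):
    return [x**2 for x in range(n) if x % 2 == 0]

print(with_comprehension(10))  # [0, 4, 16, 36, 64]

# Run each version 200 times on 1000 inputs and report total seconds
loop_time = timeit.timeit(lambda: with_loop(1000), number=200)
comp_time = timeit.timeit(lambda: with_comprehension(1000), number=200)
print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```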

Q: What is a generator and when would you use one?

A generator yields items lazily using the `yield` keyword, producing one value at a time without storing the entire sequence in memory. Use generators for large datasets or infinite sequences to avoid memory blowup.
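
As a minimal sketch of both use cases, the example below defines an endless generator (the name `naturals` is made up here) and then aggregates a million values without building a list.

```python
from itertools import islice

def naturals(start=0):
    """Yield start, start + 1, ... without ever finishing."""
    n = start
    while True:
        yield n
        n += 1

# islice takes a finite slice from the infinite stream
first_five = list(islice(naturals(), 5))
print(first_five)  # [0, 1, 2, 3, 4]

# sum() pulls one item at a time from the generator expression
total = sum(x * x for x in range(1_000_000))
print(total)
```

Calling `list(naturals())` directly would never return, which is why an infinite generator is always paired with something that stops early, such as `islice` or a `break`.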

Q: How do NumPy arrays differ from Python lists?

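NumPy arrays are homogeneous (every element has the same dtype), are typically stored in one contiguous block of memory, and support vectorized operations that run in compiled code. Python lists instead store pointers to separate objects and are processed one element at a time by the interpreter, so arrays are usually much faster for numerical work on large data.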
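
A brief sketch of two of these differences, element-wise arithmetic and a single shared dtype, using small made-up values:

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

print(values * 2)  # [1, 2, 3, 1, 2, 3]  (list repetition)
print(arr * 2)     # [2 4 6]             (element-wise multiply)

# Mixed numeric inputs are coerced to one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# A list keeps each element's own type
print([type(v).__name__ for v in [1, 2.5, "x"]])  # ['int', 'float', 'str']
```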