Pandas

Data manipulation and analysis with DataFrames.

DataFrames & Series

A Series is a labeled one-dimensional array. A DataFrame, the primary Pandas data structure, is a two-dimensional table with labeled rows and columns; each column is itself a Series. Both are built on NumPy arrays.

import pandas as pd
import numpy as np

# Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # 20

# DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [70000, 80000, 120000, 90000]
})
print(df.head())
print(df.dtypes)
print(df.describe())

Reading & Writing Data

Pandas reads from and writes to CSV, Excel, SQL databases, JSON, Parquet, and many other formats with a consistent API.

import pandas as pd

# CSV (most common)
df = pd.read_csv("data.csv")
df.to_csv("clean_data.csv", index=False)

# Excel (.xlsx needs openpyxl; legacy .xls needs xlrd)
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df.to_excel("output.xlsx", sheet_name="Results", index=False)

# SQL (needs SQLAlchemy)
from sqlalchemy import create_engine
engine = create_engine("sqlite:///database.db")
df = pd.read_sql("SELECT * FROM orders", engine)
df.to_sql("orders_clean", engine, if_exists="replace", index=False)

# JSON
df = pd.read_json("data.json")
df.to_json("output.json", orient="records")
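For files too large to load at once, `read_csv` also accepts a `chunksize` parameter and returns an iterator of DataFrames. A minimal sketch (using an in-memory buffer in place of a real file, with made-up data):

```python
import io
import pandas as pd

# Stand-in for a large file on disk; a real path works the same way
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n"

# chunksize=2 yields DataFrames of 2 rows each instead of one big frame,
# so arbitrarily large files can be aggregated piece by piece
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["value"].sum()

print(total)  # 100
```

The same pattern works for filtering or appending each chunk to a database table.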

Data Cleaning & Missing Values

Real-world data is messy. Pandas provides tools to detect and handle missing values, remove duplicates, fix data types, and transform columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [np.nan, 2, 3, np.nan, 5],
    "C": ["x", "y", "y", "x", "y"]
})

# Detect missing
print(df.isnull().sum())

# Drop missing
df_dropped = df.dropna()

# Fill missing
df_filled = df.fillna({
    "A": df["A"].mean(),
    "B": 0
})

# Forward fill (fillna(method="ffill") is deprecated; use .ffill())
df_ffill = df.ffill()

# Duplicates: show all rows sharing a value in "C", then drop exact duplicates
print(df[df.duplicated(subset=["C"], keep=False)])
df_clean = df.drop_duplicates()

# Type conversion: plain int32 cannot hold NaN, so fill first
# (or use the nullable "Int32" dtype to keep missing values)
df["A"] = df["A"].fillna(0).astype("int32")

GroupBy, Merge & Join

GroupBy enables split-apply-combine operations for aggregation. `merge` and `join` combine multiple DataFrames on keys, similar to SQL JOINs.

import pandas as pd

# GroupBy
df = pd.DataFrame({
    "department": ["Sales", "Eng", "Sales", "Eng", "HR"],
    "salary": [70000, 95000, 80000, 110000, 65000]
})
grouped = df.groupby("department")["salary"]
print(grouped.mean())
print(grouped.agg(["mean", "std", "count"]))

# Merge (SQL-style joins)
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101]
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Charlie"]
})
merged = pd.merge(orders, customers, on="customer_id", how="inner")
print(merged)

# Concat
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})
combined = pd.concat([df1, df2], ignore_index=True)

Interview Questions

Q: What is the difference between a Series and a DataFrame?

A Series is a 1-D labeled array (like a single column). A DataFrame is a 2-D labeled data structure with rows and columns, essentially a collection of Series sharing the same index.
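The column-of-Series relationship can be verified directly; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Selecting a single column returns a Series, not a DataFrame
col = df["age"]
print(type(col).__name__)  # Series

# That Series shares the parent DataFrame's index
print(col.index.equals(df.index))  # True
```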

Q: How do you handle missing values in Pandas?

Use `df.isnull()` to detect, `df.dropna()` to remove, and `df.fillna(value)` to impute. Strategy depends on context: drop rows with too many missing, fill with mean/median for numeric, or forward-fill for time series.

Q: Explain the difference between merge, join, and concat.

`merge` combines DataFrames on columns (SQL-style). `join` combines on indices. `concat` stacks DataFrames along rows or columns without requiring a key.
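The distinction can be shown side by side on two tiny frames that share an index (assumed data for illustration):

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])

# join: aligns on the index by default
joined = left.join(right)

# merge: aligns on columns, so move the index into a column first
merged = pd.merge(left.reset_index(), right.reset_index(), on="index")

# concat: stacks along an axis with no key at all (axis=1 = side by side)
stacked = pd.concat([left, right], axis=1)

print(joined)
print(merged)
print(stacked)
```

Here `joined` and `stacked` happen to coincide because the indices align exactly; with mismatched keys their behavior diverges.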

Q: What is the GroupBy operation and how does it work?

GroupBy follows split-apply-combine: split data into groups based on a key, apply a function (aggregation, transformation, or filtering) to each group, and combine the results into a new DataFrame.
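The three "apply" flavors named above (aggregation, transformation, filtering) can be sketched on toy data; the aggregation case appears earlier, so this shows `transform` and `filter`:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Eng", "Sales", "Eng"],
    "salary": [70000, 95000, 80000, 110000]
})

# transform: returns one value per original row (each row gets its group's mean)
df["dept_mean"] = df.groupby("department")["salary"].transform("mean")

# filter: keeps only the groups that satisfy a condition, dropping the rest
big = df.groupby("department").filter(lambda g: g["salary"].mean() > 80000)

print(df)
print(big)
```

`transform` preserves the original shape, which makes it useful for group-relative features; `filter` returns a subset of the original rows.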