SQL for Data Science

Query, transform, and analyze data with SQL.

SELECT, WHERE & GROUP BY

SELECT retrieves data from tables. WHERE filters rows before aggregation. GROUP BY groups rows sharing a value so aggregate functions (COUNT, SUM, AVG) can be applied per group. HAVING filters groups after aggregation.

-- Basic SELECT
SELECT name, salary
FROM employees
WHERE department = 'Engineering'
ORDER BY salary DESC
LIMIT 10;

-- Aggregation with GROUP BY
SELECT
    department,
    COUNT(*) AS employee_count,
    ROUND(AVG(salary), 2) AS avg_salary,
    MAX(salary) AS max_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;

-- DISTINCT values
SELECT DISTINCT department FROM employees;

-- Date functions
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY month
ORDER BY month;

JOINs

JOINs combine rows from multiple tables based on a related column. INNER JOIN keeps matching rows only. LEFT/RIGHT JOIN keeps all rows from one side. FULL OUTER JOIN keeps all rows from both sides.

-- INNER JOIN: only matching customers
SELECT
    o.order_id,
    o.amount,
    c.name,
    c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

-- LEFT JOIN: all customers, even without orders
SELECT
    c.name,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;

-- Self JOIN: employees and their managers
SELECT
    e.name AS employee,
    m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- Multiple JOINs
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id;

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them into a single output row. ROW_NUMBER, RANK, LAG, LEAD, and running totals are common use cases.

-- ROW_NUMBER: rank employees by salary per department
SELECT
    name,
    department,
    salary,
    ROW_NUMBER() OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS salary_rank
FROM employees;

-- Running total of sales
SELECT
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- LAG: compare current to previous month
SELECT
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS growth
FROM monthly_revenue;

-- RANK vs DENSE_RANK
SELECT
    score,
    RANK() OVER (ORDER BY score DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM exam_scores;

Subqueries & CTEs

Subqueries are queries nested inside another query. CTEs (WITH clause) name a subquery for readability and allow recursion. Both make complex queries cleaner and more maintainable.

-- Subquery in WHERE
SELECT name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary) FROM employees
);

-- Subquery in SELECT
SELECT
    name,
    salary,
    (SELECT AVG(salary) FROM employees) AS company_avg,
    salary - (SELECT AVG(salary) FROM employees) AS diff
FROM employees;

-- CTE (Common Table Expression)
WITH dept_avg AS (
    SELECT
        department,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT
    e.name,
    e.salary,
    d.avg_salary,
    e.salary - d.avg_salary AS above_avg
FROM employees e
JOIN dept_avg d ON e.department = d.department;

-- Recursive CTE (hierarchy)
WITH RECURSIVE org_tree AS (
    SELECT id, name, manager_id, 1 AS level
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id, t.level + 1
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.id
)
SELECT * FROM org_tree;

Database Design Basics

Normalization

Eliminate redundancy (1NF, 2NF, 3NF). Split data into related tables to avoid update anomalies.

Indexes

Speed up queries by creating B-tree or hash indexes on frequently filtered columns. Trade-off: slower writes.

Primary & Foreign Keys

PK uniquely identifies a row. FK references a PK in another table, enforcing referential integrity.

ER Diagrams

Entity-Relationship diagrams model the logical structure: entities (tables), attributes (columns), and relationships (joins).

ACID Transactions

Atomicity, Consistency, Isolation, Durability — guarantees for reliable data operations.

Denormalization

Intentionally adding redundancy for read performance (common in data warehouses and analytics).

Interview Questions

Q: What is the difference between WHERE and HAVING?

WHERE filters rows before aggregation. HAVING filters groups after aggregation (used with GROUP BY). You can use WHERE on individual rows but not on aggregate results like COUNT(*).

Q: Explain INNER JOIN vs LEFT JOIN with an example.

INNER JOIN returns only rows with matching keys in both tables. LEFT JOIN returns all rows from the left table and matching rows from the right, with NULLs where no match exists.

Q: What is a window function and give an example?

A window function performs a calculation across a set of rows related to the current row without collapsing them. Example: `ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)` ranks employees within each department.

Q: What is the difference between a CTE and a subquery?

Both are temporary result sets. CTEs (WITH clause) are more readable, can be referenced multiple times in the same query, and support recursion. Subqueries are typically inlined in WHERE/SELECT clauses.