
Big Data

Hadoop, Spark, MapReduce, and data pipelines at scale.

Hadoop Ecosystem

Apache Hadoop is the foundation of big data processing. It provides distributed storage (HDFS) and distributed computation (MapReduce) across clusters of commodity hardware.

HDFS: distributed file system. Stores files in blocks (default 128 MB) across nodes with replication (default 3x).

MapReduce: compute framework. Processes data in two phases, Map (filter/sort) and Reduce (aggregate/summarize).

YARN: resource manager. Manages cluster resources, schedules jobs, and allocates CPU/memory per application.

Hive: SQL on Hadoop. SQL-like interface (HiveQL) that compiles queries into MapReduce/Tez jobs.

HBase: NoSQL database. Column-oriented, real-time random read/write on top of HDFS.

Oozie: workflow scheduler. Orchestrates dependent jobs (MapReduce, Pig, Hive) as DAGs.

ZooKeeper: coordination service. Distributed configuration, synchronization, and leader election.

Sqoop: data transfer. Imports/exports data between Hadoop and relational databases.
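To make the HDFS entry concrete, here is a minimal sketch of driving the standard hdfs dfs CLI from Python. It assumes a running cluster on the machine; the /data/sales directory and sales.csv file are purely illustrative.

# HDFS basics via the hdfs dfs CLI (sketch; assumes a running cluster,
# the /data/sales directory and sales.csv are hypothetical)
import subprocess

def hdfs(*args):
    # Run an "hdfs dfs" subcommand and return its stdout
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/data/sales")           # create a directory
hdfs("-put", "sales.csv", "/data/sales/")     # upload: file is split into blocks and replicated
print(hdfs("-ls", "/data/sales"))             # list files, sizes, and replication factor
print(hdfs("-cat", "/data/sales/sales.csv"))  # read the file back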

MapReduce Concept

MapReduce splits a job into map and reduce phases. The map phase processes input key-value pairs in parallel across nodes. The shuffle step then sorts intermediate output by key and transfers it across the network so that all values for a given key reach the same reducer. The reduce phase aggregates those grouped values into the final result.

# Word Count — the Hello World of MapReduce
# Map phase: emit (word, 1) for each word
def mapper(line):
    words = line.strip().split()
    for word in words:
        print(f"{word.lower()}\t1")

# Reduce phase: sum counts for each word
def reducer(word, counts):
    total = sum(counts)
    print(f"{word}\t{total}")

# Input: "hello world hello hadoop"
# Map outputs:
#   hello  1
#   world  1
#   hello  1
#   hadoop 1
# Shuffle & sort:
#   hadoop [1]
#   hello  [1, 1]
#   world  [1]
# Reduce outputs:
#   hadoop  1
#   hello   2
#   world   1
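The shuffle step is easy to gloss over in the snippet above. The following is a minimal, single-machine simulation of the map, shuffle/sort, and reduce flow; it is purely illustrative, since a real job runs across many nodes under the framework's control.

# Local simulation of map -> shuffle/sort -> reduce (illustrative only)
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, like the mapper above
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group all values by key and sort, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    # Sum the counts for each word, like the reducer above
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["hello world hello hadoop"]
for word, total in reduce_phase(shuffle(map_phase(lines))):
    print(word, total)   # hadoop 1, hello 2, world 1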

Apache Spark

Spark is a fast, in-memory distributed computing engine that outperforms MapReduce by 10-100x for many workloads. It provides high-level APIs in Python (PySpark), Scala, and SQL.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count, sum as spark_sum

# Initialize Spark
spark = SparkSession.builder \
    .appName("DataScience") \
    .getOrCreate()

# Read data
df = spark.read.csv("s3://bucket/sales.csv", header=True, inferSchema=True)
print(f"Partitions: {df.rdd.getNumPartitions()}")
print(f"Schema: {df.printSchema()}")

# Transformations (lazy — only computed on action)
daily_sales = df.groupBy("date").agg(
    count("*").alias("order_count"),
    avg("amount").alias("avg_amount"),
    sum("amount").alias("total_revenue")
)

# Action triggers computation
daily_sales.show(10)
daily_sales.write.parquet("output/daily_sales.parquet")

# Spark SQL
df.createOrReplaceTempView("sales")
result = spark.sql("""
    SELECT
        region,
        SUM(amount) as revenue
    FROM sales
    WHERE year = 2024
    GROUP BY region
    ORDER BY revenue DESC
""")
result.show()

# MLlib — distributed ML (illustrative: cluster orders by amount)
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

assembler = VectorAssembler(inputCols=["amount"], outputCol="features")
features = assembler.transform(df.na.drop(subset=["amount"]))
model = KMeans(k=3, featuresCol="features", seed=42).fit(features)
model.transform(features).select("amount", "prediction").show(5)

Data Pipelines & ETL

ETL (Extract, Transform, Load) pipelines move data from sources to data warehouses or lakes. Modern tools like Airflow, dbt, and Spark Structured Streaming enable reliable, scalable pipeline orchestration.

Extract: pull data from APIs, databases, logs, SaaS tools, or streaming platforms (Kafka, Kinesis).

Transform: clean, validate, aggregate, join, and enrich data; apply business logic and schema changes.

Load: write to a data warehouse (Snowflake, Redshift, BigQuery), a data lake (S3, ADLS), or an analytics store.

# Apache Airflow DAG (simplified)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(**ctx):
    # Read from API / database
    pass

def transform(**ctx):
    # Clean and aggregate
    pass

def load(**ctx):
    # Write to warehouse
    pass

with DAG(
    "sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False
):
    extract_task = PythonOperator(
        task_id="extract", python_callable=extract
    )
    transform_task = PythonOperator(
        task_id="transform", python_callable=transform
    )
    load_task = PythonOperator(
        task_id="load", python_callable=load
    )

    extract_task >> transform_task >> load_task
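For streaming rather than batch ingestion, the same DataFrame API extends to Spark Structured Streaming. A minimal sketch, assuming a Kafka broker on localhost and a hypothetical "orders" topic; output and checkpoint paths are illustrative, and the Kafka source requires the spark-sql-kafka package on the classpath.

# Spark Structured Streaming sketch (broker address, topic, and paths are assumptions)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .load())

# Kafka delivers key/value as binary; cast the payload to string before parsing
events = stream.selectExpr("CAST(value AS STRING) AS raw_event")

query = (events.writeStream
    .format("parquet")
    .option("path", "output/orders")
    .option("checkpointLocation", "output/_checkpoints/orders")
    .trigger(processingTime="1 minute")
    .start())

query.awaitTermination()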

Interview Questions

Q: What is the difference between Hadoop MapReduce and Spark?

MapReduce writes intermediate data to disk between map and reduce phases, making it slower for iterative algorithms. Spark processes data in memory (RDDs/DataFrames), achieving 10-100x speedup. Spark also provides richer APIs (SQL, ML, streaming).
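One concrete consequence of the in-memory model: iterative workloads can cache a dataset once and reuse it across passes instead of re-reading from disk. A small sketch, assuming an active SparkSession named spark; the path and column are hypothetical.

# Caching keeps data in executor memory across repeated actions (illustrative)
df = spark.read.parquet("s3://bucket/sales/")     # hypothetical path
df.cache()                                        # mark for caching; first action materializes it
high_value = df.filter(df.amount > 1000).count()  # triggers the read and caches the data
low_value = df.filter(df.amount <= 1000).count()  # served from memory, no second scan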

Q: Explain HDFS architecture.

HDFS has a NameNode (master) that manages metadata and DataNodes (workers) that store data blocks. Files are split into blocks (default 128 MB) and replicated across DataNodes (default 3x) for fault tolerance.
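A quick sanity check of those defaults for a 1 GB file:

# Layout of a 1 GB file with 128 MB blocks and 3x replication
file_size_mb = 1024
block_size_mb = 128
replication = 3

num_blocks = -(-file_size_mb // block_size_mb)   # ceiling division -> 8 blocks
raw_storage_mb = file_size_mb * replication      # 3072 MB of raw capacity consumed
print(num_blocks, raw_storage_mb)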

Q: What is the difference between a data lake and a data warehouse?

A data warehouse stores structured, processed data optimized for SQL analytics (e.g., Snowflake). A data lake stores raw data in any format (structured, semi-structured, unstructured) typically on object storage (S3, ADLS).

Q: What is Spark's RDD and how does it differ from a DataFrame?

RDD (Resilient Distributed Dataset) is the low-level distributed collection with no schema. DataFrame has a schema (like a table) and uses Spark SQL's Catalyst optimizer, providing better performance and simpler APIs.
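A small side-by-side sketch of the two APIs, assuming an active SparkSession named spark; the sample records are made up.

# RDD: schema-less distributed collection, manipulated with plain functions
rdd = spark.sparkContext.parallelize([("north", 100), ("south", 250), ("north", 50)])
totals_rdd = rdd.reduceByKey(lambda a, b: a + b)   # you manage the structure yourself
print(totals_rdd.collect())

# DataFrame: named, typed columns; the Catalyst optimizer plans the query
df = spark.createDataFrame([("north", 100), ("south", 250), ("north", 50)],
                           ["region", "amount"])
df.groupBy("region").sum("amount").show()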