
Apache Spark & Hive

Coming in Weeks 13-15

This module covers the concepts and techniques needed for Phase 3 of the semester project.

What You'll Learn

  • Why distributed computing? — What happens when your data outgrows a single database
  • Apache Spark — Distributed processing engine, up to 100x faster than MapReduce for in-memory workloads
  • Spark SQL — Write SQL that runs on distributed data
  • Apache Hive — SQL interface over distributed file systems
  • PySpark — Python API for Spark (connects to ML pipelines)
  • Databricks — Free cloud platform for Spark (Community Edition)

The Big Picture

Your Oracle DB                    The Hadoop Ecosystem
┌──────────────┐                  ┌──────────────────────────┐
│ Single server│                  │  Hundreds of machines    │
│ ~10K rows    │                  │  Billions of rows        │
│ Milliseconds │                  │                          │
│              │   Same SQL!      │  Spark SQL / HiveQL      │
│ SELECT ...   │ ─────────────►   │  SELECT ...              │
│ GROUP BY ... │                  │  GROUP BY ...            │
│              │                  │                          │
│ Oracle engine│                  │  Spark/Hive engine       │
│ (1 machine)  │                  │  (distributed cluster)   │
└──────────────┘                  └──────────────────────────┘
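The "same SQL, distributed engine" trick rests on partial aggregation: each machine sums its own slice of the data, and only the small per-department subtotals are merged at the end. A toy single-process sketch of that idea in plain Python (hypothetical data, not actual Spark code):

```python
from collections import defaultdict

# Toy rows of (department_name, total_cost). In a real cluster these
# partitions would live on different machines; here we just fake three.
partitions = [
    [("Cardiology", 100.0), ("Oncology", 250.0)],
    [("Cardiology", 300.0), ("Radiology", 80.0)],
    [("Oncology", 50.0)],
]

def partial_sum(rows):
    """Map side: each worker sums costs within its own partition."""
    acc = defaultdict(float)
    for dept, cost in rows:
        acc[dept] += cost
    return acc

def merge(partials):
    """Reduce side: combine the small per-partition subtotals."""
    total = defaultdict(float)
    for part in partials:
        for dept, subtotal in part.items():
            total[dept] += subtotal
    return dict(total)

revenue = merge(partial_sum(p) for p in partitions)
print(sorted(revenue.items(), key=lambda kv: -kv[1]))
# → [('Cardiology', 400.0), ('Oncology', 300.0), ('Radiology', 80.0)]
```

Only the subtotals cross the network, which is why a distributed GROUP BY scales: the bulky raw rows never leave the machine that holds them.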

Key Concepts

Same Query, Three Ways

Oracle SQL:

SELECT department_name, SUM(total_cost) as revenue
FROM fact_appointments f
JOIN dim_department d ON f.department_key = d.department_key
GROUP BY department_name
ORDER BY revenue DESC;

Spark SQL (on Databricks):

-- Looks identical! But runs on a distributed cluster
SELECT department_name, SUM(total_cost) as revenue
FROM fact_appointments f
JOIN dim_department d ON f.department_key = d.department_key
GROUP BY department_name
ORDER BY revenue DESC;

PySpark (Python API):

from pyspark.sql import functions as F

# Assumes fact_df and dept_df are already loaded as DataFrames,
# e.g. fact_df = spark.read.table("fact_appointments")
(fact_df
    .join(dept_df, "department_key")            # inner join on the surrogate key
    .groupBy("department_name")
    .agg(F.sum("total_cost").alias("revenue"))
    .orderBy(F.desc("revenue"))
    .show())

Why Spark Over Oracle at Scale?

Aspect       Oracle (Single Node)     Spark (Distributed)
Data size    GBs                      PBs (petabytes)
Processing   1 machine                Hundreds of machines in parallel
Speed        Fast for small data      Fast for massive data (in-memory)
Cost         Expensive licenses       Open source, pay for compute only
Use case     OLTP + small OLAP        Big data analytics, ML pipelines

Databricks Setup (Free)

  1. Sign up at community.cloud.databricks.com
  2. Create a notebook (Python + SQL)
  3. Upload CSV files from your Oracle star schema
  4. Create Hive tables:
CREATE TABLE fact_appointments
USING CSV
OPTIONS (path '/FileStore/tables/fact_appointments.csv', 
         header 'true', inferSchema 'true');
  5. Query with Spark SQL — same syntax as Oracle!
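The `inferSchema 'true'` option in step 4 makes Spark sample the CSV and guess each column's type instead of treating everything as a string. A toy illustration of that idea in plain Python (hypothetical sample data; not Spark's actual inference logic):

```python
def infer_type(values):
    """Guess a column type from sample string values: int, then double, then string."""
    for caster, type_name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)          # raises ValueError if any value doesn't fit
            return type_name
        except ValueError:
            continue
    return "string"

# A small sample of string cells, as they'd arrive from a CSV reader
sample = {
    "appointment_key": ["1", "2", "3"],
    "total_cost": ["19.99", "5.00", "120.50"],
    "department_name": ["Cardiology", "Oncology", "Radiology"],
}

schema = {col: infer_type(vals) for col, vals in sample.items()}
print(schema)
# → {'appointment_key': 'int', 'total_cost': 'double', 'department_name': 'string'}
```

Inference requires an extra pass over the data, which is why Spark lets you turn it off (or supply an explicit schema) for very large files.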

Career Connection

These skills are in demand

Netflix ($466K-$750K), Spotify, Uber, Airbnb — all list Spark and Hive as required or preferred skills for ML and data engineering roles. Learning them now puts you ahead.

Resources


Detailed content, hands-on tutorials, and Databricks walkthrough will be added as we cover this material in class.