Week 10: Data Warehousing & Star Schema Design¶

Reading due this week: Kimball, The Data Warehouse Toolkit — Chapter 1 (pp. 1-35)
Reading for next week: Kimball Ch 2 (pp. 37-68) + Ch 3 (pp. 69-100)

Part 1: Why Data Warehouses? (20 min)¶

The Problem¶

You've spent 9 weeks building transactional databases with PL/SQL. These OLTP systems are great at their job — processing one transaction at a time, fast.

But try answering this question from your CEO:

"Show me revenue by department, by quarter, for the last 3 years — and compare it to the same quarters last year."

In your normalized 3NF database, this requires:

5+ JOINs across normalized tables
Aggregations across millions of rows
The query competes with real-time transactions for resources
It takes minutes (or hours) on large datasets

This is the wrong tool for the job. You need a system designed for analysis, not transactions.

Two Worlds of Data (Kimball Ch 1, pp. 2-3)¶

"The operational systems are where you put the data in, and the DW/BI system is where you get the data out." — Kimball

	OLTP (Operational)	OLAP (Data Warehouse)
Purpose	Run the business day-to-day	Analyze the business for decisions
Users	Clerks, apps, front-line workers	Managers, analysts, executives
Operations	INSERT/UPDATE/DELETE one row	SELECT + GROUP BY over millions
Schema	Normalized (3NF)	Denormalized (Star Schema)
Data	Current state — overwrites old data	Historical snapshots — never deletes
Optimization	Fast writes, data integrity	Fast reads, query performance
Volume	GBs (current data)	TBs to PBs (years of history)

(This maps to Slide 2 of Chapter 9 textbook slides and Kimball Ch 1 "Different Worlds")

Data Characteristics: Transient vs. Periodic¶

(From Chapter 9 slides, Slides 13-16)

Transient data (OLTP): Changes are written over previous values. When a customer updates their address, the old address is gone.
Periodic data (Warehouse): Data is never overwritten. Every state is preserved as a historical snapshot. You can always go back and see what the data looked like at any point in time.

Real-World Architecture¶

(Expanded from Chapter 9 Slide 6 — 3-Layer Architecture)

         OLTP Source Systems                    Data Warehouse
    ┌──────────────────────┐               ┌─────────────────────┐
    │  Hospital EHR        │──┐            │                     │
    └──────────────────────┘  │   ETL      │   Star Schema       │
    ┌──────────────────────┐  ├───────────►│                     │
    │  Billing System      │──┤  Extract   │  ┌──────────┐       │
    └──────────────────────┘  │  Transform │  │  FACT    │       │──► Reports
    ┌──────────────────────┐  │  Load      │  │  table   │       │──► Dashboards
    │  Scheduling App      │──┘            │  └────┬─────┘       │──► ML Models
    └──────────────────────┘               │  ┌────┼────┐        │
                                           │  DIM  DIM  DIM      │
                                           └─────────────────────┘

Netflix processes petabytes of viewing data in their warehouse for recommendations.
Walmart loads 1+ million transactions/hour into their warehouse.
Amazon uses warehousing to analyze billions of purchase events for pricing and logistics.

Warehouse vs. Data Mart¶

(From Chapter 9 Slides 8-10)

	Data Warehouse	Data Mart
Scope	Enterprise-wide	One department or subject
Data	All subjects, many sources	One subject, few sources
Size	Large	Starts small
Lifecycle	Long-lived	Shorter-lived
Planning	Centralized, planned	Decentralized, organic

A data mart is essentially a mini-warehouse focused on one business area (e.g., just sales, just HR).

Part 2: Star Schema Design (25 min)¶

Kimball's 4-Step Design Process¶

(Kimball Ch 2, pp. 38-40 — the foundation of every dimensional model)

Step	Question to Answer	Example (Hospital)
1. Select the business process	What activity generates measurable events?	Patient appointments
2. Declare the grain	What does ONE row represent?	One appointment
3. Identify the dimensions	Who, what, where, when, how?	Patient, Doctor, Date, Department
4. Identify the facts	What numbers do we measure?	Cost, wait time, prescription count

⚠️ Kimball's #1 warning: The most common mistake is skipping Step 2 (grain). If the grain isn't defined, the entire design rests on quicksand.

Star Schema Anatomy¶

(From Chapter 9 Slide 19-20 and Kimball Ch 1, pp. 16-17)

                        ┌─────────────────┐
                        │   dim_date      │
                        │─────────────────│
                        │ date_key     PK │
                        │ full_date       │
                        │ day_of_week     │
                        │ month, quarter  │
                        │ year            │
                        │ is_holiday      │
                        └────────┬────────┘
                                 │
┌─────────────────┐   ┌─────────┴──────────────┐   ┌─────────────────┐
│  dim_patient    │   │   fact_appointments    │   │ dim_department  │
│─────────────────│   │────────────────────────│   │─────────────────│
│ patient_key  PK ├───┤ appointment_key    PK  ├───┤ dept_key     PK │
│ name            │   │ patient_key        FK  │   │ dept_name       │
│ age_group       │   │ doctor_key         FK  │   │ building        │
│ gender          │   │ date_key           FK  │   │ floor           │
│ insurance_type  │   │ dept_key           FK  │   └─────────────────┘
│ zip_code        │   │                        │
└─────────────────┘   │ ── Measures ──         │   ┌─────────────────┐
                      │ visit_count            │   │  dim_doctor     │
                      │ total_cost             │   │─────────────────│
                      │ wait_time_min          │───┤ doctor_key   PK │
                      │ prescription_count     │   │ doctor_name     │
                      └────────────────────────┘   │ specialty       │
                                                   │ hire_year       │
                                                   └─────────────────┘

It's called a "star" because the fact table sits in the center with dimensions radiating outward.

Fact Table Rules¶

(Kimball Ch 2, pp. 41-45)

Rule	Details
One row = one event	Each row represents a real-world measurement
Foreign keys to dimensions	One FK per dimension table
Numeric measures only	Numbers you want to SUM, AVG, COUNT
Uniform grain	Never mix different grains in the same table
No descriptive text	Text belongs in dimensions, not facts

Three types of facts:

Type	Can SUM across...	Example
Additive	All dimensions	Revenue, quantity, cost
Semi-additive	Some (not time)	Account balance
Non-additive	None — compute from parts	Unit price, ratios, margins

💡 Kimball's rule: "Store the ratio's components, not the ratio. Calculate ratio of the sums, not sum of the ratios."

Example: You sold 1 widget at $1.00 and 4 widgets at $0.50 each. - ✅ Correct average price: Total revenue ($3.00) ÷ Total quantity (5) = $0.60 - ❌ Wrong: Average of unit prices ($1.00 + $0.50) ÷ 2 = $0.75

Dimension Table Rules¶

(Kimball Ch 2, pp. 46-50)

Rule	Details
Surrogate keys	Use simple integers (1, 2, 3...) — NOT business keys
Wide and flat	Many descriptive columns
Denormalized	Flatten hierarchies (product + brand + category in one table)
Text-heavy	Use "Cardiology", not "DEPT-04"
Relatively small	Thousands to low millions of rows

Why surrogate keys? (Chapter 9 Slide 22) - Business keys change (employee rehired, product renumbered) - Multiple source systems have conflicting keys - Need multiple rows per entity for change tracking (SCD)

The Date Dimension: Everyone Needs One¶

(Kimball Ch 3, pp. 79-83)

Pre-populate with 10-20 years of days (~7,300 rows). Every star schema has one.

CREATE TABLE dim_date (
    date_key       NUMBER PRIMARY KEY,    -- YYYYMMDD (e.g., 20260318)
    full_date      DATE NOT NULL,
    day_of_week    VARCHAR2(10),           -- 'Monday', 'Tuesday', ...
    day_of_month   NUMBER,
    month_num      NUMBER,
    month_name     VARCHAR2(10),
    quarter        NUMBER,                 -- 1, 2, 3, 4
    year           NUMBER,
    is_weekend     CHAR(1),               -- 'Y' or 'N'
    is_holiday     CHAR(1)
);

Why not just use a DATE column in the fact table?

Because you can't GROUP BY is_holiday or WHERE fiscal_quarter = 'FQ3' with a raw date. The dimension pre-computes these attributes for fast, flexible analysis.

Estimating Fact Table Size¶

(Chapter 9 Slide 26 & Kimball Ch 3, p. 79)

Total rows = unique values per dimension₁ × dimension₂ × ... × fill rate

Example (Hospital):
  500 doctors × 10,000 patients × 365 days × 20% fill rate
  = 365,000,000 rows/year

Size in bytes = rows × columns × avg bytes/column
  = 365M × 8 columns × 15 bytes = ~44 GB/year

This is why warehouses get BIG — and why we'll need Spark in a few weeks.

Part 3: Star vs Snowflake (10 min)¶

(Chapter 9 Slides 31-33 & Kimball Ch 2, p. 50)

Snowflaking = normalizing dimension tables by breaking them into sub-tables.

STAR (recommended):                SNOWFLAKE:
┌──────────────────┐               ┌──────────────────┐
│   dim_product    │               │   dim_product    │
│──────────────────│               │──────────────────│
│ product_key   PK │               │ product_key   PK │
│ product_name     │               │ product_name     │
│ brand_name       │ ←flattened    │ brand_key      FK│──→ dim_brand
│ category_name    │ ←flattened    │ category_key   FK│──→ dim_category
│ department_name  │ ←flattened    └──────────────────┘
└──────────────────┘                        ↓ More JOINs!

	Star	Snowflake
Query speed	Faster ✅	Slower (more JOINs)
Ease of use	Simpler ✅	More complex
Storage	More (redundancy)	Less
Kimball says	Use this	Use sparingly

💡 The storage savings from snowflaking are negligible — dimension tables are tiny compared to the massive fact table. Don't sacrifice query performance for trivial storage savings.

Part 4: Slowly Changing Dimensions (10 min)¶

(Chapter 9 Slide 35 & Kimball Ch 2, pp. 53-56)

What happens when a doctor changes departments? A customer moves to a new state?

SCD Type	Strategy	History?	Example
Type 1	Overwrite old value	❌ Lost	Update Dr. Park's dept to Neurology
Type 2	Add new row (new surrogate key)	✅ Full	New row for Dr. Park in Neurology; old row preserved
Type 3	Add "previous_value" column	Partial	`current_dept` = Neurology, `previous_dept` = Cardiology

Type 2 is the most common — it preserves full history:

doctor_key	doctor_id	name	department	effective_date	expiration_date	is_current
101	D-5432	Dr. Park	Cardiology	2020-01-15	2025-06-30	N
247	D-5432	Dr. Park	Neurology	2025-07-01	9999-12-31	Y

Historical appointments link to key 101 (Cardiology era). New appointments link to key 247 (Neurology era). History is preserved!

Part 5: ETL — Extract, Transform, Load (15 min)¶

(Chapter 9 Slides 42-50 & Kimball Ch 1, pp. 19-21)

The ETL Pipeline¶

  EXTRACT                TRANSFORM                    LOAD
┌──────────────┐    ┌───────────────────┐     ┌────────────────┐
│ Read from    │    │ Clean & reshape   │     │ Insert into    │
│ OLTP tables  │───►│ • Deduplicate     │────►│ star schema    │
│ (SELECT)     │    │ • Standardize     │     │ • Dims FIRST   │
└──────────────┘    │ • Derive fields   │     │ • Facts SECOND │
                    │ • Handle NULLs    │     │ • Index        │
                    │ • Map surr. keys  │     └────────────────┘
                    └───────────────────┘

Data Quality Issues¶

(Chapter 9 Slides 46-50 — Transformation Functions)

Problem	Example	Solution
Inconsistent keys	Same customer, different IDs across systems	Surrogate key mapping
Synonyms	"CA" vs "California" vs "Calif."	Standardize during Transform
Missing data	NULL insurance type	Default "Unknown" dimension row
Duplicate records	Same appointment entered twice	Deduplication logic
Format mismatches	"03/18/2026" vs "2026-03-18" vs "18-MAR-26"	Canonical format conversion

ETL in PL/SQL (Your Phase 2!)¶

For this course, ETL = PL/SQL procedures you already know how to write:

-- Load a dimension table
CREATE OR REPLACE PROCEDURE load_dim_doctor AS
BEGIN
    DELETE FROM dim_doctor;

    INSERT INTO dim_doctor (doctor_key, doctor_name, specialty, department, hire_year)
    SELECT d.doctor_id,
           d.name,
           s.spec_name,        -- JOIN to denormalize
           dept.dept_name,     -- JOIN to denormalize
           EXTRACT(YEAR FROM d.hire_date)
    FROM Doctors d
    JOIN Specialties s ON d.specialty_id = s.spec_id
    JOIN Departments dept ON d.department_id = dept.dept_id;

    COMMIT;
    DBMS_OUTPUT.PUT_LINE('Loaded ' || SQL%ROWCOUNT || ' rows into dim_doctor');
END;
/

-- Load the fact table (AFTER dimensions)
CREATE OR REPLACE PROCEDURE load_fact_appointments AS
BEGIN
    DELETE FROM fact_appointments;

    INSERT INTO fact_appointments (
        appointment_key, patient_key, doctor_key, date_key, dept_key,
        visit_count, total_cost
    )
    SELECT a.appt_id,
           a.patient_id,
           a.doctor_id,
           TO_NUMBER(TO_CHAR(a.appt_date, 'YYYYMMDD')),  -- date surrogate key
           d.department_id,
           1,                    -- each row = 1 visit
           a.cost
    FROM Appointments a
    JOIN Doctors d ON a.doctor_id = d.doctor_id
    WHERE a.status = 'COMPLETED';

    COMMIT;
    DBMS_OUTPUT.PUT_LINE('Loaded ' || SQL%ROWCOUNT || ' facts');
END;
/

-- Master ETL: run everything in order
CREATE OR REPLACE PROCEDURE run_full_etl AS
BEGIN
    DBMS_OUTPUT.PUT_LINE('=== Starting ETL ===');
    load_dim_date;           -- dimensions first!
    load_dim_patient;
    load_dim_doctor;
    load_dim_department;
    load_fact_appointments;  -- facts last (need dimension FKs)
    DBMS_OUTPUT.PUT_LINE('=== ETL Complete ===');
END;
/

Key principle: Always load dimensions FIRST, then facts.

Part 6: Analytical SQL Preview (10 min)¶

Once the warehouse is loaded, you unlock powerful analytical queries. We'll go deep on these next week — here's a taste:

ROLLUP — Automatic Subtotals¶

SELECT NVL(d.dept_name, '** ALL **') AS department,
       NVL(doc.doctor_name, '** All Doctors **') AS doctor,
       COUNT(*) AS appointments,
       SUM(f.total_cost) AS revenue
FROM fact_appointments f
JOIN dim_department d ON f.dept_key = d.dept_key
JOIN dim_doctor doc ON f.doctor_key = doc.doctor_key
GROUP BY ROLLUP(d.dept_name, doc.doctor_name);

RANK — Top-N Analysis¶

-- Top 3 doctors by revenue per department
SELECT * FROM (
    SELECT d.dept_name, doc.doctor_name, SUM(f.total_cost) AS revenue,
           RANK() OVER (PARTITION BY d.dept_name ORDER BY SUM(f.total_cost) DESC) AS rnk
    FROM fact_appointments f
    JOIN dim_department d ON f.dept_key = d.dept_key
    JOIN dim_doctor doc ON f.doctor_key = doc.doctor_key
    GROUP BY d.dept_name, doc.doctor_name
) WHERE rnk <= 3;

Next week: Full deep-dive into ROLLUP, CUBE, RANK, DENSE_RANK, LAG, and window functions.

Part 7: Connecting to Your Project (5 min)¶

Phase 2 of your semester project requires exactly what we covered today:

Deliverable	What you need
Star schema	1 fact table + 3-5 dimension tables for YOUR domain
ETL procedures	PL/SQL procedures to extract from Phase 1 tables → load into star schema
Analytical queries	5+ queries using ROLLUP, RANK, CUBE, etc.
Documentation	Star schema diagram + explanation of each dimension

Homework¶

Read: Kimball Ch 1 (pp. 1-35) if you haven't yet
Read for next week: Kimball Ch 2 (pp. 37-68) + Ch 3 (pp. 69-100)
With your team: Start thinking about your star schema using the 4-step process:
- What business process will you model?
- What's the grain?
- What are the dimensions?
- What are the facts/measures?
Download: Kimball Cheat Sheet — print it, keep it handy

Key Takeaways¶

OLTP ≠ OLAP — Different purposes, different designs, different systems
Star Schema = Fact + Dimensions — Facts are events/measures, dimensions are context
4-Step Process — Business process → Grain → Dimensions → Facts (never skip grain!)
ETL is PL/SQL — You already have the skills
Surrogate keys, not business keys — For flexibility and history tracking
This is what gets you hired — Star schemas and ETL appear in every data engineering job posting

IS480 — Advanced Database Management | Week 10 | CSULB Spring 2026 | Prof. Jose Pineda