Skip to content

Data Warehousing & Star Schema

Coming in Weeks 10-12

This module covers the concepts and techniques needed for Phase 2 of the semester project.

What You'll Learn

  • OLTP vs OLAP — Why transactional databases aren't ideal for analytics
  • Star Schema Design — Fact tables, dimension tables, and how they fit together
  • ETL with PL/SQL — Extract from your 3NF tables, Transform, and Load into star schema
  • Analytical SQLGROUP BY, ROLLUP, CUBE, RANK(), DENSE_RANK(), window functions

Key Concepts

OLTP vs OLAP

OLTP (Phase 1) OLAP (Phase 2)
Purpose Day-to-day operations Analysis & reporting
Schema Normalized (3NF) Denormalized (Star)
Queries Short, targeted (INSERT/UPDATE) Complex, aggregated (GROUP BY, JOIN)
Example "Enroll student 101 in class 10125" "What's the average GPA by department over 3 years?"

Star Schema

           ┌─────────────┐
           │  dim_date    │
           └──────┬──────┘
┌──────────┐   ┌──┴───────────┐   ┌──────────────┐
│dim_student├───┤  fact_table  ├───┤ dim_category  │
└──────────┘   └──┬───────────┘   └──────────────┘
           ┌──────┴──────┐
           │ dim_location │
           └─────────────┘
  • Fact table — Contains measurable events (transactions, enrollments, sales) with foreign keys to dimensions and numeric measures
  • Dimension tables — Descriptive context (who, what, when, where)

ETL Process

-- Example: PL/SQL ETL procedure
CREATE OR REPLACE PROCEDURE load_fact_enrollments AS
BEGIN
    INSERT INTO fact_enrollments (student_key, course_key, date_key, grade_value, credit_hours)
    SELECT s.snum, c.cnum, TO_NUMBER(TO_CHAR(SYSDATE, 'YYYYMMDD')),
           CASE e.grade WHEN 'A' THEN 4 WHEN 'B' THEN 3 WHEN 'C' THEN 2 WHEN 'D' THEN 1 ELSE 0 END,
           c.crhr
    FROM Enrollments e
    JOIN SchClasses sc ON e.classnum = sc.classnum
    JOIN Courses c ON sc.cnum = c.cnum
    JOIN Students s ON e.snum = s.snum
    WHERE e.grade IS NOT NULL;

    COMMIT;
    DBMS_OUTPUT.PUT_LINE('Loaded ' || SQL%ROWCOUNT || ' rows into fact_enrollments');
END;
/

Three Types of Facts

Understanding fact types is critical — it determines how you can aggregate your data.

✅ Additive Facts

Can be summed across ALL dimensions. These are the most common and most useful.

Example Why Additive
Revenue $100 (Store A) + $200 (Store B) = $300 total ✅
Quantity Sold 10 (Monday) + 20 (Tuesday) = 30 total ✅
Cost Sum by product, by store, by time — all valid ✅

⚠️ Semi-Additive Facts

Can be summed across some dimensions, but NOT across time. These are typically snapshot measurements — what something is at a moment, not what happened.

Example Why Semi-Additive
Account Balance $5,000 Monday + $5,000 Tuesday ≠ $10,000! Use AVG or point-in-time.
Inventory Level 100 units today + 100 units yesterday ≠ 200 units. But SUM across warehouses works.
Headcount 50 employees in Dept A + 30 in Dept B = 80 total ✅. But NOT across months.

❌ Non-Additive Facts

Can never be meaningfully summed across any dimension. Must be computed from components.

Example Why Non-Additive
Unit Price Can't sum or average prices without weighting by quantity
Profit Margin % Ratios can't be summed
Temperature 70°F + 80°F = 150°F makes no sense

The Unit Price Trap

This is the classic mistake students make:

Qty Unit Price Revenue
Sale 1 1 $1.00 $1.00
Sale 2 4 $0.50 $2.00
Total 5 ??? $3.00
  • ❌ Simple average: ($1.00 + $0.50) / 2 = $0.75 — WRONG
  • ✅ Weighted average: $3.00 / 5 = $0.60 — CORRECT

Kimball's Golden Rule

"Ratio of the sums, not sum of the ratios."

Store the components (revenue + quantity) as additive facts in the fact table, then compute the ratio in the query. Never store a pre-computed ratio as a fact if you can avoid it.

-- ❌ WRONG: Averaging a non-additive fact
SELECT AVG(unit_price) FROM fact_sales;  -- Unweighted, misleading!

-- ✅ RIGHT: Compute from additive components
SELECT SUM(revenue) / SUM(quantity) AS weighted_avg_price 
FROM fact_sales;

Applying This to Your Exercises

When designing your star schemas, ask: "Can I SUM this across every dimension?"

Fact Type Why
CourseGrade (Millennium) ⚠️ Semi-additive AVG across students, not SUM
FaceValue (Fitchwood) ✅ Additive Total face value across agents, territories, time
Commission % (Fitchwood) ❌ Non-additive Store dollar amounts instead
InitComm % (Fitchwood) ❌ Non-additive Compute: FaceValue × InitComm%

Analytical Queries

-- ROLLUP: Subtotals by department and course
SELECT c.dept, c.ctitle, COUNT(*) as enrollments, AVG(s.gpa) as avg_gpa
FROM fact_enrollments f
JOIN dim_course c ON f.course_key = c.cnum
JOIN dim_student s ON f.student_key = s.snum
GROUP BY ROLLUP(c.dept, c.ctitle);

-- RANK: Top students by GPA within each major
SELECT sname, majorid, gpa,
       RANK() OVER (PARTITION BY majorid ORDER BY gpa DESC) as rank_in_major
FROM Students
WHERE majorid IS NOT NULL;

Resources

  • Kimball, R. — The Data Warehouse Toolkit (the star schema bible)
  • Oracle documentation: ROLLUP, CUBE, analytical functions
  • Project Spec — Phase 2

Detailed content will be added as we cover this material in class.