Data Warehousing & Star Schema¶

Coming in Weeks 10-12

This module covers the concepts and techniques needed for Phase 2 of the semester project.

What You'll Learn¶

OLTP vs OLAP — Why transactional databases aren't ideal for analytics
Star Schema Design — Fact tables, dimension tables, and how they fit together
ETL with PL/SQL — Extract from your 3NF tables, Transform, and Load into star schema
Analytical SQL — GROUP BY, ROLLUP, CUBE, RANK(), DENSE_RANK(), window functions

Key Concepts¶

OLTP vs OLAP¶

	OLTP (Phase 1)	OLAP (Phase 2)
Purpose	Day-to-day operations	Analysis & reporting
Schema	Normalized (3NF)	Denormalized (Star)
Queries	Short, targeted (INSERT/UPDATE)	Complex, aggregated (GROUP BY, JOIN)
Example	"Enroll student 101 in class 10125"	"What's the average GPA by department over 3 years?"

Star Schema¶

           ┌─────────────┐
           │  dim_date    │
           └──────┬──────┘
                  │
┌──────────┐   ┌──┴───────────┐   ┌──────────────┐
│dim_student├───┤  fact_table  ├───┤ dim_category  │
└──────────┘   └──┬───────────┘   └──────────────┘
                  │
           ┌──────┴──────┐
           │ dim_location │
           └─────────────┘

Fact table — Contains measurable events (transactions, enrollments, sales) with foreign keys to dimensions and numeric measures
Dimension tables — Descriptive context (who, what, when, where)

ETL Process¶

-- Example: PL/SQL ETL procedure
CREATE OR REPLACE PROCEDURE load_fact_enrollments AS
BEGIN
    INSERT INTO fact_enrollments (student_key, course_key, date_key, grade_value, credit_hours)
    SELECT s.snum, c.cnum, TO_NUMBER(TO_CHAR(SYSDATE, 'YYYYMMDD')),
           CASE e.grade WHEN 'A' THEN 4 WHEN 'B' THEN 3 WHEN 'C' THEN 2 WHEN 'D' THEN 1 ELSE 0 END,
           c.crhr
    FROM Enrollments e
    JOIN SchClasses sc ON e.classnum = sc.classnum
    JOIN Courses c ON sc.cnum = c.cnum
    JOIN Students s ON e.snum = s.snum
    WHERE e.grade IS NOT NULL;

    COMMIT;
    DBMS_OUTPUT.PUT_LINE('Loaded ' || SQL%ROWCOUNT || ' rows into fact_enrollments');
END;
/

Three Types of Facts¶

Understanding fact types is critical — it determines how you can aggregate your data.

✅ Additive Facts¶

Can be summed across ALL dimensions. These are the most common and most useful.

Example	Why Additive
Revenue	$100 (Store A) + $200 (Store B) = $300 total ✅
Quantity Sold	10 (Monday) + 20 (Tuesday) = 30 total ✅
Cost	Sum by product, by store, by time — all valid ✅

⚠️ Semi-Additive Facts¶

Can be summed across some dimensions, but NOT across time. These are typically snapshot measurements — what something is at a moment, not what happened.

Example	Why Semi-Additive
Account Balance	$5,000 Monday + $5,000 Tuesday ≠ $10,000! Use AVG or point-in-time.
Inventory Level	100 units today + 100 units yesterday ≠ 200 units. But SUM across warehouses works.
Headcount	50 employees in Dept A + 30 in Dept B = 80 total ✅. But NOT across months.

❌ Non-Additive Facts¶

Can never be meaningfully summed across any dimension. Must be computed from components.

Example	Why Non-Additive
Unit Price	Can't sum or average prices without weighting by quantity
Profit Margin %	Ratios can't be summed
Temperature	70°F + 80°F = 150°F makes no sense

The Unit Price Trap¶

This is the classic mistake students make:

	Qty	Unit Price	Revenue
Sale 1	1	$1.00	$1.00
Sale 2	4	$0.50	$2.00
Total	5	???	$3.00

❌ Simple average: ($1.00 + $0.50) / 2 = $0.75 — WRONG
✅ Weighted average: $3.00 / 5 = $0.60 — CORRECT

Kimball's Golden Rule

"Ratio of the sums, not sum of the ratios."

Store the components (revenue + quantity) as additive facts in the fact table, then compute the ratio in the query. Never store a pre-computed ratio as a fact if you can avoid it.

-- ❌ WRONG: Averaging a non-additive fact
SELECT AVG(unit_price) FROM fact_sales;  -- Unweighted, misleading!

-- ✅ RIGHT: Compute from additive components
SELECT SUM(revenue) / SUM(quantity) AS weighted_avg_price 
FROM fact_sales;

Applying This to Your Exercises¶

When designing your star schemas, ask: "Can I SUM this across every dimension?"

Fact	Type	Why
CourseGrade (Millennium)	⚠️ Semi-additive	AVG across students, not SUM
FaceValue (Fitchwood)	✅ Additive	Total face value across agents, territories, time
Commission % (Fitchwood)	❌ Non-additive	Store dollar amounts instead
InitComm % (Fitchwood)	❌ Non-additive	Compute: FaceValue × InitComm%

Analytical Queries¶

-- ROLLUP: Subtotals by department and course
SELECT c.dept, c.ctitle, COUNT(*) as enrollments, AVG(s.gpa) as avg_gpa
FROM fact_enrollments f
JOIN dim_course c ON f.course_key = c.cnum
JOIN dim_student s ON f.student_key = s.snum
GROUP BY ROLLUP(c.dept, c.ctitle);

-- RANK: Top students by GPA within each major
SELECT sname, majorid, gpa,
       RANK() OVER (PARTITION BY majorid ORDER BY gpa DESC) as rank_in_major
FROM Students
WHERE majorid IS NOT NULL;

Resources¶

Kimball, R. — The Data Warehouse Toolkit (the star schema bible)
Oracle documentation: ROLLUP, CUBE, analytical functions
Project Spec — Phase 2

Detailed content will be added as we cover this material in class.