Kimball Dimensional Modeling — Student Cheat Sheet¶

Based on The Data Warehouse Toolkit, 3rd Edition by Ralph Kimball & Margy Ross.
Prepared for IS480 — Advanced Database Management, CSULB Spring 2026.

The Big Idea¶

Organizations have two different data needs that require two different systems:

	OLTP (Operational)	OLAP (Analytical / Data Warehouse)
Purpose	Run the business	Analyze the business
Users	Clerks, apps, front-line workers	Managers, analysts, executives
Operations	INSERT, UPDATE, DELETE (one row at a time)	SELECT with GROUP BY (millions of rows)
Schema	Normalized (3NF) — no redundancy	Denormalized (Star) — redundancy is OK
Data	Current state only	Historical snapshots over time
Speed goal	Fast writes	Fast reads & aggregations

Key insight: You put data INTO operational systems. You get data OUT OF warehouses.

The 4-Step Design Process¶

Every dimensional model follows these four steps (Kimball Ch 2-3):

Step 1: Select the Business Process¶

What activity generates measurable events? (e.g., sales, enrollments, appointments)
Business processes are verbs: selling, ordering, registering, billing
❌ NOT a department ("marketing") — that's who uses the data, not what generates it

Step 2: Declare the Grain¶

"What does one row in the fact table represent?"
Express in business terms: "One row per product sold per transaction"
Always start with the finest (atomic) grain — you can always roll up, but can't drill down from summaries
⚠️ Most common design mistake: Skipping this step or mixing grains in one table

Step 3: Identify the Dimensions¶

Ask: "Who, what, where, when, why, how?" for each event
Common dimensions: Date, Product, Customer, Employee, Location, Promotion
Each dimension takes on exactly one value per fact row
If a dimension would cause extra rows → wrong grain or wrong dimension

Step 4: Identify the Facts¶

Ask: "What is the process measuring?"
Facts are numeric values you want to SUM, AVG, COUNT
Must be true to the grain declared in Step 2
If a fact belongs to a different grain → put it in a separate fact table

Star Schema Anatomy¶

                    ┌──────────────┐
                    │  dim_date    │  ← DIMENSION: descriptive context
                    │──────────────│     (wide, denormalized, ~thousands of rows)
                    │ date_key  PK │
                    │ full_date    │
                    │ day_of_week  │
                    │ month, qtr   │
                    │ year         │
                    │ is_holiday   │
                    └──────┬───────┘
                           │
  ┌──────────────┐  ┌──────┴──────────────┐  ┌──────────────┐
  │ dim_product  │  │    fact_sales       │  │ dim_store    │
  │──────────────│  │─────────────────────│  │──────────────│
  │ product_key  ├──┤ date_key       FK   ├──┤ store_key    │
  │ product_name │  │ product_key    FK   │  │ store_name   │
  │ brand        │  │ store_key      FK   │  │ city, state  │
  │ category     │  │ promotion_key  FK   │  │ sq_footage   │
  │ department   │  │                     │  └──────────────┘
  │ package_type │  │ ── MEASURES ──      │
  └──────────────┘  │ sales_quantity      │  ← FACTS: numeric, additive
                    │ unit_price          │
                    │ extended_amount     │
                    │ extended_cost       │
                    │ gross_profit        │
                    └─────────────────────┘
                         ↑
               FACT TABLE: events/measurements
               (narrow, very tall — millions/billions of rows)

Fact Table Rules¶

Rule	Details
One row = one event	Each row represents a real-world measurement event
Foreign keys to dimensions	One FK column per dimension
Numeric measures	SUM-able, AVG-able numbers
Grain must be uniform	Never mix different grains in the same table
No text in fact tables	Descriptive text belongs in dimensions

Three Types of Facts¶

Type	Can SUM across...	Example
Additive	ALL dimensions	Revenue, quantity, cost
Semi-additive	Some dimensions (not time)	Account balance, inventory level
Non-additive	NO dimensions (compute from components)	Unit price, ratios, percentages

💡 Pro tip: Store the additive components (numerator & denominator), then compute ratios in your query or BI tool. "Ratio of the sums, not sum of the ratios."

Three Types of Fact Tables¶

Type	What it captures	Row behavior
Transaction	Individual events (a sale, a click)	Inserted once, never updated
Periodic Snapshot	State at regular intervals (daily balance)	One row per period, even if nothing happened
Accumulating Snapshot	Pipeline milestones (order → ship → deliver)	Row is updated as milestones are reached

Factless Fact Tables¶

Fact table with no numeric measures — just foreign keys
Use case 1: Event tracking — "Student attended class on this date" (no number, but the event matters)
Use case 2: Coverage analysis — "What didn't happen?" (e.g., which products were on promotion but had zero sales?)

Dimension Table Rules¶

Rule	Details
Single primary key	Use a surrogate key (integer, assigned sequentially)
Wide and flat	Many descriptive columns, denormalized
Text-heavy	Verbose descriptions, not codes
Denormalized	Flatten hierarchies into one table (e.g., product + category + department all in `dim_product`)
Relatively small	Thousands to low millions of rows

Why Surrogate Keys?¶

Don't use the business key (e.g., employee_id) as the dimension PK because:

Business keys change (employee rehired with new ID)
Multiple source systems may have conflicting keys
You need multiple rows for the same entity when tracking changes (SCD Type 2)
Surrogate keys are simpler and uniformly formatted

Exception: The dim_date table can use YYYYMMDD integer as its key (it's stable and meaningful).

The Date Dimension¶

Every star schema needs one. Pre-populate it with 10-20 years of days (~7,300 rows for 20 years).

Key columns:

Column	Example	Why?
date_key	20260318	PK (YYYYMMDD format)
full_date	2026-03-18	Actual DATE type
day_of_week	Wednesday	Compare Mon vs Sun sales
month_name	March	GROUP BY month
quarter	Q1	Quarterly reports
year	2026	Year-over-year analysis
is_weekend	Y/N	Filter weekdays/weekends
is_holiday	Y/N	Holiday sales patterns
fiscal_quarter	FQ4	If fiscal ≠ calendar year

💡 Why not just use a DATE column in the fact table? Because you can't easily GROUP BY "is_holiday" or "fiscal_quarter" with a raw date. The dimension pre-computes these attributes.

Slowly Changing Dimensions (SCD)¶

What happens when dimension data changes? (e.g., employee changes department, customer moves to new city)

Type	Strategy	Tradeoff
Type 0	Retain original value forever	No history tracking
Type 1	Overwrite old value	Lose history, but always current
Type 2	Add new row with new surrogate key	Most common. Full history preserved. Requires effective/expiration dates.
Type 3	Add "previous" column	Only tracks one change (current + previous)

Type 2 example: Doctor changes department

doctor_key	doctor_id	name	department	effective_date	expiration_date	current
101	D-5432	Dr. Park	Cardiology	2020-01-15	2025-06-30	N
247	D-5432	Dr. Park	Neurology	2025-07-01	9999-12-31	Y

Note: Two surrogate keys (101, 247) for the same doctor. Historical facts link to key 101 (Cardiology era), new facts link to key 247 (Neurology era).

Star vs Snowflake¶

	Star Schema	Snowflake Schema
Dimensions	Denormalized (flat)	Normalized (broken into sub-tables)
Joins	Fewer (fact → dimension)	More (fact → dimension → sub-dimension)
Query speed	Faster ✅	Slower ❌
Storage	More (redundancy)	Less
Ease of use	Simpler ✅	More complex
Kimball's recommendation	Preferred	Use sparingly

STAR:                              SNOWFLAKE:
dim_product                        dim_product
├── product_name                   ├── product_name
├── brand_name      ←flattened     ├── brand_key → dim_brand
├── category_name   ←flattened     ├── category_key → dim_category
├── department_name ←flattened     └── ...
└── ...                            dim_brand
                                   ├── brand_name
                                   └── brand_country

💡 Kimball strongly recommends star over snowflake for analysis. Snowflaking saves storage but adds JOIN complexity and slows queries. The storage savings are negligible compared to the massive fact table.

Estimating Fact Table Size¶

Quick estimation formula:

Total rows = Product of unique values across dimensions
           × Fill rate (what % of combinations actually occur)

Example (Retail):
  1,000 stores × 60,000 products × 365 days × 10% fill rate
  = ~2.2 billion rows/year

Bytes = Rows × Columns × Avg bytes/column

ETL: Extract, Transform, Load¶

The process of moving data from OLTP → Star Schema:

EXTRACT          →    TRANSFORM           →    LOAD
Read from             Clean & reshape          Insert into
source systems        • Deduplicate            star schema
                      • Standardize formats    • Dimensions FIRST
                      • Derive new fields      • Facts SECOND
                      • Handle NULLs           • Index & optimize
                      • Map surrogate keys

Key ETL principles: 1. Load dimensions before facts (facts need dimension FKs) 2. Handle data quality — inconsistent formats, missing values, duplicates 3. Assign surrogate keys during dimension loads 4. Track changes — implement SCD logic during dimension updates 5. Initial load vs incremental — first load is full extract; subsequent loads capture only changes (CDC = Changed Data Capture)

10 Essential Rules for Dimensional Modeling¶

(From Kimball, Chapter 2)

Use atomic facts — Start with the finest grain
Create single-process fact tables — One business process per fact table
Include a date dimension for every fact table
Enforce consistent grain — Don't mix grains
Disallow null foreign keys in fact tables (use "Unknown" dimension row instead)
Honor hierarchies in dimensions (flatten them, don't snowflake)
Decode dimension tables — Use descriptive text, not codes
Use surrogate keys for all dimensions (except date)
Conform dimensions — Same dimension means same thing everywhere
Balance requirements with actual data — Design from both business needs AND data realities

Quick Reference: Vocabulary¶

Term	Definition
Grain	What one row in the fact table represents
Measure / Fact	A numeric value to aggregate (SUM, AVG)
Dimension	Descriptive context (who, what, where, when)
Surrogate key	Artificial integer PK for dimensions
Natural key	The original business key (e.g., SSN, employee_id)
Degenerate dimension	A dimension with no table — just a value in the fact table (e.g., invoice #)
Conformed dimension	A dimension shared across multiple fact tables with identical meaning
Additive fact	Can be summed across all dimensions
Semi-additive fact	Can be summed across some dimensions (not time)
Non-additive fact	Cannot be summed — must compute from components
SCD	Slowly Changing Dimension — strategy for handling dimension changes
Snowflake	Normalizing dimension tables (generally avoided)
Bus matrix	Enterprise-wide plan showing which dimensions apply to which business processes

Kimball Reading Guide for IS480¶

Week	Reading	Key Takeaway
10	Ch 1 (pp. 1-35)	Why warehouses exist, star schema basics, DW architecture
11	Ch 2 (pp. 37-68)	Complete technique reference — 4-step process, all fact/dimension types
11	Ch 3 (pp. 69-100)	Retail case study — see the 4 steps applied to a real business

This cheat sheet covers the essential Kimball concepts for IS480. For the complete treatment with industry case studies, refer to the full textbook.