Kimball Dimensional Modeling — Student Cheat Sheet¶
Based on The Data Warehouse Toolkit, 3rd Edition by Ralph Kimball & Margy Ross.
Prepared for IS480 — Advanced Database Management, CSULB Spring 2026.
The Big Idea¶
Organizations have two different data needs that require two different systems:
| OLTP (Operational) | OLAP (Analytical / Data Warehouse) | |
|---|---|---|
| Purpose | Run the business | Analyze the business |
| Users | Clerks, apps, front-line workers | Managers, analysts, executives |
| Operations | INSERT, UPDATE, DELETE (one row at a time) | SELECT with GROUP BY (millions of rows) |
| Schema | Normalized (3NF) — no redundancy | Denormalized (Star) — redundancy is OK |
| Data | Current state only | Historical snapshots over time |
| Speed goal | Fast writes | Fast reads & aggregations |
Key insight: You put data INTO operational systems. You get data OUT OF warehouses.
The 4-Step Design Process¶
Every dimensional model follows these four steps (Kimball Ch 2-3):
Step 1: Select the Business Process¶
- What activity generates measurable events? (e.g., sales, enrollments, appointments)
- Business processes are verbs: selling, ordering, registering, billing
- ❌ NOT a department ("marketing") — that's who uses the data, not what generates it
Step 2: Declare the Grain¶
- "What does one row in the fact table represent?"
- Express in business terms: "One row per product sold per transaction"
- Always start with the finest (atomic) grain — you can always roll up, but can't drill down from summaries
- ⚠️ Most common design mistake: Skipping this step or mixing grains in one table
Step 3: Identify the Dimensions¶
- Ask: "Who, what, where, when, why, how?" for each event
- Common dimensions: Date, Product, Customer, Employee, Location, Promotion
- Each dimension takes on exactly one value per fact row
- If a dimension would cause extra rows → wrong grain or wrong dimension
Step 4: Identify the Facts¶
- Ask: "What is the process measuring?"
- Facts are numeric values you want to SUM, AVG, COUNT
- Must be true to the grain declared in Step 2
- If a fact belongs to a different grain → put it in a separate fact table
Star Schema Anatomy¶
┌──────────────┐
│ dim_date │ ← DIMENSION: descriptive context
│──────────────│ (wide, denormalized, ~thousands of rows)
│ date_key PK │
│ full_date │
│ day_of_week │
│ month, qtr │
│ year │
│ is_holiday │
└──────┬───────┘
│
┌──────────────┐ ┌──────┴──────────────┐ ┌──────────────┐
│ dim_product │ │ fact_sales │ │ dim_store │
│──────────────│ │─────────────────────│ │──────────────│
│ product_key ├──┤ date_key FK ├──┤ store_key │
│ product_name │ │ product_key FK │ │ store_name │
│ brand │ │ store_key FK │ │ city, state │
│ category │ │ promotion_key FK │ │ sq_footage │
│ department │ │ │ └──────────────┘
│ package_type │ │ ── MEASURES ── │
└──────────────┘ │ sales_quantity │ ← FACTS: numeric, additive
│ unit_price │
│ extended_amount │
│ extended_cost │
│ gross_profit │
└─────────────────────┘
↑
FACT TABLE: events/measurements
(narrow, very tall — millions/billions of rows)
Fact Table Rules¶
| Rule | Details |
|---|---|
| One row = one event | Each row represents a real-world measurement event |
| Foreign keys to dimensions | One FK column per dimension |
| Numeric measures | SUM-able, AVG-able numbers |
| Grain must be uniform | Never mix different grains in the same table |
| No text in fact tables | Descriptive text belongs in dimensions |
Three Types of Facts¶
| Type | Can SUM across... | Example |
|---|---|---|
| Additive | ALL dimensions | Revenue, quantity, cost |
| Semi-additive | Some dimensions (not time) | Account balance, inventory level |
| Non-additive | NO dimensions (compute from components) | Unit price, ratios, percentages |
💡 Pro tip: Store the additive components (numerator & denominator), then compute ratios in your query or BI tool. "Ratio of the sums, not sum of the ratios."
Three Types of Fact Tables¶
| Type | What it captures | Row behavior |
|---|---|---|
| Transaction | Individual events (a sale, a click) | Inserted once, never updated |
| Periodic Snapshot | State at regular intervals (daily balance) | One row per period, even if nothing happened |
| Accumulating Snapshot | Pipeline milestones (order → ship → deliver) | Row is updated as milestones are reached |
Factless Fact Tables¶
- Fact table with no numeric measures — just foreign keys
- Use case 1: Event tracking — "Student attended class on this date" (no number, but the event matters)
- Use case 2: Coverage analysis — "What didn't happen?" (e.g., which products were on promotion but had zero sales?)
Dimension Table Rules¶
| Rule | Details |
|---|---|
| Single primary key | Use a surrogate key (integer, assigned sequentially) |
| Wide and flat | Many descriptive columns, denormalized |
| Text-heavy | Verbose descriptions, not codes |
| Denormalized | Flatten hierarchies into one table (e.g., product + category + department all in dim_product) |
| Relatively small | Thousands to low millions of rows |
Why Surrogate Keys?¶
Don't use the business key (e.g., employee_id) as the dimension PK because:
- Business keys change (employee rehired with new ID)
- Multiple source systems may have conflicting keys
- You need multiple rows for the same entity when tracking changes (SCD Type 2)
- Surrogate keys are simpler and uniformly formatted
Exception: The dim_date table can use YYYYMMDD integer as its key (it's stable and meaningful).
The Date Dimension¶
Every star schema needs one. Pre-populate it with 10-20 years of days (~7,300 rows for 20 years).
Key columns:
| Column | Example | Why? |
|---|---|---|
| date_key | 20260318 | PK (YYYYMMDD format) |
| full_date | 2026-03-18 | Actual DATE type |
| day_of_week | Wednesday | Compare Mon vs Sun sales |
| month_name | March | GROUP BY month |
| quarter | Q1 | Quarterly reports |
| year | 2026 | Year-over-year analysis |
| is_weekend | Y/N | Filter weekdays/weekends |
| is_holiday | Y/N | Holiday sales patterns |
| fiscal_quarter | FQ4 | If fiscal ≠ calendar year |
💡 Why not just use a DATE column in the fact table? Because you can't easily GROUP BY "is_holiday" or "fiscal_quarter" with a raw date. The dimension pre-computes these attributes.
Slowly Changing Dimensions (SCD)¶
What happens when dimension data changes? (e.g., employee changes department, customer moves to new city)
| Type | Strategy | Tradeoff |
|---|---|---|
| Type 0 | Retain original value forever | No history tracking |
| Type 1 | Overwrite old value | Lose history, but always current |
| Type 2 | Add new row with new surrogate key | Most common. Full history preserved. Requires effective/expiration dates. |
| Type 3 | Add "previous" column | Only tracks one change (current + previous) |
Type 2 example: Doctor changes department
| doctor_key | doctor_id | name | department | effective_date | expiration_date | current |
|---|---|---|---|---|---|---|
| 101 | D-5432 | Dr. Park | Cardiology | 2020-01-15 | 2025-06-30 | N |
| 247 | D-5432 | Dr. Park | Neurology | 2025-07-01 | 9999-12-31 | Y |
Note: Two surrogate keys (101, 247) for the same doctor. Historical facts link to key 101 (Cardiology era), new facts link to key 247 (Neurology era).
Star vs Snowflake¶
| Star Schema | Snowflake Schema | |
|---|---|---|
| Dimensions | Denormalized (flat) | Normalized (broken into sub-tables) |
| Joins | Fewer (fact → dimension) | More (fact → dimension → sub-dimension) |
| Query speed | Faster ✅ | Slower ❌ |
| Storage | More (redundancy) | Less |
| Ease of use | Simpler ✅ | More complex |
| Kimball's recommendation | Preferred | Use sparingly |
STAR: SNOWFLAKE:
dim_product dim_product
├── product_name ├── product_name
├── brand_name ←flattened ├── brand_key → dim_brand
├── category_name ←flattened ├── category_key → dim_category
├── department_name ←flattened └── ...
└── ... dim_brand
├── brand_name
└── brand_country
💡 Kimball strongly recommends star over snowflake for analysis. Snowflaking saves storage but adds JOIN complexity and slows queries. The storage savings are negligible compared to the massive fact table.
Estimating Fact Table Size¶
Quick estimation formula:
Total rows = Product of unique values across dimensions
× Fill rate (what % of combinations actually occur)
Example (Retail):
1,000 stores × 60,000 products × 365 days × 10% fill rate
= ~2.2 billion rows/year
Bytes = Rows × Columns × Avg bytes/column
ETL: Extract, Transform, Load¶
The process of moving data from OLTP → Star Schema:
EXTRACT → TRANSFORM → LOAD
Read from Clean & reshape Insert into
source systems • Deduplicate star schema
• Standardize formats • Dimensions FIRST
• Derive new fields • Facts SECOND
• Handle NULLs • Index & optimize
• Map surrogate keys
Key ETL principles: 1. Load dimensions before facts (facts need dimension FKs) 2. Handle data quality — inconsistent formats, missing values, duplicates 3. Assign surrogate keys during dimension loads 4. Track changes — implement SCD logic during dimension updates 5. Initial load vs incremental — first load is full extract; subsequent loads capture only changes (CDC = Changed Data Capture)
10 Essential Rules for Dimensional Modeling¶
(From Kimball, Chapter 2)
- Use atomic facts — Start with the finest grain
- Create single-process fact tables — One business process per fact table
- Include a date dimension for every fact table
- Enforce consistent grain — Don't mix grains
- Disallow null foreign keys in fact tables (use "Unknown" dimension row instead)
- Honor hierarchies in dimensions (flatten them, don't snowflake)
- Decode dimension tables — Use descriptive text, not codes
- Use surrogate keys for all dimensions (except date)
- Conform dimensions — Same dimension means same thing everywhere
- Balance requirements with actual data — Design from both business needs AND data realities
Quick Reference: Vocabulary¶
| Term | Definition |
|---|---|
| Grain | What one row in the fact table represents |
| Measure / Fact | A numeric value to aggregate (SUM, AVG) |
| Dimension | Descriptive context (who, what, where, when) |
| Surrogate key | Artificial integer PK for dimensions |
| Natural key | The original business key (e.g., SSN, employee_id) |
| Degenerate dimension | A dimension with no table — just a value in the fact table (e.g., invoice #) |
| Conformed dimension | A dimension shared across multiple fact tables with identical meaning |
| Additive fact | Can be summed across all dimensions |
| Semi-additive fact | Can be summed across some dimensions (not time) |
| Non-additive fact | Cannot be summed — must compute from components |
| SCD | Slowly Changing Dimension — strategy for handling dimension changes |
| Snowflake | Normalizing dimension tables (generally avoided) |
| Bus matrix | Enterprise-wide plan showing which dimensions apply to which business processes |
Kimball Reading Guide for IS480¶
| Week | Reading | Key Takeaway |
|---|---|---|
| 10 | Ch 1 (pp. 1-35) | Why warehouses exist, star schema basics, DW architecture |
| 11 | Ch 2 (pp. 37-68) | Complete technique reference — 4-step process, all fact/dimension types |
| 11 | Ch 3 (pp. 69-100) | Retail case study — see the 4 steps applied to a real business |
This cheat sheet covers the essential Kimball concepts for IS480. For the complete treatment with industry case studies, refer to the full textbook.