Data Modeling
Star schema, snowflake schema, data vault 2.0, slowly changing dimensions, normalization vs denormalization, and Kimball vs Inmon methodology.
Data Modeling
Data modeling for analytics determines how data is structured for querying and reporting. The right model balances query performance, flexibility, and maintainability. This article covers the major approaches used in modern data warehouses.
Star Schema
The star schema is the most widely used dimensional model. A central fact table stores measurable events, surrounded by dimension tables that provide descriptive context.
-- Fact table: one row per order line item
CREATE TABLE fact_order_lines (
order_line_id BIGINT PRIMARY KEY,
order_id BIGINT NOT NULL,
customer_key INT REFERENCES dim_customers(customer_key),
product_key INT REFERENCES dim_products(product_key),
date_key INT REFERENCES dim_dates(date_key),
store_key INT REFERENCES dim_stores(store_key),
quantity INT NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
discount_amount DECIMAL(10, 2) DEFAULT 0,
total_amount DECIMAL(12, 2) NOT NULL
);
-- Dimension table: descriptive attributes about customers
CREATE TABLE dim_customers (
customer_key INT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(50) NOT NULL, -- Natural key
name VARCHAR(200),
email VARCHAR(200),
segment VARCHAR(50),
city VARCHAR(100),
country VARCHAR(100),
created_at TIMESTAMP,
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN DEFAULT TRUE
);
-- Dimension table: calendar dates
CREATE TABLE dim_dates (
date_key INT PRIMARY KEY,
full_date DATE NOT NULL,
year INT,
quarter INT,
month INT,
month_name VARCHAR(20),
week_of_year INT,
day_of_week INT,
day_name VARCHAR(20),
is_weekend BOOLEAN,
is_holiday BOOLEAN,
fiscal_year INT,
fiscal_quarter INT
);Why Star Schema Works
- Simple joins: Queries only join fact to dimension (no multi-hop joins)
- Predictable performance: Query patterns are well-understood by optimizers
- Business-friendly: Dimensions map to natural business concepts
-- Typical star schema query: monthly revenue by product category
SELECT
d.year,
d.month_name,
p.category,
SUM(f.total_amount) AS revenue,
COUNT(DISTINCT f.order_id) AS order_count
FROM fact_order_lines f
JOIN dim_dates d ON f.date_key = d.date_key
JOIN dim_products p ON f.product_key = p.product_key
WHERE d.year = 2025
GROUP BY d.year, d.month_name, p.category
ORDER BY d.year, d.month, p.category;Snowflake Schema
The snowflake schema normalizes dimension tables into sub-dimensions. For example, instead of storing category_name directly on dim_products, you create a separate dim_categories table.
-- Snowflake: normalized product dimension
CREATE TABLE dim_categories (
category_key INT PRIMARY KEY,
category_name VARCHAR(100),
department VARCHAR(100)
);
CREATE TABLE dim_products (
product_key INT PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
category_key INT REFERENCES dim_categories(category_key),
brand VARCHAR(100),
unit_cost DECIMAL(10, 2)
);Star vs Snowflake
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension structure | Denormalized (flat) | Normalized (multiple tables) |
| Query complexity | Simple (fewer joins) | More joins required |
| Storage | More redundancy | Less redundancy |
| Query performance | Generally faster | Can be slower (more joins) |
| Maintenance | Easier to understand | More complex |
| Best for | BI tools, ad hoc queries | Large dimensions with shared attributes |
In practice, star schema is preferred for most analytics use cases. Modern warehouses handle the extra storage cheaply, and simpler queries win.
Data Vault 2.0
Data Vault is a modeling methodology designed for large-scale, auditable data warehouses that integrate data from many sources. It separates structure (hubs, links) from context (satellites).
Core Components
-- Hub: business keys (unique business entities)
CREATE TABLE hub_customer (
hub_customer_hk CHAR(32) PRIMARY KEY, -- Hash of business key
customer_bk VARCHAR(50) NOT NULL, -- Business key
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL
);
-- Satellite: descriptive attributes (change-tracked)
CREATE TABLE sat_customer_details (
hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
load_date TIMESTAMP NOT NULL,
load_end_date TIMESTAMP,
record_source VARCHAR(100),
name VARCHAR(200),
email VARCHAR(200),
segment VARCHAR(50),
hash_diff CHAR(32), -- Hash of all attributes for change detection
PRIMARY KEY (hub_customer_hk, load_date)
);
-- Link: relationships between business entities
CREATE TABLE link_order_customer (
link_order_cust_hk CHAR(32) PRIMARY KEY,
hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL
);When to Use Data Vault
- Multiple source systems with conflicting definitions
- Regulatory requirements for full auditability
- Frequent source system changes
- Large enterprises with complex data landscapes
Data Vault is typically used as a raw/integration layer, with a star schema built on top for reporting.
Slowly Changing Dimensions (SCD)
When dimension attributes change over time, how do you track those changes?
SCD Type 1: Overwrite
Simply overwrite the old value. No history is kept.
-- SCD Type 1: just update the record
UPDATE dim_customers
SET email = 'newemail@example.com',
updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C-1001';Use when: History does not matter (e.g., fixing a typo).
SCD Type 2: Add New Row
Create a new row for each change, preserving full history.
-- SCD Type 2: expire old row, insert new row
-- Step 1: Close the current record
UPDATE dim_customers
SET effective_to = CURRENT_DATE - INTERVAL '1 day',
is_current = FALSE
WHERE customer_id = 'C-1001' AND is_current = TRUE;
-- Step 2: Insert new version
INSERT INTO dim_customers (
customer_key, customer_id, name, email, segment,
city, country, created_at, effective_from, effective_to, is_current
) VALUES (
NEXT VALUE FOR customer_key_seq,
'C-1001', 'Alice Smith', 'newemail@example.com', 'Premium',
'Auckland', 'New Zealand', '2023-01-15',
CURRENT_DATE, NULL, TRUE
);Use when: You need to track how attributes changed over time (most common).
SCD Type 3: Add New Column
Add a column to store the previous value.
-- SCD Type 3: store previous and current
ALTER TABLE dim_customers ADD COLUMN previous_segment VARCHAR(50);
UPDATE dim_customers
SET previous_segment = segment,
segment = 'Premium',
segment_changed_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C-1001';Use when: You only care about the immediately previous value.
SCD Type 4: Separate History Table
Keep the current dimension clean, with a separate table for history.
-- SCD Type 4: current dimension + history table
-- dim_customers contains only current records
-- dim_customers_history contains all historical versions
INSERT INTO dim_customers_history
SELECT *, CURRENT_TIMESTAMP AS archived_at
FROM dim_customers
WHERE customer_id = 'C-1001';
UPDATE dim_customers
SET segment = 'Premium', updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C-1001';Use when: Query performance on current data is critical and history is rarely accessed.
Normalization vs Denormalization
For OLTP (Transactional Systems)
Normalize to 3NF to minimize redundancy and maintain data integrity:
-- 3NF: each fact stored once
-- orders -> order_lines -> products -> categories
-- No redundant data, enforced by foreign keysFor OLAP (Analytical Systems)
Denormalize for query performance and simplicity:
-- Denormalized analytical table
CREATE TABLE analytics.wide_orders AS
SELECT
o.order_id,
o.order_date,
c.name AS customer_name,
c.segment AS customer_segment,
c.country AS customer_country,
p.product_name,
p.category AS product_category,
p.brand AS product_brand,
ol.quantity,
ol.unit_price,
ol.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_lines ol ON o.order_id = ol.order_id
JOIN products p ON ol.product_id = p.product_id;Kimball vs Inmon
Two foundational methodologies define how enterprise data warehouses are built.
Kimball (Bottom-Up)
Build dimensional models (star schemas) organized by business process. The warehouse is a collection of conformed dimension models.
Source → Staging → Dimensional Models (star schemas) → BI Tools
├── Sales mart
├── Marketing mart
└── Finance martKey principles:
- Business process-oriented
- Conformed dimensions shared across marts
- Faster to deliver value
- Dimensional modeling (facts + dimensions)
Inmon (Top-Down)
Build a normalized enterprise data warehouse first, then derive dimensional marts.
Source → Staging → Enterprise DW (3NF) → Dimensional Marts → BI Tools
(single source of truth)Key principles:
- Enterprise-wide integrated model
- Normalized (3NF) central warehouse
- Data marts derived from the warehouse
- More upfront design effort
Comparison
| Aspect | Kimball | Inmon |
|---|---|---|
| Approach | Bottom-up | Top-down |
| Central store | Dimensional models | Normalized 3NF |
| Time to value | Faster (build one mart at a time) | Slower (enterprise model first) |
| Complexity | Lower initial complexity | Higher initial complexity |
| Flexibility | Harder to refactor later | Easier to add new subjects |
| Modern relevance | dbt + star schemas | Data vault + presentation layer |
Modern Pragmatic Approach
Most modern teams take a hybrid approach:
-- Layer 1: Staging (raw, source-aligned)
-- stg_stripe_payments, stg_shopify_orders
-- Layer 2: Intermediate (business logic, cleaned)
-- int_payments_mapped, int_orders_enriched
-- Layer 3: Marts (dimensional, consumer-facing)
-- fct_orders, dim_customers, dim_productsThis three-layer pattern, popularized by dbt, takes the best of both worlds: the business-process focus of Kimball with the clear separation and auditability that Inmon advocates.
Choosing the Right Model
| Scenario | Recommended Model |
|---|---|
| BI dashboards and ad hoc queries | Star schema (Kimball) |
| Multi-source enterprise integration | Data Vault + star schema |
| Real-time analytics | Denormalized wide tables |
| ML feature store | Wide, denormalized feature tables |
| Regulatory compliance and audit | Data Vault 2.0 |
| Small team, fast iteration | Simplified star schema via dbt |
The right model depends on your team size, data complexity, compliance needs, and how quickly you need to deliver value. Start simple with star schemas and evolve toward Data Vault only when the complexity demands it.