Steven's Knowledge

Data Modeling

Star schema, snowflake schema, data vault 2.0, slowly changing dimensions, normalization vs denormalization, and Kimball vs Inmon methodology.

Data Modeling

Data modeling for analytics determines how data is structured for querying and reporting. The right model balances query performance, flexibility, and maintainability. This article covers the major approaches used in modern data warehouses.

Star Schema

The star schema is the most widely used dimensional model. A central fact table stores measurable events, surrounded by dimension tables that provide descriptive context.

-- Fact table: one row per order line item
CREATE TABLE fact_order_lines (
    order_line_id     BIGINT PRIMARY KEY,
    order_id          BIGINT NOT NULL,
    customer_key      INT REFERENCES dim_customers(customer_key),
    product_key       INT REFERENCES dim_products(product_key),
    date_key          INT REFERENCES dim_dates(date_key),
    store_key         INT REFERENCES dim_stores(store_key),
    quantity          INT NOT NULL,
    unit_price        DECIMAL(10, 2) NOT NULL,
    discount_amount   DECIMAL(10, 2) DEFAULT 0,
    total_amount      DECIMAL(12, 2) NOT NULL
);

-- Dimension table: descriptive attributes about customers
CREATE TABLE dim_customers (
    customer_key      INT PRIMARY KEY,  -- Surrogate key
    customer_id       VARCHAR(50) NOT NULL,  -- Natural key
    name              VARCHAR(200),
    email             VARCHAR(200),
    segment           VARCHAR(50),
    city              VARCHAR(100),
    country           VARCHAR(100),
    created_at        TIMESTAMP,
    effective_from    DATE NOT NULL,
    effective_to      DATE,
    is_current        BOOLEAN DEFAULT TRUE
);

-- Dimension table: calendar dates
CREATE TABLE dim_dates (
    date_key          INT PRIMARY KEY,
    full_date         DATE NOT NULL,
    year              INT,
    quarter           INT,
    month             INT,
    month_name        VARCHAR(20),
    week_of_year      INT,
    day_of_week       INT,
    day_name          VARCHAR(20),
    is_weekend        BOOLEAN,
    is_holiday        BOOLEAN,
    fiscal_year       INT,
    fiscal_quarter    INT
);

Why Star Schema Works

  • Simple joins: Queries only join fact to dimension (no multi-hop joins)
  • Predictable performance: Query patterns are well-understood by optimizers
  • Business-friendly: Dimensions map to natural business concepts
-- Typical star schema query: monthly revenue by product category
SELECT
    d.year,
    d.month_name,
    p.category,
    SUM(f.total_amount) AS revenue,
    COUNT(DISTINCT f.order_id) AS order_count
FROM fact_order_lines f
JOIN dim_dates d ON f.date_key = d.date_key
JOIN dim_products p ON f.product_key = p.product_key
WHERE d.year = 2025
GROUP BY d.year, d.month_name, p.category
ORDER BY d.year, d.month, p.category;

Snowflake Schema

The snowflake schema normalizes dimension tables into sub-dimensions. For example, instead of storing category_name directly on dim_products, you create a separate dim_categories table.

-- Snowflake: normalized product dimension
CREATE TABLE dim_categories (
    category_key   INT PRIMARY KEY,
    category_name  VARCHAR(100),
    department     VARCHAR(100)
);

CREATE TABLE dim_products (
    product_key    INT PRIMARY KEY,
    product_id     VARCHAR(50),
    product_name   VARCHAR(200),
    category_key   INT REFERENCES dim_categories(category_key),
    brand          VARCHAR(100),
    unit_cost      DECIMAL(10, 2)
);

Star vs Snowflake

AspectStar SchemaSnowflake Schema
Dimension structureDenormalized (flat)Normalized (multiple tables)
Query complexitySimple (fewer joins)More joins required
StorageMore redundancyLess redundancy
Query performanceGenerally fasterCan be slower (more joins)
MaintenanceEasier to understandMore complex
Best forBI tools, ad hoc queriesLarge dimensions with shared attributes

In practice, star schema is preferred for most analytics use cases. Modern warehouses handle the extra storage cheaply, and simpler queries win.

Data Vault 2.0

Data Vault is a modeling methodology designed for large-scale, auditable data warehouses that integrate data from many sources. It separates structure (hubs, links) from context (satellites).

Core Components

-- Hub: business keys (unique business entities)
CREATE TABLE hub_customer (
    hub_customer_hk    CHAR(32) PRIMARY KEY,  -- Hash of business key
    customer_bk        VARCHAR(50) NOT NULL,   -- Business key
    load_date          TIMESTAMP NOT NULL,
    record_source      VARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes (change-tracked)
CREATE TABLE sat_customer_details (
    hub_customer_hk    CHAR(32) REFERENCES hub_customer(hub_customer_hk),
    load_date          TIMESTAMP NOT NULL,
    load_end_date      TIMESTAMP,
    record_source      VARCHAR(100),
    name               VARCHAR(200),
    email              VARCHAR(200),
    segment            VARCHAR(50),
    hash_diff          CHAR(32),  -- Hash of all attributes for change detection
    PRIMARY KEY (hub_customer_hk, load_date)
);

-- Link: relationships between business entities
CREATE TABLE link_order_customer (
    link_order_cust_hk CHAR(32) PRIMARY KEY,
    hub_order_hk       CHAR(32) REFERENCES hub_order(hub_order_hk),
    hub_customer_hk    CHAR(32) REFERENCES hub_customer(hub_customer_hk),
    load_date          TIMESTAMP NOT NULL,
    record_source      VARCHAR(100) NOT NULL
);

When to Use Data Vault

  • Multiple source systems with conflicting definitions
  • Regulatory requirements for full auditability
  • Frequent source system changes
  • Large enterprises with complex data landscapes

Data Vault is typically used as a raw/integration layer, with a star schema built on top for reporting.

Slowly Changing Dimensions (SCD)

When dimension attributes change over time, how do you track those changes?

SCD Type 1: Overwrite

Simply overwrite the old value. No history is kept.

-- SCD Type 1: just update the record
UPDATE dim_customers
SET email = 'newemail@example.com',
    updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C-1001';

Use when: History does not matter (e.g., fixing a typo).

SCD Type 2: Add New Row

Create a new row for each change, preserving full history.

-- SCD Type 2: expire old row, insert new row
-- Step 1: Close the current record
UPDATE dim_customers
SET effective_to = CURRENT_DATE - INTERVAL '1 day',
    is_current = FALSE
WHERE customer_id = 'C-1001' AND is_current = TRUE;

-- Step 2: Insert new version
INSERT INTO dim_customers (
    customer_key, customer_id, name, email, segment,
    city, country, created_at, effective_from, effective_to, is_current
) VALUES (
    NEXT VALUE FOR customer_key_seq,
    'C-1001', 'Alice Smith', 'newemail@example.com', 'Premium',
    'Auckland', 'New Zealand', '2023-01-15',
    CURRENT_DATE, NULL, TRUE
);

Use when: You need to track how attributes changed over time (most common).

SCD Type 3: Add New Column

Add a column to store the previous value.

-- SCD Type 3: store previous and current
ALTER TABLE dim_customers ADD COLUMN previous_segment VARCHAR(50);

UPDATE dim_customers
SET previous_segment = segment,
    segment = 'Premium',
    segment_changed_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C-1001';

Use when: You only care about the immediately previous value.

SCD Type 4: Separate History Table

Keep the current dimension clean, with a separate table for history.

-- SCD Type 4: current dimension + history table
-- dim_customers contains only current records
-- dim_customers_history contains all historical versions

INSERT INTO dim_customers_history
SELECT *, CURRENT_TIMESTAMP AS archived_at
FROM dim_customers
WHERE customer_id = 'C-1001';

UPDATE dim_customers
SET segment = 'Premium', updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C-1001';

Use when: Query performance on current data is critical and history is rarely accessed.

Normalization vs Denormalization

For OLTP (Transactional Systems)

Normalize to 3NF to minimize redundancy and maintain data integrity:

-- 3NF: each fact stored once
-- orders -> order_lines -> products -> categories
-- No redundant data, enforced by foreign keys

For OLAP (Analytical Systems)

Denormalize for query performance and simplicity:

-- Denormalized analytical table
CREATE TABLE analytics.wide_orders AS
SELECT
    o.order_id,
    o.order_date,
    c.name AS customer_name,
    c.segment AS customer_segment,
    c.country AS customer_country,
    p.product_name,
    p.category AS product_category,
    p.brand AS product_brand,
    ol.quantity,
    ol.unit_price,
    ol.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_lines ol ON o.order_id = ol.order_id
JOIN products p ON ol.product_id = p.product_id;

Kimball vs Inmon

Two foundational methodologies define how enterprise data warehouses are built.

Kimball (Bottom-Up)

Build dimensional models (star schemas) organized by business process. The warehouse is a collection of conformed dimension models.

Source → Staging → Dimensional Models (star schemas) → BI Tools
                   ├── Sales mart
                   ├── Marketing mart
                   └── Finance mart

Key principles:

  • Business process-oriented
  • Conformed dimensions shared across marts
  • Faster to deliver value
  • Dimensional modeling (facts + dimensions)

Inmon (Top-Down)

Build a normalized enterprise data warehouse first, then derive dimensional marts.

Source → Staging → Enterprise DW (3NF) → Dimensional Marts → BI Tools
                   (single source of truth)

Key principles:

  • Enterprise-wide integrated model
  • Normalized (3NF) central warehouse
  • Data marts derived from the warehouse
  • More upfront design effort

Comparison

AspectKimballInmon
ApproachBottom-upTop-down
Central storeDimensional modelsNormalized 3NF
Time to valueFaster (build one mart at a time)Slower (enterprise model first)
ComplexityLower initial complexityHigher initial complexity
FlexibilityHarder to refactor laterEasier to add new subjects
Modern relevancedbt + star schemasData vault + presentation layer

Modern Pragmatic Approach

Most modern teams take a hybrid approach:

-- Layer 1: Staging (raw, source-aligned)
-- stg_stripe_payments, stg_shopify_orders

-- Layer 2: Intermediate (business logic, cleaned)
-- int_payments_mapped, int_orders_enriched

-- Layer 3: Marts (dimensional, consumer-facing)
-- fct_orders, dim_customers, dim_products

This three-layer pattern, popularized by dbt, takes the best of both worlds: the business-process focus of Kimball with the clear separation and auditability that Inmon advocates.

Choosing the Right Model

ScenarioRecommended Model
BI dashboards and ad hoc queriesStar schema (Kimball)
Multi-source enterprise integrationData Vault + star schema
Real-time analyticsDenormalized wide tables
ML feature storeWide, denormalized feature tables
Regulatory compliance and auditData Vault 2.0
Small team, fast iterationSimplified star schema via dbt

The right model depends on your team size, data complexity, compliance needs, and how quickly you need to deliver value. Start simple with star schemas and evolve toward Data Vault only when the complexity demands it.

On this page