Apr 26, 2026

When to Model Your Data (and When Not To)

Should you use Kimball or One Big Table? Learn the framework for deciding when data modeling is worth the cost, featuring a car rental case study and SCD Type 2 guide.

Having been part of teams building data platforms at different stages of maturity, I’ve noticed that the ones that struggled most had one thing in common: they either skipped modeling entirely or over-engineered it from day one. Both ended up rebuilding.

The teams that skipped modeling moved fast initially. Flat tables, ad-hoc queries, everyone writing their own joins. It worked until it didn’t:

  • The third dashboard defined “revenue” differently from the first two
  • A historical report showed the wrong numbers because someone updated a dimension value in the source
  • A new hire spent two weeks working out which of several overlapping tables was the real one — each built by a different team, none conformed

The teams that over-engineered it spent months building a pristine dimensional model before anyone could ask a question. That also broke down, in its own way:

  • Analysts sat idle waiting for the model to be ready before they could answer the first round of questions
  • A single new attribute rippled through conformed dimensions, SCD history, every joining fact, and the tests behind them. What takes minutes in OBT took a release cycle
  • By the time the model shipped, the business had moved on and the assumptions it encoded no longer matched reality

The question “should I model my data?” used to be about query performance. Joins were expensive. Denormalization saved compute. That argument has weakened. Modern engines (Spark, DuckDB, Databricks, Snowflake, BigQuery) handle complex joins at scale far more cheaply than they once did.

But the core question remains: when is the investment in modeling worth it?

The framework I keep coming back to is this:

Model when the cost of a wrong answer exceeds the cost of modeling.


The Landscape

There are several established approaches to data modeling. I’ve primarily worked with Kimball star schemas and One Big Table (OBT), so that’s where the deep analysis in this post comes from.

Kimball (Star Schema)

Organize data around business processes. Each process gets a fact table at a declared grain, where one row equals one event or transaction. Shared lookup data lives in dimension tables. Facts and dimensions join via surrogate keys. The star schema was designed for BI consumption, delivering consistent metrics across teams, clear business processes, and self-serve analytics.

One Big Table (OBT)

Denormalize everything into one wide table per business process. Pre-join all dimensions into the fact. OBT suits small teams, few sources, and rapid prototyping. It trades storage and redundancy for query simplicity and no joins at read time, which holds up best when one team owns both the pipeline and the questions, and the cost of a wrong answer is low.

Other Approaches

Inmon (3NF) builds one normalized enterprise warehouse first, then denormalized marts. Data Vault separates business keys, attributes, and relationships for full auditability. Activity Schema keeps raw events with minimal transformation. Each has its place: Inmon for large enterprise integration, Data Vault for regulated industries, Activity Schema for event-heavy products. I haven’t built production systems with them, so I’ll stay in the lane of what I’ve built.


A Car Rental Example

Let’s take a car rental company. Customers book vehicles online or at the counter, pick up from various locations, and return after one or more days. Pricing varies by vehicle class, day of week, and season. Bookings come through multiple channels: the company’s website, aggregators like Kayak or Google, corporate portals, and walk-in counter reservations.

The Starting Point: One Big Table

You start with the simplest thing that works. One wide table with every rental attribute denormalized. No joins. Fast to build. Anyone can query it immediately.

rentals_obt (one row per rental-day, dozens of columns, showing a subset):

| rental_id | customer_name | vehicle_class | condition_grade | channel | rental_amount | insurance_fee | days | is_cancelled | rental_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R001 | Alice | Premium SUV | A | hertz.com | 450 | 45 | 3 | false | 2026-01-10 |
| R001 | Alice | Premium SUV | A | hertz.com | 450 | 45 | 3 | false | 2026-01-11 |
| R001 | Alice | Premium SUV | A | hertz.com | 450 | 45 | 3 | false | 2026-01-12 |
| R002 | Bob | Compact | B | kayak | 200 | 20 | 2 | true | 2026-01-15 |
| R002 | Bob | Compact | B | kayak | 200 | 20 | 2 | true | 2026-01-16 |

Notice that rental R001 appears three times (one per day). Every column (customer name, channel, insurance fee) is repeated on each row. The vehicle was reclassified from “Standard SUV” to “Premium SUV” after a fleet review, but every row shows the current classification.

This works. For a while, it works really well. OBT deserves credit here. For small teams with few sources and a stable domain, it can remain the right choice indefinitely. The signal to switch is specific pain, not elapsed time.

Where It Breaks

Historical accuracy. The fleet team reclassifies a vehicle from “Standard SUV” to “Premium SUV” after upgrading its trim package. What happens in the OBT depends on how it’s built. If it’s a full-refresh, every historical rental now shows the premium classification, and last quarter’s revenue-by-class reports silently change. If it’s incremental, old rows keep “Standard SUV” and new rows get “Premium SUV,” so the same vehicle appears with two different classes in the same table. Either way, there’s no version tracking.
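To make the divergence concrete, here is a tiny Python sketch of the two refresh strategies. The rows and column names are hypothetical, mirroring the rentals_obt sample above:

```python
# Hypothetical OBT rows; the source system now says the vehicle is Premium SUV.
source_vehicle_class = "Premium SUV"  # current value after the fleet review

history = [
    {"rental_id": "R001", "rental_date": "2025-11-02", "vehicle_class": "Standard SUV"},
]

# Full refresh: every row is rebuilt from current source values -> history rewritten.
full_refresh = [dict(row, vehicle_class=source_vehicle_class) for row in history]

# Incremental: old rows untouched, new rows pick up the new value -> mixed classes.
incremental = history + [
    {"rental_id": "R002", "rental_date": "2026-01-10", "vehicle_class": source_vehicle_class},
]

print({r["vehicle_class"] for r in full_refresh})  # only the new class; old reports drift
print({r["vehicle_class"] for r in incremental})   # both classes for the same vehicle
```

Neither outcome is wrong code; both are the absence of version tracking.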

The fix: Type 2 Slowly Changing Dimension (SCD). Track vehicle versions over time with validity windows. Each row carries row_valid_from and row_valid_to, so you always know which attributes were in effect at any point in time:

SELECT
  r.rental_id,
  r.pickup_at,
  v.vehicle_class,
  v.condition_grade
FROM fct_rentals AS r
LEFT JOIN dim_vehicle_history AS v
  ON r.vehicle_id = v.vehicle_id
  AND r.pickup_at >= v.row_valid_from
  AND r.pickup_at < v.row_valid_to

This temporal join gives you the vehicle’s classification at the time the rental started. The fleet team can reclassify a vehicle ten times. Each rental still references the correct historical version.
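For readers who prefer to see the join semantics outside SQL, here is a minimal Python sketch of the same point-in-time lookup. The rows are illustrative, and the half-open window [row_valid_from, row_valid_to) matches the >= / < conditions in the SQL above:

```python
from datetime import date

# Illustrative Type 2 rows: one vehicle, two versions with validity windows.
dim_vehicle_history = [
    {"vehicle_id": "V100", "vehicle_class": "Standard SUV",
     "row_valid_from": date(2025, 6, 1), "row_valid_to": date(2026, 1, 1)},
    {"vehicle_id": "V100", "vehicle_class": "Premium SUV",
     "row_valid_from": date(2026, 1, 1), "row_valid_to": date(9999, 12, 31)},
]

def class_as_of(vehicle_id, pickup_at):
    """Return the vehicle class in effect at pickup time (point-in-time lookup)."""
    for row in dim_vehicle_history:
        if (row["vehicle_id"] == vehicle_id
                and row["row_valid_from"] <= pickup_at < row["row_valid_to"]):
            return row["vehicle_class"]
    return None  # no version covered that instant

print(class_as_of("V100", date(2025, 12, 15)))  # Standard SUV
print(class_as_of("V100", date(2026, 1, 10)))   # Premium SUV
```

The half-open window matters: a closed interval on both ends would make a rental picked up exactly at midnight on a version boundary match two versions at once.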

Multi-grain analysis. Operations wants rentals by channel this month (how many reservations came through each source). Finance wants revenue broken down by individual rental-days (daily yield by vehicle class). Same underlying data, but the grain is different. Operations needs one row per rental. Finance needs one row per rental-day.

An OBT forces you to pick one grain. If you go with the day grain, operations’ COUNT(*) returns 3x the actual rental count for a 3-day rental. They can work around this with COUNT(DISTINCT rental_id), but every query, every dashboard, every analyst has to remember to do it. Get it wrong once and the numbers are silently inflated. If you go with the rental grain, finance can’t see daily rate variations or compute utilization rates.
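The inflation is easy to demonstrate with the sample data (R001 over three days, R002 over two):

```python
# Day-grain rows keyed by rental_id, mirroring the rentals_obt sample above.
day_rows = ["R001", "R001", "R001", "R002", "R002"]

naive_count = len(day_rows)        # what COUNT(*) would return on the day grain
rental_count = len(set(day_rows))  # what COUNT(DISTINCT rental_id) would return

print(naive_count, rental_count)  # 5 2
```

Two rentals become five "rentals" the moment someone forgets the DISTINCT. Nothing errors; the number is just wrong.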

The fix: separate fact tables at declared grains. Each fact table answers one question, at one grain, unambiguously:

  • fct_rentals: one row per rental reservation (rental grain)
  • fct_rental_days: one row per rental-day (day grain)

Rental-level attributes like is_cancelled are denormalized into each rental-day row so you can filter at both grains. This is a deliberate pragmatic choice, not strict Kimball: the rental-level flag belongs on fct_rentals, but pushing a small number of such attributes onto the day-grain fact saves every downstream query from a join back to the reservation fact. Keep the list short and documented, or the day fact starts drifting toward another OBT.
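A sketch of how the day-grain fact can be derived from the rental grain, carrying the denormalized is_cancelled flag onto every day row. Field names here are illustrative, not a fixed schema:

```python
from datetime import date, timedelta

def explode_to_days(rental):
    """Expand one rental-grain row into day-grain rows. The rental-level
    is_cancelled flag is copied onto each day row (the deliberate
    denormalization described above), and the per-day rate is the rental
    amount net of the insurance fee, spread evenly across the days."""
    rate = (rental["rental_amount"] - rental["insurance_fee"]) / rental["days"]
    return [
        {
            "rental_id": rental["rental_id"],
            "rental_date": rental["pickup_date"] + timedelta(days=i),
            "daily_rate": rate,
            "is_cancelled": rental["is_cancelled"],
        }
        for i in range(rental["days"])
    ]

r001 = {"rental_id": "R001", "pickup_date": date(2026, 1, 10),
        "rental_amount": 450, "insurance_fee": 45, "days": 3, "is_cancelled": False}

for row in explode_to_days(r001):
    print(row)  # three rows, 2026-01-10 through 2026-01-12, daily_rate 135.0
```

In a real pipeline this would be a SQL model, but the shape of the transformation is the same: one pass from the declared rental grain to the declared day grain, so no query ever has to guess.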

Conformed dimensions across business processes. You have rentals, cancellations, and maintenance events. Three different business processes. All reference the same vehicle and location.

With OBT, you duplicate vehicle and location attributes across three wide tables. A vehicle gets reclassified from Standard to Premium. The rentals table gets refreshed in the next pipeline run, but the maintenance and cancellations tables don’t. Now three tables disagree on the same vehicle’s classification.

The fix: conformed dimensions. Define the dimension once; reference it from every fact table via a surrogate key. dim_vehicle_history lives in one place, and rentals, cancellations, and maintenance all join to the same table. A vehicle’s attributes (class, condition grade, license plate) have one definition, one source of truth, no drift.

Metric consistency. Three dashboards define “revenue” differently. One includes insurance fees. One excludes cancelled rentals. One mixes currencies because international locations report in local currency and nobody realized the amount column wasn’t converted.

No single source of truth. Each team writes their own CASE WHEN. The numbers don’t match. Leadership loses trust in the data.

The fix: governed measures on the fact table (and optionally a Semantic Layer on top). Declare measures once (rental_amount, insurance_fee, days) with documented definitions in the schema. Every downstream consumer reads the same column, with the same meaning.

Currency handling is a design decision of its own. The canonical dimensional pattern is to carry both amounts in the fact: local (the transaction as it happened) and standard (for cross-entity rollups), along with audit fields like exchange_rate_applied and exchange_rate_date. Where conversion runs (Silver, Gold, or a rate dimension) is a separate choice. The principle is that it happens once, with a governed rate source, not ad-hoc in every dashboard.
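As a sketch of the carry-both-amounts pattern (the rates, field names, and helper here are illustrative, not real FX data or a prescribed schema):

```python
from datetime import date

# Governed rate source: one place rates come from, keyed by currency and date.
# These are made-up illustration values.
governed_rates = {("EUR", date(2026, 1, 10)): 1.08}  # hypothetical EUR -> USD

def to_fact_row(rental_id, local_amount, local_ccy, txn_date, standard_ccy="USD"):
    """Convert once in the pipeline, keeping both amounts plus audit fields."""
    rate = 1.0 if local_ccy == standard_ccy else governed_rates[(local_ccy, txn_date)]
    return {
        "rental_id": rental_id,
        "amount_local": local_amount,           # the transaction as it happened
        "currency_local": local_ccy,
        "amount_standard": round(local_amount * rate, 2),  # for cross-entity rollups
        "exchange_rate_applied": rate,
        "exchange_rate_date": txn_date,
    }

row = to_fact_row("R003", 400.0, "EUR", date(2026, 1, 10))
print(row["amount_standard"])  # 432.0
```

Every dashboard then reads amount_standard; nobody re-derives a rate, and the audit fields make any conversion reproducible.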

A Semantic Layer takes this further by encoding the full business formula (revenue net of fees, cancelled rentals excluded) as a named metric. That’s the focus of the next post.

What We Ended Up With

Each of those fixes points at the same architecture: a medallion architecture with Kimball at the gold layer.

  • Bronze: raw ingestion, one-to-one with source tables
  • Silver: conforming, cleansing, currency conversion, SCD tracking
  • Gold: dimensional model with fact tables at declared grains, conformed dimensions, and surrogate keys

The gold layer is a star schema, and each fact gets its own star. dim_vehicle_history, dim_customer, and dim_channel appear in both stars: the same physical tables, referenced by both facts via identical surrogate keys. The bus matrix below makes that explicit.

Operations view: fct_rentals (one row per reservation)

erDiagram
    dim_vehicle_history ||--o{ fct_rentals : "vehicle_key_sk"
    dim_customer        ||--o{ fct_rentals : "customer_key_sk"
    dim_channel         ||--o{ fct_rentals : "channel_key_sk"
    dim_pickup_date     ||--o{ fct_rentals : "pickup_date_key_sk"

Finance view: fct_rental_days (one row per rental-day)

erDiagram
    dim_vehicle_history ||--o{ fct_rental_days : "vehicle_key_sk"
    dim_customer        ||--o{ fct_rental_days : "customer_key_sk"
    dim_channel         ||--o{ fct_rental_days : "channel_key_sk"
    dim_rental_date     ||--o{ fct_rental_days : "rental_date_key_sk"

The Kimball bus matrix, the canonical artifact for showing which dimensions are conformed across which facts:

| Fact table | dim_vehicle_history | dim_customer | dim_channel | dim_pickup_date | dim_rental_date |
| --- | --- | --- | --- | --- | --- |
| fct_rentals | ✓ | ✓ | ✓ | ✓ | |
| fct_rental_days | ✓ | ✓ | ✓ | | ✓ |

The first three columns are conformed: same surrogate keys, same attribute definitions, same Type 2 history, across both facts. The date dimensions differ because each fact grain anchors to a different event. fct_rentals anchors to the pickup date (when the reservation starts); fct_rental_days anchors to the individual rental date (one row per day consumed).

Here’s what each table contains:

fct_rentals (one row per rental):

| rental_key_sk | vehicle_key_sk | customer_key_sk | channel_key_sk | pickup_date_key_sk | pickup_at | rental_amount | insurance_fee | days | is_cancelled |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sk_001 | sk_veh_01v2 | sk_cst_01 | sk_ch_01 | 20260110 | 2026-01-10 09:15:00 | 450 | 45 | 3 | false |
| sk_002 | sk_veh_02v1 | sk_cst_02 | sk_ch_02 | 20260115 | 2026-01-15 14:30:00 | 200 | 20 | 2 | true |

fct_rental_days (one row per rental-day):

| rental_day_key_sk | vehicle_key_sk | customer_key_sk | channel_key_sk | rental_date_key_sk | daily_rate | is_cancelled |
| --- | --- | --- | --- | --- | --- | --- |
| sk_rd_001 | sk_veh_01v2 | sk_cst_01 | sk_ch_01 | 20260110 | 135 | false |
| sk_rd_002 | sk_veh_01v2 | sk_cst_01 | sk_ch_01 | 20260111 | 135 | false |
| sk_rd_003 | sk_veh_01v2 | sk_cst_01 | sk_ch_01 | 20260112 | 135 | false |

daily_rate is the per-day rental amount, net of the insurance fee: (rental_amount - insurance_fee) / days = (450 - 45) / 3 = 135. Whether the average daily rate (ADR) is computed net or gross of fees is a finance definition, not an arithmetic truth; here we follow the convention that fees are excluded, and daily_rate is the component ADR is computed from.

dim_vehicle_history (Type 2 SCD, tracks changes over time):

| vehicle_key_sk | vehicle_id | vehicle_class | condition_grade | row_valid_from | row_valid_to |
| --- | --- | --- | --- | --- | --- |
| sk_veh_01v1 | V100 | Standard SUV | B | 2025-06-01 | 2026-01-01 |
| sk_veh_01v2 | V100 | Premium SUV | A | 2026-01-01 | 9999-12-31 |
| sk_veh_02v1 | V200 | Compact | B | 2025-01-01 | 9999-12-31 |

dim_customer, dim_channel, dim_pickup_date, dim_rental_date: shared across both fact tables via surrogate keys. Table contents omitted for brevity; they follow the standard pattern of one row per business entity with a surrogate key and descriptive attributes.

Rental R001 now appears once in fct_rentals and three times in fct_rental_days. Each table serves its grain. The vehicle’s old classification (“Standard SUV”) is preserved in dim_vehicle_history alongside the current classification, so the temporal join returns the correct version for each rental.

The two approaches aren’t in competition. Medallion is a data flow pattern. Kimball is a modeling pattern. They’re complementary. Bronze and silver handle the “getting data clean” problem. Gold handles the “making data usable” problem.


The Framework: When to Model, When Not To

You Probably Don’t Need Formal Modeling When:

  • A single team owns both the pipeline and the questions
  • Questions are still in flux, you’re in discovery
  • Work is throwaway or time-boxed: POC, hackathon, one-off analysis
  • Cost of a wrong answer is low: internal dashboards that get eyeballed, not acted upon

You Probably Do Need Modeling When:

  • Multiple sources need to be conformed into a shared view
  • Multiple teams consume the same data, and metric consistency is non-negotiable
  • History must be preserved: auditing, compliance, point-in-time correctness
  • Cost of a wrong answer is high: financial reporting, customer-facing metrics, regulatory submissions

A word on what Kimball costs. The failure modes above make a strong case for modeling. But the framework only works if both sides of the trade-off are honest. Modeling costs aren’t one-time. They recur with every schema change, every new source, and every shift in business definition:

| Dimension | Kimball (star schema) | OBT |
| --- | --- | --- |
| Initial build | High: design facts, dimensions, surrogate keys, conformance rules | Low: one wide table |
| Schema change | High: coordinated update across facts, dimensions, downstream models, potential backfill | Low: add a column, refresh the table |
| New source onboarding | Requires conforming to shared dimensions across teams | Independent table per source |
| Metric consistency | Enforced by structure and declared measures | Enforced by convention (if at all) |
| Historical accuracy | Handled via Type 2 SCDs, at the cost of extra tables and join logic | Manual, or lost |
OBT’s real advantage is that schema changes stay cheap throughout the platform’s life, not just on day one. For a fast-moving product or an early-stage platform, that ongoing flexibility is a legitimate reason to stay on OBT longer than the failure modes above might suggest. The framework question remains: does the cost of a wrong answer exceed the cost of modeling? Cost of modeling includes ongoing maintenance, not just the initial build.


What’s Next

This post covers the “do I model, and how do I choose?” question. But there’s a growing argument that changes the calculus entirely: AI.

In the next post, I’ll explore why AI makes the case for modeling stronger, not weaker: how AI agents interact with your data at three distinct levels, the failure modes that emerge when structure is missing, and what a Semantic Layer adds on top of a dimensional model. The short version: as AI moves from assisting humans to acting autonomously, the cost of a wrong answer escalates, and modeling is what constrains the AI’s interpretation space.

If you’ve faced this decision on your team (whether you chose to model or chose not to), I’d like to hear what drove the choice. The best frameworks come from shared experience, not theory.