logo

GCP Data Architect Series - Part II

Dimensional modeling in BigQuery follows classic data warehouse principles (Kimball-style) — but you optimize for cloud-native, columnar, massively parallel processing.

AdminFollow
5 minFeb 28, 2026
Views - 24
GCP Data Architect Series - Part II

1️⃣ Start With the Business, Not the Tables

Before touching BigQuery:

  • Identify business processes (orders, payments, shipments, marketing events, etc.)

  • Define grain clearly

    “One row represents what?”

Example:

  • Orders fact → 1 row per order line

  • Web events fact → 1 row per event

If grain is unclear, everything downstream breaks.


2️⃣ Design the Star Schema

In BigQuery, I strongly favor star schemas over snowflake schemas because:

  • Joins are cheap in BigQuery

  • Storage is inexpensive

  • Simpler models perform well

  • BI tools work better with stars

Typical Structure

  • Fact tables → metrics + foreign keys

  • Dimension tables → descriptive attributes

Example:

 
fact_orders
  - order_id
  - order_date_key
  - customer_key
  - product_key
  - quantity
  - revenue

dim_customer
dim_product
dim_date
 

3️⃣ Fact Table Design in BigQuery

Best Practices

✔ Define Grain Explicitly

Never mix grains in the same fact table.

✔ Use Surrogate Keys (When Needed)

BigQuery doesn’t require integer keys for performance, but they help:

  • With Slowly Changing Dimensions

  • With model stability

  • With BI tool consistency

✔ Partition Large Fact Tables

Use:

  • PARTITION BY DATE(order_timestamp)
    or

  • PARTITION BY order_date

This reduces scan costs significantly.

✔ Cluster Smartly

Cluster by:

  • Foreign keys

  • Frequently filtered columns

Example:

 
PARTITION BY order_date
CLUSTER BY customer_key, product_key
 

4️⃣ Dimension Table Design

Dimensions contain descriptive context.

Example:

  • Customer

  • Product

  • Location

  • Campaign

  • Sales rep

Slowly Changing Dimensions (SCD)

In BigQuery, I commonly implement:

? Type 1 (Overwrite)

For non-critical history.

? Type 2 (History Tracking)

Add:

  • effective_start_date

  • effective_end_date

  • is_current_flag

Because storage is cheap in BigQuery, Type 2 is very common.


5️⃣ Use Denormalization Strategically

BigQuery is columnar. So:

  • Wide tables are okay

  • Denormalized dimensions perform well

  • Avoid over-normalizing small dimensions

But:

  • Don’t duplicate large high-cardinality fields unnecessarily.


6️⃣ Handle BigQuery-Specific Considerations

? Nested & Repeated Fields

BigQuery supports nested data.

When to use:

  • JSON ingestion

  • Event-level semi-structured data

When NOT to use:

  • Core dimensional model exposed to BI users
    → Flatten for analytics layer


? Storage vs Compute Model

BigQuery charges per data scanned.

So:

  • Partition aggressively

  • Filter on partition columns

  • Avoid SELECT *


7️⃣ Layered Architecture Approach

I recommend 3 logical layers:

1️⃣ Raw / Bronze

  • Exact copy of source

  • Append-only

  • No transformations

2️⃣ Staging / Silver

  • Cleaned

  • Conformed data

  • Standardized formats

  • Surrogate keys added

3️⃣ Presentation / Gold

  • Star schemas

  • Business-friendly names

  • Ready for BI tools


8️⃣ Conformed Dimensions

If multiple fact tables exist:

  • Ensure shared dimensions use same keys

  • Centralize common attributes

  • Maintain a single source for dimensions like:

    • customer

    • date

    • product

This prevents metric inconsistencies.


9️⃣ Testing & Governance

In BigQuery, I implement:

  • Row count tests

  • Null checks on foreign keys

  • SCD validity checks

  • Referential integrity validation

Tools often used:

  • dbt

  • Dataform

  • Great Expectations


? Performance Optimization Patterns

✔ Materialized Views

For aggregated facts.

✔ Aggregate Fact Tables

For dashboard speed.

Example:

  • fact_orders_daily

  • fact_orders_monthly

✔ Precompute Heavy Metrics

Instead of calculating at query time.


1️⃣1️⃣ Handling Very Large Event Data

For event-heavy workloads:

  • Partition by event_date

  • Cluster by user_id

  • Store raw events nested

  • Build flattened fact table for BI


1️⃣2️⃣ Anti-Patterns to Avoid

? Snowflaking everything
? Mixing multiple grains in one fact
? Using natural keys for SCD Type 2
? No partitioning on billion-row tables
? Letting BI users query raw layer


Example Real-World Pattern

E-commerce model:

Facts

  • fact_orders

  • fact_payments

  • fact_shipments

  • fact_web_events

Dimensions

  • dim_customer (SCD2)

  • dim_product

  • dim_date

  • dim_campaign

  • dim_location


Final Philosophy

In BigQuery:

Simplicity + Partitioning + Clear Grain > Complex Normalization

Design for:

  • Query cost efficiency

  • Scalability

  • BI usability

  • Maintainability

Comments (0)

No comments yet.

© Copyright 2024. All Rights Reserved by Learningdhara Community LLP