GCP Data Architect Series - Part II
Dimensional modeling in BigQuery follows classic data warehouse principles (Kimball-style) — but you optimize for cloud-native, columnar, massively parallel processing.

1️⃣ Start With the Business, Not the Tables
Before touching BigQuery:
Identify business processes (orders, payments, shipments, marketing events, etc.)
Define grain clearly
“One row represents what?”
Example:
Orders fact → 1 row per order line
Web events fact → 1 row per event
If grain is unclear, everything downstream breaks.
2️⃣ Design the Star Schema
In BigQuery, I strongly favor star schemas over snowflake schemas because:
Joins are cheap in BigQuery
Storage is inexpensive
Simpler models perform well
BI tools work better with stars
Typical Structure
Fact tables → metrics + foreign keys
Dimension tables → descriptive attributes
Example:
- order_id
- order_date_key
- customer_key
- product_key
- quantity
- revenue
dim_customer
dim_product
dim_date
3️⃣ Fact Table Design in BigQuery
Best Practices
✔ Define Grain Explicitly
Never mix grains in the same fact table.
✔ Use Surrogate Keys (When Needed)
BigQuery doesn’t require integer keys for performance, but they help:
With Slowly Changing Dimensions
With model stability
With BI tool consistency
✔ Partition Large Fact Tables
Use:
PARTITION BY DATE(order_timestamp)
orPARTITION BY order_date
This reduces scan costs significantly.
✔ Cluster Smartly
Cluster by:
Foreign keys
Frequently filtered columns
Example:
CLUSTER BY customer_key, product_key
4️⃣ Dimension Table Design
Dimensions contain descriptive context.
Example:
Customer
Product
Location
Campaign
Sales rep
Slowly Changing Dimensions (SCD)
In BigQuery, I commonly implement:
? Type 1 (Overwrite)
For non-critical history.
? Type 2 (History Tracking)
Add:
effective_start_date
effective_end_date
is_current_flag
Because storage is cheap in BigQuery, Type 2 is very common.
5️⃣ Use Denormalization Strategically
BigQuery is columnar. So:
Wide tables are okay
Denormalized dimensions perform well
Avoid over-normalizing small dimensions
But:
Don’t duplicate large high-cardinality fields unnecessarily.
6️⃣ Handle BigQuery-Specific Considerations
? Nested & Repeated Fields
BigQuery supports nested data.
When to use:
JSON ingestion
Event-level semi-structured data
When NOT to use:
Core dimensional model exposed to BI users
→ Flatten for analytics layer
? Storage vs Compute Model
BigQuery charges per data scanned.
So:
Partition aggressively
Filter on partition columns
Avoid SELECT *
7️⃣ Layered Architecture Approach
I recommend 3 logical layers:
1️⃣ Raw / Bronze
Exact copy of source
Append-only
No transformations
2️⃣ Staging / Silver
Cleaned
Conformed data
Standardized formats
Surrogate keys added
3️⃣ Presentation / Gold
Star schemas
Business-friendly names
Ready for BI tools
8️⃣ Conformed Dimensions
If multiple fact tables exist:
Ensure shared dimensions use same keys
Centralize common attributes
Maintain a single source for dimensions like:
customer
date
product
This prevents metric inconsistencies.
9️⃣ Testing & Governance
In BigQuery, I implement:
Row count tests
Null checks on foreign keys
SCD validity checks
Referential integrity validation
Tools often used:
dbt
Dataform
Great Expectations
? Performance Optimization Patterns
✔ Materialized Views
For aggregated facts.
✔ Aggregate Fact Tables
For dashboard speed.
Example:
fact_orders_daily
fact_orders_monthly
✔ Precompute Heavy Metrics
Instead of calculating at query time.
1️⃣1️⃣ Handling Very Large Event Data
For event-heavy workloads:
Partition by event_date
Cluster by user_id
Store raw events nested
Build flattened fact table for BI
1️⃣2️⃣ Anti-Patterns to Avoid
? Snowflaking everything
? Mixing multiple grains in one fact
? Using natural keys for SCD Type 2
? No partitioning on billion-row tables
? Letting BI users query raw layer
Example Real-World Pattern
E-commerce model:
Facts
fact_orders
fact_payments
fact_shipments
fact_web_events
Dimensions
dim_customer (SCD2)
dim_product
dim_date
dim_campaign
dim_location
Final Philosophy
In BigQuery:
Simplicity + Partitioning + Clear Grain > Complex Normalization
Design for:
Query cost efficiency
Scalability
BI usability
Maintainability
Comments (0)
No comments yet.
