BQ Event Architecture
Designing a 1PB Event Architecture

Assumptions
1PB raw events
Billions of events/day
Real-time + batch analytics
100+ concurrent users
Layered Architecture
Layer 1: Raw Events
Table:
Design:
Partition by event_date
Cluster by user_id
Nested schema
No joins required
Keep raw immutable.
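The Layer 1 design above could be expressed as BigQuery DDL along these lines. This is only a sketch, rendered from Python so it can be inspected without a live project. The table name `analytics.raw_events` and every column are hypothetical, since the original does not name them. The `require_partition_filter` option is an extra assumption, added here to stop accidental full-table scans.

```python
def raw_events_ddl(table: str = "analytics.raw_events") -> str:
    """Render CREATE TABLE DDL for a partitioned, clustered raw-events table.

    The table name and all columns are illustrative assumptions.
    """
    return f"""CREATE TABLE IF NOT EXISTS {table} (
  event_id   STRING NOT NULL,
  event_date DATE NOT NULL,
  event_ts   TIMESTAMP NOT NULL,
  user_id    STRING NOT NULL,
  event_name STRING,
  -- Nested context travels with the event, so reads need no joins.
  device     STRUCT<os STRING, model STRING>,
  items      ARRAY<STRUCT<product_id STRING, quantity INT64>>
)
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (require_partition_filter = TRUE)"""


ddl = raw_events_ddl()
print(ddl)
```

Keeping the raw layer immutable then means writing to this table append-only and doing all corrections in the downstream layers.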
Layer 2: Enriched Events
Denormalize here.
Instead of joining at query time:
JOIN products
JOIN campaigns
flatten the product and campaign attributes into each event row during ingestion.
This reduces shuffle in later queries.
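A minimal sketch of flattening during ingestion, assuming the dimension data fits in memory as lookup dictionaries. The `PRODUCTS` and `CAMPAIGNS` contents and all field names are hypothetical.

```python
from typing import Any

# Hypothetical dimension lookups, loaded once per ingestion batch.
PRODUCTS = {"p1": {"product_name": "Trail Shoe", "category": "footwear"}}
CAMPAIGNS = {"c9": {"campaign_name": "spring_sale", "channel": "email"}}


def enrich(event: dict[str, Any]) -> dict[str, Any]:
    """Copy product and campaign attributes onto the event row."""
    enriched = dict(event)  # copy, so the raw event stays untouched
    enriched.update(PRODUCTS.get(event.get("product_id"), {}))
    enriched.update(CAMPAIGNS.get(event.get("campaign_id"), {}))
    return enriched


raw = {"event_id": "e1", "user_id": "u1", "product_id": "p1", "campaign_id": "c9"}
row = enrich(raw)
print(row)
```

Events with an unknown `product_id` or `campaign_id` pass through without the extra attributes, which is one possible policy; rejecting or quarantining them is another.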
Layer 3: Aggregation Tables
Create:
Daily user metrics
Session-level aggregates
Campaign performance rollups
BI tools query these tables and never hit the raw tables.
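The daily user metrics rollup might look roughly like the following, shown in plain Python to make the grouping logic explicit. In practice this would more likely be a scheduled query. The `revenue` field and the chosen metrics are assumptions.

```python
from collections import defaultdict


def daily_user_metrics(events):
    """Roll enriched events up to one row per (event_date, user_id)."""
    totals = defaultdict(lambda: {"events": 0, "revenue": 0.0})
    for e in events:
        key = (e["event_date"], e["user_id"])
        totals[key]["events"] += 1
        totals[key]["revenue"] += e.get("revenue", 0.0)
    return [
        {"event_date": d, "user_id": u, **m}
        for (d, u), m in sorted(totals.items())
    ]


events = [
    {"event_date": "2024-01-01", "user_id": "u1", "revenue": 10.0},
    {"event_date": "2024-01-01", "user_id": "u1"},
    {"event_date": "2024-01-01", "user_id": "u2", "revenue": 5.0},
]
rollup = daily_user_metrics(events)
print(rollup)
```

A dashboard reading `rollup` touches one row per user per day rather than every underlying event.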
Storage Strategy
At 1PB:
Partition by date (mandatory)
Consider multi-column clustering
Monitor partition size (avoid tiny partitions)
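Monitoring for tiny partitions can be as simple as comparing each partition's size to a floor. The sketch below assumes the sizes have already been fetched (for example from table metadata such as BigQuery's `INFORMATION_SCHEMA.PARTITIONS` view). The 1 GiB threshold is an arbitrary illustration, not a recommendation from the original.

```python
def tiny_partitions(sizes_bytes, min_bytes=1 * 1024**3):
    """Return the ids of partitions smaller than min_bytes, sorted."""
    return sorted(p for p, size in sizes_bytes.items() if size < min_bytes)


# Hypothetical partition sizes keyed by partition id (YYYYMMDD).
sizes = {
    "20240101": 3 * 1024**4,    # 3 TiB
    "20240102": 200 * 1024**2,  # 200 MiB, likely a partial load
    "20240103": 2 * 1024**4,    # 2 TiB
}
flagged = tiny_partitions(sizes)
print(flagged)
```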
Streaming vs Batch
Streaming:
Higher cost
Lower latency
Micro-partitioned storage
Batch load:
Cheaper
Better compression
For PB-scale:
→ Prefer batch ingestion where possible.
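One way to lean toward batch ingestion without giving up a continuous event feed is to buffer events and load them in groups. The class below is a sketch of that idea only. `load_fn` is a hypothetical callable standing in for whatever actually runs the batch load, and the batch size of 3 is chosen purely to keep the example small.

```python
class BatchBuffer:
    """Collect events and hand them to a load function in fixed-size batches."""

    def __init__(self, load_fn, max_rows=3):
        self.load_fn = load_fn  # hypothetical callable that runs a batch load
        self.max_rows = max_rows
        self.rows = []

    def add(self, event):
        self.rows.append(event)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.rows:
            self.load_fn(self.rows)
            self.rows = []  # start a fresh list; the loaded batch is not mutated


loads = []
buf = BatchBuffer(loads.append, max_rows=3)
for i in range(7):
    buf.add({"event_id": i})
buf.flush()  # load the final partial batch
print([len(batch) for batch in loads])
```

A production version would also flush on a timer so that a quiet period does not leave events stranded in the buffer.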
Schema Design Principles
Use nested/repeated fields.
Avoid snowflake schemas.
Avoid small dimension joins.
Denormalization reduces shuffle massively.
Biggest Cost Driver at 1PB
Not storage.
It’s shuffle-heavy ad-hoc joins on raw data.
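A rough way to see why ad-hoc joins on raw data hurt is to compare how many bytes a join has to read, and potentially shuffle, with and without a date filter. The arithmetic below assumes the 1PB is spread evenly over 365 daily partitions. Both the retention window and the even spread are assumptions made for illustration.

```python
PB = 10**15  # bytes in a decimal petabyte


def bytes_scanned(total_bytes, total_days, days_queried):
    """Bytes read when a date filter prunes all but days_queried partitions,
    assuming events are spread evenly across daily partitions."""
    return total_bytes * days_queried // total_days


full = bytes_scanned(PB, 365, 365)  # unfiltered join over the raw table
week = bytes_scanned(PB, 365, 7)    # same join restricted to one week
ratio = full / week
print(full, week, round(ratio, 1))
```

Under these assumptions an unfiltered join feeds roughly 52 times more data into the shuffle than a one-week slice, which is the gap the enriched and aggregated layers are meant to close.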
