Designing a 10PB Architecture

At 10PB, the problem shifts from query optimization to platform control.
The biggest risks are:
Unbounded shuffle cost
Skew amplification
Analyst misuse of raw tables
Cross-region sprawl
Slot starvation
Governance failure
10PB Design Principles
1. Strict Layering (Non-Negotiable)
Raw → Curated → Aggregated → Serving
Users never touch raw.
Raw Layer
↓
Enriched Wide Tables (denormalized)
↓
Domain Aggregates (daily/hourly rollups)
↓
Serving Layer (BI / ML)
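The layering rule can be sketched as a small access map. This is only an illustration: the role names (`pipeline`, `analyst`, `dashboard`) and the layers each role may read are assumptions, not part of the original design; the one constraint taken from the text is that users never read the raw layer.

```python
# Layer order taken from the text: Raw -> Curated -> Aggregated -> Serving.
LAYER_ORDER = ["raw", "curated", "aggregated", "serving"]

# Hypothetical roles and the layers each may read (assumption for illustration).
READABLE_LAYERS = {
    "pipeline": {"raw", "curated", "aggregated"},
    "analyst": {"curated", "aggregated", "serving"},
    "dashboard": {"serving"},
}

def can_read(role, layer):
    """Return True if the role is allowed to query the given layer."""
    if layer not in LAYER_ORDER:
        raise ValueError(f"unknown layer: {layer}")
    return layer in READABLE_LAYERS.get(role, set())

def upstream_of(layer):
    """Return the layer that feeds this one, or None for the raw layer."""
    i = LAYER_ORDER.index(layer)
    return LAYER_ORDER[i - 1] if i > 0 else None

print(can_read("analyst", "raw"))      # False: users never touch raw
print(upstream_of("serving"))          # aggregated
```

Keeping the map in one place makes the "never touch raw" rule something a pipeline or review step can check rather than a convention.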
Raw Layer Design (10PB scale)
Partition by event_date
Cluster by the highest-cardinality key that queries filter on (e.g., user_id)
Nested schema (avoid joins)
Column pruning mandatory
Never allow ad-hoc queries here.
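As a rough sketch, the raw-layer rules above could be captured in a helper that emits table DDL. It assumes BigQuery-style DDL (the text mentions slots and reservations but never names the engine); the table name `raw.events`, the column list, and the `require_partition_filter` option are illustrative choices, not taken from the original.

```python
def raw_table_ddl(table, columns, partition_col="event_date", cluster_cols=("user_id",)):
    """Build a CREATE TABLE statement partitioned by date and clustered by key."""
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n{cols}\n)\n"
        f"PARTITION BY {partition_col}\n"
        f"CLUSTER BY {', '.join(cluster_cols)}\n"
        # Assumption: reject queries that do not filter on the partition column.
        "OPTIONS (require_partition_filter = TRUE);"
    )

# Hypothetical schema: nested fields keep related data in one row (no joins).
ddl = raw_table_ddl(
    "raw.events",
    [
        ("event_date", "DATE"),
        ("user_id", "STRING"),
        ("device", "STRUCT<os STRING, model STRING>"),
        ("items", "ARRAY<STRUCT<sku STRING, qty INT64>>"),
    ],
)
print(ddl)
```

Requiring a partition filter is one way to make full-table scans fail fast instead of relying on reviewers to catch them.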
Shuffle Minimization Strategy
At 10PB:
The only safe shuffle is the one you never perform.
Strategies:
Pre-join dimension tables at ingestion
Pre-aggregate frequently queried metrics
Use materialized views for common rollups
Replace exact COUNT(DISTINCT) with approximate HyperLogLog (HLL) sketches where a small error is acceptable
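Why HLL helps with shuffle can be shown with a toy HyperLogLog. This is a minimal textbook-style sketch, not the implementation any particular warehouse uses; the precision `P = 12` and the synthetic user IDs are assumptions. The point is that per-day sketches can be merged into a multi-day distinct count without revisiting raw rows, whereas per-day exact counts cannot simply be added.

```python
import hashlib
import math

P = 12        # precision bits (assumed); 2**12 = 4096 registers
M = 1 << P

def _hash64(value):
    """Deterministic 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")

def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits choose the register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1   # 1-based position of the first 1-bit
    sketch[idx] = max(sketch[idx], rank)

def sketch_of(values):
    sketch = [0] * M
    for v in values:
        add(sketch, v)
    return sketch

def merge(a, b):
    """Union of two sketches: element-wise max, no raw rows needed."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)        # small-range correction
    return raw

# Synthetic data: 30,000 users per day, 10,000 of them active on both days.
day1_users = [f"u{i}" for i in range(0, 30_000)]
day2_users = [f"u{i}" for i in range(20_000, 50_000)]

day1, day2 = sketch_of(day1_users), sketch_of(day2_users)
both = sketch_of(day1_users + day2_users)

naive_sum = len(set(day1_users)) + len(set(day2_users))   # double-counts the overlap
merged_estimate = estimate(merge(day1, day2))             # approximates the true 50,000
print(naive_sum, round(merged_estimate))
```

Merging two daily sketches yields exactly the registers of a sketch built over both days, so a rollup table can store one small sketch per day and answer arbitrary date-range distinct counts from it.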
Slot Architecture
At 10PB, you’ll typically need:
| Workload | Slots |
|---|---|
| ETL | 20k–50k |
| BI | 5k–10k |
| ML | 5k–15k |
Separate reservations are mandatory.
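For capacity planning, the table above can be written down as data and summed. The slot ranges are the ones quoted in the table; the dictionary layout and function name are illustrative assumptions.

```python
# Slot ranges per workload, copied from the table (min, max).
RESERVATIONS = {
    "etl": (20_000, 50_000),
    "bi": (5_000, 10_000),
    "ml": (5_000, 15_000),
}

def total_slot_range(reservations):
    """Sum the low and high ends across all separate reservations."""
    low = sum(lo for lo, _ in reservations.values())
    high = sum(hi for _, hi in reservations.values())
    return low, high

low, high = total_slot_range(RESERVATIONS)
print(f"Total capacity to plan for: {low:,} to {high:,} slots")
# Total capacity to plan for: 30,000 to 75,000 slots
```

Keeping one entry per workload mirrors the requirement that each workload has its own reservation, so a heavy ETL run cannot starve BI or ML.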
Metadata Strategy
Use:
Data catalog tagging
Column-level governance
Query access controls
At this scale, governance reduces cost more than tuning.
