Google BigQuery Shuffling
How Shuffle Memory Pressure Causes Stage Reattempts

Shuffle is the most expensive operation in BigQuery.
It happens during:
JOINGROUP BYORDER BYDISTINCTWindow functions
? What Actually Happens Internally
During shuffle:
Data is hash-partitioned by key.
Each slot is responsible for a hash bucket.
All rows for a key must fit into memory of that slot.
Each slot has:
Fixed memory
Fixed CPU
Spill-to-disk capability (limited)
? Memory Pressure Scenario
Imagine:
FROM events
GROUP BY user_id
If:
1 user_id has 50% of all rows
That key hashes to one slot
That slot receives:
Huge data volume
Memory overload
Result:
Slot spills to disk
Performance degrades
If spill exceeds limits → stage fails
BigQuery reattempts the stage
? What Is a Stage Reattempt?
BigQuery:
Detects worker failure or OOM
Restarts the stage
May repartition data differently
Consumes more slots
This increases:
Slot-ms
Runtime
Cost
? Symptoms in Execution Plan
You’ll see:
Stage retry count > 1
Large shuffle bytes
High spill to disk
One worker significantly slower
? How to Prevent Shuffle Memory Pressure
✅ 1. Reduce cardinality before shuffle
Bad:
FROM huge_table
GROUP BY user_id
Better:
Pre-aggregate sessions first
Then group
✅ 2. Break large queries into steps
Instead of 1 giant query:
Create intermediate table
Reduce row width
Reduce row count
✅ 3. Avoid extreme skew keys
Comments (0)
No comments yet.
