logo

Google BigQuery Shuffling

How Shuffle Memory Pressure Causes Stage Reattempts

AdminFollow
5 minFeb 28, 2026
Views - 16
Google BigQuery Shuffling

Shuffle is the most expensive operation in BigQuery.

It happens during:

  • JOIN

  • GROUP BY

  • ORDER BY

  • DISTINCT

  • Window functions


? What Actually Happens Internally

During shuffle:

  1. Data is hash-partitioned by key.

  2. Each slot is responsible for a hash bucket.

  3. All rows for a key must fit into memory of that slot.

Each slot has:

  • Fixed memory

  • Fixed CPU

  • Spill-to-disk capability (limited)


? Memory Pressure Scenario

Imagine:

 
SELECT user_id, COUNT(*)
FROM events
GROUP BY user_id
 

If:

  • 1 user_id has 50% of all rows

  • That key hashes to one slot

That slot receives:

  • Huge data volume

  • Memory overload

Result:

  • Slot spills to disk

  • Performance degrades

  • If spill exceeds limits → stage fails

  • BigQuery reattempts the stage


? What Is a Stage Reattempt?

BigQuery:

  • Detects worker failure or OOM

  • Restarts the stage

  • May repartition data differently

  • Consumes more slots

This increases:

  • Slot-ms

  • Runtime

  • Cost


? Symptoms in Execution Plan

You’ll see:

  • Stage retry count > 1

  • Large shuffle bytes

  • High spill to disk

  • One worker significantly slower


? How to Prevent Shuffle Memory Pressure

✅ 1. Reduce cardinality before shuffle

Bad:

 
SELECT user_id, COUNT(DISTINCT session_id)
FROM huge_table
GROUP BY user_id
 

Better:

  • Pre-aggregate sessions first

  • Then group


✅ 2. Break large queries into steps

Instead of 1 giant query:

  • Create intermediate table

  • Reduce row width

  • Reduce row count


✅ 3. Avoid extreme skew keys

Comments (0)

No comments yet.

© Copyright 2024. All Rights Reserved by Learningdhara Community LLP