

Case Study: Number of Partitions for groupBy Aggregation

Important

As fairly often happens in my life, right after I had described the discovery I found out I was wrong and the “Aha moment” was gone.

Until I thought about the issue again and took the shortest path possible. See Case 4 for the definitive solution.

I’m leaving the page with no changes in-between so you can read it and learn from my mistakes.

The goal of the case study is to fine-tune the number of partitions used for a groupBy aggregation.

Given the following 2-partition dataset, the task is to write a structured query so there are no empty partitions (or as few as possible).
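A minimal sketch of such a dataset in spark-shell (the names ids and groupingExpr, the concrete values, and the id % 2 grouping are illustrative assumptions; any 2-partition dataset with a handful of groups works the same way):

```scala
import org.apache.spark.sql.functions.count
import spark.implicits._

// A tiny dataset spread across exactly 2 partitions (values are illustrative).
val ids = spark.range(start = 0, end = 4, step = 1, numPartitions = 2)
assert(ids.rdd.getNumPartitions == 2)

// The grouping expression used throughout the cases below: only 2 groups.
val groupingExpr = $"id" % 2 as "group"
```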

Note

Spark SQL uses the spark.sql.shuffle.partitions configuration property to determine the number of partitions for aggregations and joins, which is 200 by default.

That often leads to an explosion of partitions that serves no purpose, yet does impact the performance of a query, since all these 200 tasks (one per partition) have to start and finish before you get the result.

Less is more, remember?
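You can check the current value of the property in spark-shell, e.g.:

```scala
// Returns the current value as a String; "200" unless you have changed it.
spark.conf.get("spark.sql.shuffle.partitions")
```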

Case 1: Default Number of Partitions — spark.sql.shuffle.partitions Property

This is the moment when you learn that sometimes relying on defaults may lead to poor performance.

Think about how many partitions the following query really requires.
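A sketch of such a query, continuing the spark-shell session from the setup above (the count aggregation is an assumption; any aggregation over groupingExpr exhibits the same behaviour):

```scala
// Case 1: a plain groupBy; the aggregation shuffles into
// spark.sql.shuffle.partitions (200 by default) partitions.
val q = ids
  .groupBy(groupingExpr)
  .agg(count($"id") as "count")

q.show
```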

You may have expected at most 2 partitions, given the number of groups.

Wrong!

When you execute the query you should see 200 or so partitions in use in the web UI.

Figure 1. Case 1’s Physical Plan with Default Number of Partitions
Note
The number of Succeeded Jobs is 5.

Case 2: Using repartition Operator

Let’s rewrite the query to use the repartition operator.

The repartition operator is indeed a step in the right direction, but it has to be used with caution as it may lead to an unnecessary shuffle (aka exchange in Spark SQL’s parlance).

Think about how many partitions the following query really requires.
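A sketch of the rewritten query, under the same assumptions as above:

```scala
// Case 2: repartition by the grouping expression before the aggregation.
// Without an explicit number of partitions, repartition also uses
// spark.sql.shuffle.partitions (200), so it adds a shuffle without
// reducing the number of partitions.
val q = ids
  .repartition(groupingExpr)
  .groupBy(groupingExpr)
  .agg(count($"id") as "count")

q.show
```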

You may have expected 2 partitions again?!

Wrong!

Compare the physical plans of the two queries and you will surely regret using the repartition operator in the latter, as you have caused an extra shuffle stage (!)
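One way to compare the plans is to run explain on each query right after defining it (the exact plan output depends on your Spark version):

```scala
// Exchange operators in the physical plan mark shuffle stages,
// so the extra shuffle shows up as an extra Exchange.
q.explain
```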

Case 3: Using repartition Operator With Explicit Number of Partitions

The discovery of the day is that the repartition operator accepts an additional parameter for…the number of partitions (!)

As a matter of fact, there are two variants of the repartition operator that accept the number of partitions, and the trick is to use the one that also takes partition expressions (which will be used for grouping as well as…hash partitioning).

Can you think of the number of partitions the following query uses? I’m sure you have guessed correctly!
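A sketch of the query with an explicit number of partitions, same assumptions as before:

```scala
// Case 3: repartition with both the target number of partitions (2)
// and the expression later used for grouping / hash partitioning.
val q = ids
  .repartition(2, groupingExpr)
  .groupBy(groupingExpr)
  .agg(count($"id") as "count")

q.show
```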

You may have expected 2 partitions again?!

Correct!

Congratulations! You are done.

Not quite. Read along!

Case 4: Remember spark.sql.shuffle.partitions Property? Set It Up Properly
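The gist of Case 4, sketched under the same assumptions as before: instead of repartitioning, simply set the property to the number of partitions you actually need and keep the plain groupBy.

```scala
// Case 4: set spark.sql.shuffle.partitions to the number of groups,
// so the groupBy aggregation shuffles into exactly that many partitions.
spark.conf.set("spark.sql.shuffle.partitions", 2)

val q = ids
  .groupBy(groupingExpr)
  .agg(count($"id") as "count")

q.show
```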

Figure 2. Case 4’s Physical Plan with Custom Number of Partitions
Note
The number of Succeeded Jobs is 2.

Congratulations! You are done now.
