ClusteredDistribution-spark技术分享

ClusteredDistribution

ClusteredDistribution is a Distribution that creates a HashPartitioning for the clustering expressions and a requested number of partitions.

ClusteredDistribution requires that the clustering expressions should not be empty (i.e. Nil).

ClusteredDistribution is created when the following physical operators are requested for a required child distribution:

MapGroupsExec, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec
Spark Structured Streaming’s FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec, StreamingSymmetricHashJoinExec, StreamingSymmetricHashJoinExec
SparkR’s FlatMapGroupsInRExec
PySpark’s FlatMapGroupsInPandasExec

ClusteredDistribution is used when:

DataSourcePartitioning, SinglePartition, HashPartitioning, and RangePartitioning are requested to satisfies
EnsureRequirements is requested to add an ExchangeCoordinator for Adaptive Query Execution



createPartitioning(numPartitions: Int): Partitioning

createPartitioning(numPartitions: Int): Partitioning

Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning creates a HashPartitioning for the clustering expressions and the input numPartitions.

createPartitioning reports an AssertionError when the number of partitions is not the input numPartitions.



This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions].

This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions].

ClusteredDistribution takes the following when created:

Note	`None` for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property with the default of `200` partitions).