ClusteredDistribution
ClusteredDistribution
is a Distribution that creates a HashPartitioning for the clustering expressions and a requested number of partitions.
ClusteredDistribution
requires that the clustering expressions should not be empty (i.e. Nil
).
ClusteredDistribution
is created when the following physical operators are requested for a required child distribution:
-
MapGroupsExec
, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec -
Spark Structured Streaming’s
FlatMapGroupsWithStateExec
,StateStoreRestoreExec
,StateStoreSaveExec
,StreamingDeduplicateExec
,StreamingSymmetricHashJoinExec
,StreamingSymmetricHashJoinExec
-
SparkR’s
FlatMapGroupsInRExec
-
PySpark’s
FlatMapGroupsInPandasExec
ClusteredDistribution
is used when:
-
DataSourcePartitioning
,SinglePartition
,HashPartitioning
, andRangePartitioning
are requested tosatisfies
-
EnsureRequirements
is requested to add an ExchangeCoordinator for Adaptive Query Execution
createPartitioning
Method
1 2 3 4 5 |
createPartitioning(numPartitions: Int): Partitioning |
Note
|
createPartitioning is part of Distribution Contract to create a Partitioning for a given number of partitions.
|
createPartitioning
creates a HashPartitioning
for the clustering expressions and the input numPartitions
.
createPartitioning
reports an AssertionError
when the number of partitions is not the input numPartitions
.
1 2 3 4 5 |
This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions]. |
Creating ClusteredDistribution Instance
ClusteredDistribution
takes the following when created:
-
Clustering expressions
Note
|
None for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property with the default of 200 partitions).
|