关注 spark技术分享,
撸spark源码 玩spark最佳实践

ClusteredDistribution

ClusteredDistribution

ClusteredDistribution is a Distribution that creates a HashPartitioning for the clustering expressions and a requested number of partitions.

ClusteredDistribution requires that the clustering expressions should not be empty (i.e. Nil).

ClusteredDistribution is created when the following physical operators are requested for a required child distribution:

  • MapGroupsExec, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec

  • Spark Structured Streaming’s FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec, StreamingSymmetricHashJoinExec, StreamingSymmetricHashJoinExec

  • SparkR’s FlatMapGroupsInRExec

  • PySpark’s FlatMapGroupsInPandasExec

ClusteredDistribution is used when:

  • DataSourcePartitioning, SinglePartition, HashPartitioning, and RangePartitioning are requested to satisfies

  • EnsureRequirements is requested to add an ExchangeCoordinator for Adaptive Query Execution

createPartitioning Method

Note
createPartitioning is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning creates a HashPartitioning for the clustering expressions and the input numPartitions.

createPartitioning reports an AssertionError when the number of partitions is not the input numPartitions.

Creating ClusteredDistribution Instance

ClusteredDistribution takes the following when created:

  • Clustering expressions

  • Required number of partitions (default: None)

Note
None for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property with the default of 200 partitions).
赞(0) 打赏
未经允许不得转载:spark技术分享 » ClusteredDistribution
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏