关注 spark技术分享,
撸spark源码 玩spark最佳实践

SortMergeJoinExec

SortMergeJoinExec Binary Physical Operator for Sort Merge Join

SortMergeJoinExec is a binary physical operator to execute a sort merge join.

ShuffledHashJoinExec is selected to represent a Join logical operator when JoinSelection execution planning strategy is executed for joins with left join keys that are orderable, i.e. that can be ordered (sorted).

Note

A join key is orderable when is of one of the following data types:

  • NullType

  • AtomicType (that represents all the available types except NullType, StructType, ArrayType, UserDefinedType, MapType, and ObjectType)

  • StructType with orderable fields

  • ArrayType of orderable type

  • UserDefinedType of orderable type

Therefore, a join key is not orderable when is of the following data type:

  • MapType

  • ObjectType

Note

spark.sql.join.preferSortMergeJoin is an internal configuration property and is enabled by default.

That means that JoinSelection execution planning strategy (and so Spark Planner) prefers sort merge join over shuffled hash join.

SortMergeJoinExec supports Java code generation (aka codegen) for inner and cross joins.

Tip

Enable DEBUG logging level for org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys logger to see the join condition and the left and right join keys.

Table 1. SortMergeJoinExec’s Performance Metrics
Key Name (in web UI) Description

numOutputRows

number of output rows

spark sql SortMergeJoinExec webui query details.png
Figure 1. SortMergeJoinExec in web UI (Details for Query)
Note
The prefix for variable names for SortMergeJoinExec operators in CodegenSupport-generated code is smj.

The output schema of a SortMergeJoinExec is…​FIXME

The outputPartitioning of a SortMergeJoinExec is…​FIXME

The outputOrdering of a SortMergeJoinExec is…​FIXME

The partitioning requirements of the input of a SortMergeJoinExec (aka child output distributions) are HashClusteredDistributions of left and right join keys.

Table 2. SortMergeJoinExec’s Required Child Output Distributions
Left Child Right Child

HashClusteredDistribution (per left join key expressions)

HashClusteredDistribution (per right join key expressions)

The ordering requirements of the input of a SortMergeJoinExec (aka child output ordering) is…​FIXME

Note
SortMergeJoinExec operator is chosen in JoinSelection execution planning strategy (after BroadcastHashJoinExec and ShuffledHashJoinExec physical join operators have not met the requirements).

Generating Java Source Code for Produce Path in Whole-Stage Code Generation — doProduce Method

Note
doProduce is part of CodegenSupport Contract to generate the Java source code for produce path in Whole-Stage Code Generation.

doProduce…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute…​FIXME

Creating SortMergeJoinExec Instance

SortMergeJoinExec takes the following when created:

赞(0) 打赏
未经允许不得转载:spark技术分享 » SortMergeJoinExec
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏