
ExpandExec

ExpandExec is…​FIXME

ExecutedCommandExec


ExecutedCommandExec Leaf Physical Operator for Command Execution

ExecutedCommandExec is a leaf physical operator for executing logical commands with side effects.

ExecutedCommandExec runs a command and caches the result in the sideEffectResult internal attribute.

Table 1. ExecutedCommandExec’s Methods
Method Description

doExecute

Executes the ExecutedCommandExec physical operator (and produces the result as an RDD of internal binary rows)

executeCollect

executeTake

executeToIterator
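
All four methods can be expressed in terms of the sideEffectResult attribute described below. A paraphrased sketch (not necessarily the exact source of your Spark version):

  protected override def doExecute(): RDD[InternalRow] =
    sqlContext.sparkContext.parallelize(sideEffectResult, 1)

  override def executeCollect(): Array[InternalRow] =
    sideEffectResult.toArray

  override def executeToIterator: Iterator[InternalRow] =
    sideEffectResult.toIterator

  override def executeTake(limit: Int): Array[InternalRow] =
    sideEffectResult.take(limit).toArray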

Executing Logical RunnableCommand and Caching Result As InternalRows — sideEffectResult Internal Lazy Attribute

sideEffectResult requests the RunnableCommand to run (which produces a Seq[Row]) and converts the result to Catalyst types using a Catalyst converter function for the schema.
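
A paraphrased sketch of the computation (assuming cmd is the underlying RunnableCommand; the exact code may differ across Spark versions):

  protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
    // run the command and convert every Row to an InternalRow using the schema-based converter
    val converter = CatalystTypeConverters.createToCatalystConverter(schema)
    cmd.run(sqlContext.sparkSession).map(converter(_).asInstanceOf[InternalRow])
  }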

Note
sideEffectResult is used when ExecutedCommandExec is requested for executeCollect, executeToIterator, executeTake, doExecute.

DebugExec


DebugExec Unary Physical Operator

DebugExec is a unary physical operator that…​FIXME

dumpStats Method

dumpStats…​FIXME

Note
dumpStats is used when…​FIXME

DataWritingCommandExec


DataWritingCommandExec Physical Operator

DataWritingCommandExec is a physical operator that is the execution environment for a DataWritingCommand logical command at execution time.

DataWritingCommandExec is created exclusively when BasicOperators execution planning strategy is requested to plan a DataWritingCommand logical command.

When requested for performance metrics, DataWritingCommandExec simply requests the DataWritingCommand for them.
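
A paraphrased one-line sketch of that delegation:

  override lazy val metrics: Map[String, SQLMetric] = cmd.metrics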

Table 1. DataWritingCommandExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

sideEffectResult

Collection of InternalRows (Seq[InternalRow]) that is the result of executing the DataWritingCommand (with the SparkPlan)

Used when DataWritingCommandExec is requested to executeCollect, executeToIterator, executeTake and doExecute
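
A paraphrased sketch (assuming cmd is the DataWritingCommand and child is the child SparkPlan; check the source of your Spark version):

  protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
    // run the command with the child physical plan and convert the resulting Rows to InternalRows
    val converter = CatalystTypeConverters.createToCatalystConverter(schema)
    cmd.run(sqlContext.sparkSession, child).map(converter(_).asInstanceOf[InternalRow])
  }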

Creating DataWritingCommandExec Instance

DataWritingCommandExec takes the following when created:

  • DataWritingCommand logical command

  • Child physical plan (SparkPlan)

Executing Physical Operator and Collecting Results — executeCollect Method

Note
executeCollect is part of the SparkPlan Contract to execute the physical operator and collect results.

executeCollect…​FIXME

executeToIterator Method

Note
executeToIterator is part of the SparkPlan Contract to…​FIXME.

executeToIterator…​FIXME

Taking First N UnsafeRows — executeTake Method

Note
executeTake is part of the SparkPlan Contract to take the first n UnsafeRows.

executeTake…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of the SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute simply requests the SQLContext for the SparkContext that is then requested to distribute (parallelize) the sideEffectResult (over 1 partition).
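
A paraphrased sketch:

  protected override def doExecute(): RDD[InternalRow] =
    sparkContext.parallelize(sideEffectResult, 1)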

DataSourceV2ScanExec


DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator to represent DataSourceV2Relation logical operators at execution time.

Note
A DataSourceV2Relation logical operator is created when…​FIXME

DataSourceV2ScanExec is a ColumnarBatchScan that supports vectorized batch decoding (when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled).

DataSourceV2ScanExec is also a DataSourceReaderHolder.

DataSourceV2ScanExec is created exclusively when DataSourceV2Strategy execution planning strategy is executed and finds a DataSourceV2Relation logical operator in a logical query plan.

DataSourceV2ScanExec gives the single input RDD as the only input RDD of internal rows (when WholeStageCodegenExec physical operator is executed).
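
A paraphrased sketch (inputRDDs is part of the CodegenSupport contract, inputRDD is the internal property described below):

  override def inputRDDs(): Seq[RDD[InternalRow]] = Seq(inputRDD)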

Table 1. DataSourceV2ScanExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

readerFactories

Collection of DataReaderFactory objects of UnsafeRows

Used when…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute…​FIXME

supportsBatch Property

Note
supportsBatch is part of ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (i.e. true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

Note
enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.
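
A paraphrased sketch of the check:

  override def supportsBatch: Boolean = reader match {
    case r: SupportsScanColumnarBatch if r.enableBatchRead() => true
    case _ => false
  }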

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

DataSourceV2ScanExec initializes the internal registries and counters.

Creating Input RDD of Internal Rows — inputRDD Internal Property

Note
inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.

inputRDD…​FIXME

Note
inputRDD is used when DataSourceV2ScanExec physical operator is requested for the input RDDs and to execute.

CoalesceExec


CoalesceExec Unary Physical Operator

CoalesceExec is a unary physical operator (i.e. with one child physical operator) to…​FIXME…​with numPartitions number of partitions and a child SparkPlan.

CoalesceExec represents a Repartition logical operator at execution time (when shuffle is disabled — see the BasicOperators execution planning strategy). When executed, it executes the child physical operator and calls coalesce on the result RDD (with shuffle disabled).
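
A minimal sketch of doExecute (the actual implementation may additionally handle an empty child RDD):

  protected override def doExecute(): RDD[InternalRow] =
    child.execute().coalesce(numPartitions, shuffle = false)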

Please note that since physical operators present themselves without the suffix Exec, CoalesceExec is the Coalesce in the Physical Plan section in the following example:
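
A minimal spark-shell illustration (the exact plan output depends on the Spark version):

  val q = spark.range(10).coalesce(numPartitions = 1)
  q.explain
  // == Physical Plan ==
  // Coalesce 1
  // +- *(1) Range (0, 10, step=1, splits=8)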

The output collection of Attribute matches the child's (since CoalesceExec is about changing the number of partitions, not the internal representation).

outputPartitioning returns SinglePartition when the input numPartitions is 1, and an UnknownPartitioning partitioning scheme otherwise.
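
Paraphrased sketches of the two properties:

  override def output: Seq[Attribute] = child.output

  override def outputPartitioning: Partitioning =
    if (numPartitions == 1) SinglePartition
    else UnknownPartitioning(numPartitions)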

CartesianProductExec

CartesianProductExec is…​FIXME

BroadcastNestedLoopJoinExec


BroadcastNestedLoopJoinExec Binary Physical Operator

BroadcastNestedLoopJoinExec is a binary physical operator (with two child left and right physical operators) that is created when the JoinSelection execution planning strategy finds a Join logical operator that meets either case:

Note
BroadcastNestedLoopJoinExec is the default physical operator when no other operators have matched selection requirements.
Note

canBuildRight join types are:

  • CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or Existence

canBuildLeft join types are:

  • CROSS, INNER, RIGHT OUTER
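
The join-type conditions in the note above can be sketched as the following checks (paraphrased from the JoinSelection strategy; CROSS and INNER correspond to InnerLike, and Existence to ExistenceJoin):

  private def canBuildRight(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
    case _ => false
  }

  private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | RightOuter => true
    case _ => false
  }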

Table 1. BroadcastNestedLoopJoinExec’s Performance Metrics

Key: numOutputRows
Name (in web UI): number of output rows

Figure 1. BroadcastNestedLoopJoinExec in web UI (Details for Query)
Table 2. BroadcastNestedLoopJoinExec’s Required Child Output Distributions

BuildLeft
  Left Child: BroadcastDistribution (uses IdentityBroadcastMode broadcast mode)
  Right Child: UnspecifiedDistribution

BuildRight
  Left Child: UnspecifiedDistribution
  Right Child: BroadcastDistribution (uses IdentityBroadcastMode broadcast mode)
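
Table 2 can be paraphrased as the following sketch of requiredChildDistribution:

  override def requiredChildDistribution: Seq[Distribution] = buildSide match {
    case BuildLeft =>
      BroadcastDistribution(IdentityBroadcastMode) :: UnspecifiedDistribution :: Nil
    case BuildRight =>
      UnspecifiedDistribution :: BroadcastDistribution(IdentityBroadcastMode) :: Nil
  }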

Creating BroadcastNestedLoopJoinExec Instance

BroadcastNestedLoopJoinExec takes the following when created:
