
ExpandExec

ExpandExec is…​FIXME

ExecutedCommandExec


ExecutedCommandExec Leaf Physical Operator for Command Execution

ExecutedCommandExec is a leaf physical operator for executing logical commands with side effects.

ExecutedCommandExec runs a command and caches the result in the sideEffectResult internal attribute.

Table 1. ExecutedCommandExec’s Methods
Method Description

doExecute

Executes the ExecutedCommandExec physical operator (and produces the result as an RDD of internal binary rows)

executeCollect

executeTake

executeToIterator
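
All four methods can be expressed in terms of the sideEffectResult attribute described below. A paraphrased sketch (not necessarily the exact source of your Spark version):

  protected override def doExecute(): RDD[InternalRow] =
    sqlContext.sparkContext.parallelize(sideEffectResult, 1)

  override def executeCollect(): Array[InternalRow] =
    sideEffectResult.toArray

  override def executeToIterator: Iterator[InternalRow] =
    sideEffectResult.toIterator

  override def executeTake(limit: Int): Array[InternalRow] =
    sideEffectResult.take(limit).toArray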

Executing Logical RunnableCommand and Caching Result As InternalRows — sideEffectResult Internal Lazy Attribute

sideEffectResult requests the RunnableCommand to run (which produces a Seq[Row]) and converts the result to Catalyst types using a Catalyst converter function for the schema.
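
A paraphrased sketch of the computation (assuming cmd is the underlying RunnableCommand; the exact code may differ across Spark versions):

  protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
    // run the command and convert every Row to an InternalRow using the schema-based converter
    val converter = CatalystTypeConverters.createToCatalystConverter(schema)
    cmd.run(sqlContext.sparkSession).map(converter(_).asInstanceOf[InternalRow])
  }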

Note
sideEffectResult is used when ExecutedCommandExec is requested for executeCollect, executeToIterator, executeTake, doExecute.

DebugExec


DebugExec Unary Physical Operator

DebugExec is a unary physical operator that…​FIXME

dumpStats Method

dumpStats…​FIXME

Note
dumpStats is used when…​FIXME

DataWritingCommandExec


DataWritingCommandExec Physical Operator

DataWritingCommandExec is a physical operator that is the execution environment for a DataWritingCommand logical command at execution time.

DataWritingCommandExec is created exclusively when BasicOperators execution planning strategy is requested to plan a DataWritingCommand logical command.

When requested for performance metrics, DataWritingCommandExec simply requests the DataWritingCommand for them.
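
A paraphrased one-line sketch of that delegation:

  override lazy val metrics: Map[String, SQLMetric] = cmd.metrics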

Table 1. DataWritingCommandExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

sideEffectResult

Collection of InternalRows (Seq[InternalRow]) that is the result of executing the DataWritingCommand (with the SparkPlan)

Used when DataWritingCommandExec is requested to executeCollect, executeToIterator, executeTake and doExecute
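
A paraphrased sketch (assuming cmd is the DataWritingCommand and child is the child SparkPlan; check the source of your Spark version):

  protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
    // run the command with the child physical plan and convert the resulting Rows to InternalRows
    val converter = CatalystTypeConverters.createToCatalystConverter(schema)
    cmd.run(sqlContext.sparkSession, child).map(converter(_).asInstanceOf[InternalRow])
  }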

Creating DataWritingCommandExec Instance

DataWritingCommandExec takes the following when created:

  • DataWritingCommand logical command

  • Child physical plan (SparkPlan)

Executing Physical Operator and Collecting Results — executeCollect Method

Note
executeCollect is part of the SparkPlan Contract to execute the physical operator and collect results.

executeCollect…​FIXME

executeToIterator Method

Note
executeToIterator is part of the SparkPlan Contract to…​FIXME.

executeToIterator…​FIXME

Taking First N UnsafeRows — executeTake Method

Note
executeTake is part of the SparkPlan Contract to take the first n UnsafeRows.

executeTake…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of the SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute simply requests the SQLContext for the SparkContext that is then requested to distribute (parallelize) the sideEffectResult (over 1 partition).
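
A paraphrased sketch:

  protected override def doExecute(): RDD[InternalRow] =
    sparkContext.parallelize(sideEffectResult, 1)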

DataSourceV2ScanExec


DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator to represent DataSourceV2Relation logical operators at execution time.

Note
A DataSourceV2Relation logical operator is created when…​FIXME

DataSourceV2ScanExec is a ColumnarBatchScan that supports vectorized batch decoding (when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled).

DataSourceV2ScanExec is also a DataSourceReaderHolder.

DataSourceV2ScanExec is created exclusively when DataSourceV2Strategy execution planning strategy is executed and finds a DataSourceV2Relation logical operator in a logical query plan.

DataSourceV2ScanExec gives the single input RDD as the only input RDD of internal rows (when WholeStageCodegenExec physical operator is executed).
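
A paraphrased sketch (inputRDDs is part of the CodegenSupport contract, inputRDD is the internal property described below):

  override def inputRDDs(): Seq[RDD[InternalRow]] = Seq(inputRDD)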

Table 1. DataSourceV2ScanExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

readerFactories

Collection of DataReaderFactory objects of UnsafeRows

Used when…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute…​FIXME

supportsBatch Property

Note
supportsBatch is part of ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (i.e. true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

Note
enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.
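
A paraphrased sketch of the check:

  override def supportsBatch: Boolean = reader match {
    case r: SupportsScanColumnarBatch if r.enableBatchRead() => true
    case _ => false
  }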

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

DataSourceV2ScanExec initializes the internal registries and counters.

Creating Input RDD of Internal Rows — inputRDD Internal Property

Note
inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.

inputRDD…​FIXME

Note
inputRDD is used when DataSourceV2ScanExec physical operator is requested for the input RDDs and to execute.

CoalesceExec


CoalesceExec Unary Physical Operator

CoalesceExec is a unary physical operator (i.e. with one child physical operator) to…​FIXME…​with numPartitions number of partitions and a child SparkPlan.

CoalesceExec represents a Repartition logical operator at execution time (when shuffle is disabled — see the BasicOperators execution planning strategy). When executed, it executes the child physical operator and calls coalesce on the result RDD (with shuffle disabled).
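
A minimal sketch of doExecute (the actual implementation may additionally handle an empty child RDD):

  protected override def doExecute(): RDD[InternalRow] =
    child.execute().coalesce(numPartitions, shuffle = false)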

Please note that since physical operators present themselves without the suffix Exec, CoalesceExec is the Coalesce in the Physical Plan section in the following example:
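
A minimal spark-shell illustration (the exact plan output depends on the Spark version):

  val q = spark.range(10).coalesce(numPartitions = 1)
  q.explain
  // == Physical Plan ==
  // Coalesce 1
  // +- *(1) Range (0, 10, step=1, splits=8)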

The output collection of Attribute matches the child's (since CoalesceExec is about changing the number of partitions, not the internal representation).

outputPartitioning returns SinglePartition when the input numPartitions is 1, and an UnknownPartitioning partitioning scheme otherwise.
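
Paraphrased sketches of the two properties:

  override def output: Seq[Attribute] = child.output

  override def outputPartitioning: Partitioning =
    if (numPartitions == 1) SinglePartition
    else UnknownPartitioning(numPartitions)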

CartesianProductExec

CartesianProductExec is…​FIXME

BroadcastNestedLoopJoinExec


BroadcastNestedLoopJoinExec Binary Physical Operator

BroadcastNestedLoopJoinExec is a binary physical operator (with two child left and right physical operators) that is created when the JoinSelection execution planning strategy finds a Join logical operator that meets either case:

Note
BroadcastNestedLoopJoinExec is the default physical operator when no other operators have matched selection requirements.
Note

canBuildRight join types are:

  • CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or Existence

canBuildLeft join types are:

  • CROSS, INNER, RIGHT OUTER
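
The join-type conditions in the note above can be sketched as the following checks (paraphrased from the JoinSelection strategy; CROSS and INNER correspond to InnerLike, and Existence to ExistenceJoin):

  private def canBuildRight(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
    case _ => false
  }

  private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | RightOuter => true
    case _ => false
  }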

Table 1. BroadcastNestedLoopJoinExec’s Performance Metrics

Key: numOutputRows
Name (in web UI): number of output rows

Figure 1. BroadcastNestedLoopJoinExec in web UI (Details for Query)
Table 2. BroadcastNestedLoopJoinExec’s Required Child Output Distributions

BuildLeft
  Left Child: BroadcastDistribution (uses IdentityBroadcastMode broadcast mode)
  Right Child: UnspecifiedDistribution

BuildRight
  Left Child: UnspecifiedDistribution
  Right Child: BroadcastDistribution (uses IdentityBroadcastMode broadcast mode)
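
Table 2 can be paraphrased as the following sketch of requiredChildDistribution:

  override def requiredChildDistribution: Seq[Distribution] = buildSide match {
    case BuildLeft =>
      BroadcastDistribution(IdentityBroadcastMode) :: UnspecifiedDistribution :: Nil
    case BuildRight =>
      UnspecifiedDistribution :: BroadcastDistribution(IdentityBroadcastMode) :: Nil
  }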

Creating BroadcastNestedLoopJoinExec Instance

BroadcastNestedLoopJoinExec takes the following when created:
