
DataSourceV2Strategy


DataSourceV2Strategy Execution Planning Strategy

DataSourceV2Strategy is an execution planning strategy that Spark Planner uses to FIXME.

Applying DataSourceV2Strategy Strategy to Logical Plan (Executing DataSourceV2Strategy) — apply Method

Note
apply is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply…​FIXME

DataSourceStrategy


DataSourceStrategy Execution Planning Strategy

DataSourceStrategy is an execution planning strategy (of SparkPlanner) that plans LogicalRelation logical operators as RowDataSourceScanExec physical operators (possibly under FilterExec and ProjectExec operators).

Table 1. DataSourceStrategy’s Selection Requirements (in execution order)

LogicalRelation with a CatalystScan relation
  Uses pruneFilterProjectRaw (with the RDD conversion to RDD[InternalRow] as part of scanBuilder).
  CatalystScan does not seem to be used in Spark SQL.

LogicalRelation with a PrunedFilteredScan relation
  Uses pruneFilterProject (with the RDD conversion to RDD[InternalRow] as part of scanBuilder).
  Matches JDBCRelation exclusively.

LogicalRelation with a PrunedScan relation
  Uses pruneFilterProject (with the RDD conversion to RDD[InternalRow] as part of scanBuilder).
  PrunedScan does not seem to be used in Spark SQL.

LogicalRelation with a TableScan relation
  Creates a RowDataSourceScanExec directly (requesting the TableScan to buildScan followed by the RDD conversion to RDD[InternalRow]).
  Matches KafkaRelation exclusively.
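To see the TableScan path in action without Kafka, you can write a toy data source. The following is a minimal sketch (the toy package and the ToyRelation class name are made up for the illustration); once compiled onto the classpath, DataSourceStrategy plans a scan over it as a RowDataSourceScanExec.

package toy

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A relation that only supports full table scans (the TableScan contract)
class ToyRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("id", LongType) :: Nil)
  // buildScan gives an RDD[Row] that DataSourceStrategy converts to RDD[InternalRow]
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1L), Row(2L)))
}

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = new ToyRelation(sqlContext)
}

With the classes above on the classpath, spark.read.format("toy").load.explain should print a physical plan whose only scan node is over ToyRelation, i.e. a RowDataSourceScanExec.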

Note
DataSourceStrategy uses PhysicalOperation Scala extractor object to destructure a logical query plan.

pruneFilterProject Internal Method

pruneFilterProject simply calls pruneFilterProjectRaw with scanBuilder ignoring the Seq[Expression] input parameter.

Note
pruneFilterProject is used when DataSourceStrategy execution planning strategy is executed (for LogicalRelation logical operators with a PrunedFilteredScan or a PrunedScan).

Selecting Catalyst Expressions Convertible to Data Source Filter Predicates (and Handled by BaseRelation) — selectFilters Method

selectFilters builds a map of Catalyst predicate expressions (from the input predicates) that can be translated to a data source filter predicate.

selectFilters then requests the input BaseRelation for unhandled filters (out of the convertible ones that selectFilters built the map with).

In the end, selectFilters returns a 3-element tuple with the following:

  1. Inconvertible and unhandled Catalyst predicate expressions

  2. All converted data source filters

  3. Pushed-down data source filters (that the input BaseRelation can handle)
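The following is a minimal sketch of that bookkeeping (with a simplified signature; translate stands in for the translateFilter method described below):

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.sources.{BaseRelation, Filter}

def selectFilters(
    relation: BaseRelation,
    predicates: Seq[Expression],
    translate: Expression => Option[Filter])
  : (Seq[Expression], Seq[Filter], Set[Filter]) = {

  // Catalyst predicates that can be translated to data source filters
  val translated: Map[Expression, Filter] =
    predicates.flatMap(p => translate(p).map(f => p -> f)).toMap

  // ask the relation which of the converted filters it cannot handle itself
  val unhandledFilters = relation.unhandledFilters(translated.values.toArray).toSet
  val pushedFilters = translated.values.toSet -- unhandledFilters

  // inconvertible predicates plus convertible-but-unhandled ones
  val unhandledPredicates = predicates.filter { p =>
    translated.get(p).forall(unhandledFilters.contains)
  }

  (unhandledPredicates, translated.values.toSeq, pushedFilters)
}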

Note
selectFilters is used exclusively when DataSourceStrategy execution planning strategy is requested to create a RowDataSourceScanExec physical operator (possibly under FilterExec and ProjectExec operators), i.e. when DataSourceStrategy is executed or pruneFilterProject is called.

Translating Catalyst Expression Into Data Source Filter Predicate — translateFilter Method

translateFilter translates a Catalyst expression into a corresponding Filter predicate if possible. If not, translateFilter returns None.

Table 2. translateFilter’s Conversions

Catalyst Expression     Filter Predicate
EqualTo                 EqualTo
EqualNullSafe           EqualNullSafe
GreaterThan             GreaterThan
LessThan                LessThan
GreaterThanOrEqual      GreaterThanOrEqual
LessThanOrEqual         LessThanOrEqual
InSet                   In
In                      In
IsNull                  IsNull
IsNotNull               IsNotNull
And                     And
Or                      Or
Not                     Not
StartsWith              StringStartsWith
EndsWith                StringEndsWith
Contains                StringContains

Note
The Catalyst expressions and their corresponding data source filter predicates have the same names in most cases but belong to different Scala packages, i.e. org.apache.spark.sql.catalyst.expressions and org.apache.spark.sql.sources, respectively.
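A quick spark-shell illustration of the collision: both types are called EqualTo, so renaming imports are needed to use them side by side.

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Literal, EqualTo => CatalystEqualTo}
import org.apache.spark.sql.sources.{EqualTo => SourceEqualTo}

// a Catalyst predicate expression over child expressions...
val catalystPredicate = CatalystEqualTo(UnresolvedAttribute("id"), Literal(1))
// ...and the corresponding data source filter over a column name and a value
val sourceFilter = SourceEqualTo("id", 1)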
Note
translateFilter is used when…​FIXME

RDD Conversion (Converting RDD of Rows to Catalyst RDD of InternalRows) — toCatalystRDD Internal Method

toCatalystRDD comes in two variants: one takes a LogicalRelation, its output attributes (Seq[Attribute]) and an RDD[Row], while the other takes just a LogicalRelation and an RDD[Row] and calls the former with the output of the LogicalRelation.

toCatalystRDD branches off per the needConversion flag of the BaseRelation of the input LogicalRelation.

When enabled (true), toCatalystRDD converts the objects inside Rows to Catalyst types.

Note
needConversion flag is enabled (true) by default.

Otherwise, toCatalystRDD simply casts the input RDD[Row] to a RDD[InternalRow] (as a simple unchecked type conversion using Scala’s asInstanceOf operator).
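The branching can be sketched as follows (a simplified signature; the sketch relies on Spark’s internal CatalystTypeConverters utility, so treat it as illustrative rather than supported API):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.types.StructType

def toCatalystRDD(needConversion: Boolean, schema: StructType, rdd: RDD[Row]): RDD[InternalRow] =
  if (needConversion) {
    // convert the objects inside every Row to their Catalyst counterparts
    rdd.mapPartitions { rows =>
      val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
      rows.map(row => toCatalyst(row).asInstanceOf[InternalRow])
    }
  } else {
    // the rows are already Catalyst-compatible, so an unchecked cast suffices
    rdd.asInstanceOf[RDD[InternalRow]]
  }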

Note
toCatalystRDD is used when DataSourceStrategy execution planning strategy is executed (for all kinds of BaseRelations).

Creating RowDataSourceScanExec Physical Operator for LogicalRelation (Possibly Under FilterExec and ProjectExec Operators) — pruneFilterProjectRaw Internal Method

pruneFilterProjectRaw converts a LogicalRelation leaf logical operator into a RowDataSourceScanExec leaf physical operator (possibly as a child of FilterExec and ProjectExec unary physical operators).

Note
pruneFilterProjectRaw is almost like SparkPlanner.pruneFilterProject.

Internally, pruneFilterProjectRaw splits the input filterPredicates expressions to select the Catalyst expressions that can be converted to data source filter predicates (and handled by the BaseRelation of the LogicalRelation).

pruneFilterProjectRaw combines all expressions that are neither convertible to data source filters nor can be handled by the relation using And binary expression (that creates a so-called filterCondition that will eventually be used to create a FilterExec physical operator if non-empty).
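That combining step is tiny but worth seeing; a minimal sketch:

import org.apache.spark.sql.catalyst.expressions.{And, Expression}

// reduceLeftOption yields None when nothing is left over, in which case
// no FilterExec operator is added at all
def toFilterCondition(leftoverPredicates: Seq[Expression]): Option[Expression] =
  leftoverPredicates.reduceLeftOption(And)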

pruneFilterProjectRaw creates a RowDataSourceScanExec leaf physical operator.

If column pruning alone yields the right projection and the projected columns are enough to evaluate all filter conditions, pruneFilterProjectRaw creates a FilterExec unary physical operator (with the unhandled predicate expressions and the RowDataSourceScanExec leaf physical operator as the child).

Note
In this case no extra ProjectExec unary physical operator is created.

Otherwise, pruneFilterProjectRaw creates a FilterExec unary physical operator (with the unhandled predicate expressions and the RowDataSourceScanExec leaf physical operator as the child) that in turn becomes the child of a new ProjectExec unary physical operator.

Note
pruneFilterProjectRaw is used when DataSourceStrategy execution planning strategy is executed (for a LogicalRelation with a CatalystScan relation) and by pruneFilterProject (when executed for a LogicalRelation with a PrunedFilteredScan or a PrunedScan relation).

BasicOperators


BasicOperators Execution Planning Strategy

BasicOperators is an execution planning strategy (of SparkPlanner) that in general does simple conversions from logical operators to their physical counterparts.

Table 1. BasicOperators’ Logical to Physical Operator Conversions

Logical Operator                      Physical Operator
RunnableCommand                       ExecutedCommandExec
MemoryPlan                            LocalTableScanExec
DeserializeToObject                   DeserializeToObjectExec
SerializeFromObject                   SerializeFromObjectExec
MapPartitions                         MapPartitionsExec
MapElements                           MapElementsExec
AppendColumns                         AppendColumnsExec
AppendColumnsWithObject               AppendColumnsWithObjectExec
MapGroups                             MapGroupsExec
CoGroup                               CoGroupExec
Repartition (with shuffle enabled)    ShuffleExchangeExec
Repartition                           CoalesceExec
SortPartitions                        SortExec
Sort                                  SortExec
Project                               ProjectExec
Filter                                FilterExec
TypedFilter                           FilterExec
Expand                                ExpandExec
Window                                WindowExec
Sample                                SampleExec
LocalRelation                         LocalTableScanExec
LocalLimit                            LocalLimitExec
GlobalLimit                           GlobalLimitExec
Union                                 UnionExec
Generate                              GenerateExec
OneRowRelation                        RDDScanExec
Range                                 RangeExec
RepartitionByExpression               ShuffleExchangeExec
ExternalRDD                           ExternalRDDScanExec
LogicalRDD                            RDDScanExec
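A few of the conversions can be observed directly in a physical plan from spark-shell (the exact output, e.g. codegen stage markers and the number of splits, varies by Spark version and environment):

import spark.implicits._

val q = spark.range(4).filter($"id" % 2 === 0).select($"id" * 2 as "n")
q.explain
// == Physical Plan ==
// *(1) Project [(id#0L * 2) AS n#4L]          <- Project gives ProjectExec
// +- *(1) Filter ((id#0L % 2) = 0)            <- Filter gives FilterExec
//    +- *(1) Range (0, 4, step=1, splits=8)   <- Range gives RangeExec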

Tip
Confirm the operator mapping in the source code of BasicOperators.
Note
BasicOperators expects that Distinct, Intersect, and Except logical operators are not used in a logical plan and throws an IllegalStateException if they are.

Aggregation


Aggregation Execution Planning Strategy for Aggregate Physical Operators

Aggregation can select the following aggregate physical operators (in the order of preference):

  1. HashAggregateExec

  2. ObjectHashAggregateExec

  3. SortAggregateExec

Applying Aggregation Strategy to Logical Plan (Executing Aggregation) — apply Method

Note
apply is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply requests PhysicalAggregation extractor for Aggregate logical operators and creates a single aggregate physical operator for every Aggregate logical operator found.

Internally, apply requests PhysicalAggregation to destructure an Aggregate logical operator (into a four-element tuple) and splits aggregate expressions per whether they are distinct or not (using their isDistinct flag).

apply then creates a physical operator using the following helper methods:

  1. AggUtils.planAggregateWithoutDistinct when no aggregate expression is distinct

  2. AggUtils.planAggregateWithOneDistinct when there is at least one distinct aggregate expression
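For example, a simple non-distinct group-by is planned as a pair of HashAggregateExec operators (partial and final) around an exchange (output abbreviated; identifiers and partition counts vary per session):

import org.apache.spark.sql.functions.count
import spark.implicits._

val q = spark.range(10).groupBy(($"id" % 3) as "g").agg(count("*") as "c")
q.explain
// *(2) HashAggregate(keys=[g#...], functions=[count(1)])
// +- Exchange hashpartitioning(g#..., 200)
//    +- *(1) HashAggregate(keys=[g#...], functions=[partial_count(1)])
//       +- *(1) Range (0, 10, step=1, splits=...)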

PushDownOperatorsToDataSource


PushDownOperatorsToDataSource Logical Optimization

PushDownOperatorsToDataSource is a logical optimization that pushes down operators to the underlying data sources (i.e. DataSourceV2Relations) before query planning, so that a data source can report statistics more accurately.

Technically, PushDownOperatorsToDataSource is a Catalyst rule for transforming logical plans, i.e. Rule[LogicalPlan].

PushDownOperatorsToDataSource is part of the Push down operators to data source scan once-executed rule batch of the SparkOptimizer.

Executing Rule — apply Method

Note
apply is part of the Rule Contract to execute (apply) a rule on a TreeNode (e.g. LogicalPlan).

apply…​FIXME

pushDownRequiredColumns Internal Method

pushDownRequiredColumns branches off per the input logical operator (that is supposed to have at least one child node):

  1. For Project unary logical operator, pushDownRequiredColumns takes the references of the project expressions as the required columns (attributes) and executes itself recursively on the child logical operator

    Note that the input requiredByParent attributes are not considered in the required columns.

  2. For Filter unary logical operator, pushDownRequiredColumns adds the references of the filter condition to the input requiredByParent attributes and executes itself recursively on the child logical operator

  3. For DataSourceV2Relation unary logical operator, pushDownRequiredColumns…​FIXME

  4. For other logical operators, pushDownRequiredColumns simply executes itself (using TreeNode.mapChildren) recursively on the child nodes (logical operators)
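The recursion shape can be sketched as follows (simplified: the DataSourceV2Relation case is omitted, and the real method also prunes the underlying reader’s schema, so this is only an illustration of the branching):

import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: AttributeSet): LogicalPlan =
  plan match {
    case p @ Project(projectList, child) =>
      // requiredByParent is ignored; the projection's own references win
      val required = AttributeSet(projectList.flatMap(_.references))
      p.copy(child = pushDownRequiredColumns(child, required))
    case f @ Filter(condition, child) =>
      // the filter condition's references extend the parent's requirements
      f.copy(child = pushDownRequiredColumns(child, requiredByParent ++ condition.references))
    // case r: DataSourceV2Relation => ...prune the reader's schema... (omitted)
    case other =>
      other.mapChildren(c => pushDownRequiredColumns(c, c.outputSet))
  }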

Note
pushDownRequiredColumns is used exclusively when PushDownOperatorsToDataSource logical optimization is requested to execute.

Destructuring Logical Operator — FilterAndProject.unapply Method

unapply is part of FilterAndProject extractor object to destructure the input logical operator into a tuple with…​FIXME

unapply works with (matches) the following logical operators:

  1. For a Filter with a DataSourceV2Relation leaf logical operator, unapply…​FIXME

  2. For a Filter with a Project over a DataSourceV2Relation leaf logical operator, unapply…​FIXME

  3. For others, unapply returns None (i.e. does nothing / does not match)

Note
unapply is used exclusively when PushDownOperatorsToDataSource logical optimization is requested to execute.

PruneFileSourcePartitions


PruneFileSourcePartitions Logical Optimization

PruneFileSourcePartitions is…​FIXME

apply Method

Note
apply is part of Rule Contract to apply a rule to a TreeNode, e.g. logical query plan.

apply…​FIXME

OptimizeMetadataOnlyQuery


OptimizeMetadataOnlyQuery Logical Optimization

OptimizeMetadataOnlyQuery is…​FIXME

apply Method

Note
apply is part of Rule Contract to apply a rule to a TreeNode, e.g. logical query plan.

apply…​FIXME

ExtractPythonUDFFromAggregate


ExtractPythonUDFFromAggregate Logical Optimization

ExtractPythonUDFFromAggregate is…​FIXME

apply Method

Note
apply is part of Rule Contract to apply a rule to a TreeNode, e.g. logical query plan.

apply…​FIXME

SimplifyCasts


SimplifyCasts Logical Optimization

SimplifyCasts is a base logical optimization that eliminates redundant casts in the following cases:

  1. The input is already the type to cast to.

  2. The input is of ArrayType or MapType type and contains no null elements.
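Case 1 is easy to observe in spark-shell: the id column of spark.range is already a bigint, so casting it to long is the identity, and the cast (and then the now-redundant projection) disappears from the optimized plan (splits value varies by environment):

import spark.implicits._

val q = spark.range(1).select($"id" cast "long")
q.queryExecution.optimizedPlan
// res0: ... = Range (0, 1, step=1, splits=Some(8))   <- no Cast (or Project) left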

SimplifyCasts is part of the Operator Optimization before Inferring Filters fixed-point batch in the standard batches of the Catalyst Optimizer.

SimplifyCasts is simply a Catalyst rule for transforming logical plans, i.e. Rule[LogicalPlan].

Executing Rule — apply Method

Note
apply is part of the Rule Contract to execute (apply) a rule on a TreeNode (e.g. LogicalPlan).

apply…​FIXME

RewritePredicateSubquery


RewritePredicateSubquery Logical Optimization

RewritePredicateSubquery is a logical optimization that rewrites predicate subqueries in Filter operators to joins:

  • Filter operators with Exists and In with ListQuery expressions give left-semi joins

  • Filter operators with Not with Exists and In with ListQuery expressions give left-anti joins

Note
Prefer EXISTS (over Not with In with a ListQuery subquery expression) if performance matters, since the latter, as the source code puts it, “will almost certainly be planned as a Broadcast Nested Loop join”.
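A spark-shell illustration of the first rewrite (identifiers such as a#3 vary per session):

import spark.implicits._

Seq(1, 2).toDF("a").createOrReplaceTempView("t1")
Seq(2, 3).toDF("b").createOrReplaceTempView("t2")

val q = spark.sql("SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE b = a)")
println(q.queryExecution.optimizedPlan.numberedTreeString)
// 00 Join LeftSemi, (b#7 = a#3)
// 01 :- LocalRelation [a#3]
// 02 +- LocalRelation [b#7]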

RewritePredicateSubquery is part of the RewriteSubquery once-executed batch in the standard batches of the Catalyst Optimizer.

RewritePredicateSubquery is simply a Catalyst rule for transforming logical plans, i.e. Rule[LogicalPlan].

rewriteExistentialExpr Internal Method

rewriteExistentialExpr…​FIXME

Note
rewriteExistentialExpr is used when…​FIXME

dedupJoin Internal Method

dedupJoin…​FIXME

Note
dedupJoin is used when…​FIXME

getValueExpression Internal Method

getValueExpression…​FIXME

Note
getValueExpression is used when…​FIXME

Executing Rule — apply Method

Note
apply is part of the Rule Contract to execute (apply) a rule on a TreeNode (e.g. LogicalPlan).

apply transforms Filter unary operators in the input logical plan.

apply splits the conjunctive predicates in the condition expression (i.e. the expressions combined with And expressions) and partitions them into two collections: expressions with and expressions without In or Exists subquery expressions.

apply creates a Filter operator for the condition (sub)expressions without subqueries (combined with an And expression) if there are any, or simply takes the child operator of the input Filter unary operator.

In the end, apply creates a new logical plan with Join operators for Exists and In expressions (and their negations) as follows:

  1. Exists and In with ListQuery expressions become left-semi joins

  2. Not with Exists and Not with In with ListQuery expressions become left-anti joins
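Continuing the spark-shell example from above, a query that mixes a subquery predicate with an ordinary conjunct shows the split: the plain conjunct stays in a Filter, while the IN predicate becomes a left-semi join on top of it (identifiers vary per session):

val q2 = spark.sql("SELECT * FROM t1 WHERE a IN (SELECT b FROM t2) AND a > 0")
println(q2.queryExecution.optimizedPlan.numberedTreeString)
// 00 Join LeftSemi, (a#3 = b#7)
// 01 :- Filter (a#3 > 0)
// 02 :  +- LocalRelation [a#3]
// 03 +- LocalRelation [b#7]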
