BasicStatsPlanVisitor — Computing Statistics for Cost-Based Optimization

BasicStatsPlanVisitor is a LogicalPlanVisitor that computes the statistics of a logical query plan for cost-based optimization (i.e. when cost-based optimization is enabled).

Note
Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is on, i.e. true, and is disabled by default.

BasicStatsPlanVisitor is used exclusively when a logical operator is requested for the statistics with cost-based optimization enabled.
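
For example, with the property turned on, requesting the statistics of an optimized plan goes through BasicStatsPlanVisitor. A minimal sketch (any Dataset will do):

```scala
// Enable cost-based optimization (disabled by default)
spark.conf.set("spark.sql.cbo.enabled", true)

// Requesting plan statistics now dispatches to BasicStatsPlanVisitor
val plan = spark.range(1000).queryExecution.optimizedPlan
println(plan.stats.simpleString)
```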

BasicStatsPlanVisitor comes with custom handlers for a few logical operators and falls back to SizeInBytesOnlyStatsPlanVisitor for the others.

Table 1. BasicStatsPlanVisitor’s Visitor Handlers

| Logical Operator | Handler | Behaviour |
| --- | --- | --- |
| Aggregate | visitAggregate | Requests AggregateEstimation for statistics estimates and query hints, or falls back to SizeInBytesOnlyStatsPlanVisitor |
| Filter | visitFilter | Requests FilterEstimation for statistics estimates and query hints, or falls back to SizeInBytesOnlyStatsPlanVisitor |
| Join | visitJoin | Requests JoinEstimation for statistics estimates and query hints, or falls back to SizeInBytesOnlyStatsPlanVisitor |
| Project | visitProject | Requests ProjectEstimation for statistics estimates and query hints, or falls back to SizeInBytesOnlyStatsPlanVisitor |

SizeInBytesOnlyStatsPlanVisitor — LogicalPlanVisitor for Total Size (in Bytes) Statistic Only

SizeInBytesOnlyStatsPlanVisitor is a LogicalPlanVisitor that computes a single dimension for plan statistics, i.e. the total size (in bytes).

default Method

Note
default is part of LogicalPlanVisitor Contract to compute the size statistic (in bytes) of a logical operator.

default requests a leaf logical operator for the statistics or creates a Statistics with the product of the sizeInBytes statistic of every child operator.

Note
default uses the cache of the estimated statistics of a logical operator so the statistics of an operator is computed once until it is invalidated.
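
The gist of default, paraphrased from Spark's source: a leaf operator computes its own statistics, while any other operator gets a Statistics whose sizeInBytes is the product of its children's.

```scala
// Paraphrased sketch of SizeInBytesOnlyStatsPlanVisitor.default
override def default(p: LogicalPlan): Statistics = p match {
  // A leaf operator knows how to compute its own statistics
  case p: LeafNode => p.computeStats()
  // Any other operator: multiply the sizeInBytes of all children
  case _: LogicalPlan =>
    Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).product)
}
```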

visitIntersect Method

Note
visitIntersect is part of LogicalPlanVisitor Contract to…​FIXME.

visitIntersect…​FIXME

visitJoin Method

Note
visitJoin is part of LogicalPlanVisitor Contract to…​FIXME.

visitJoin…​FIXME

LogicalPlanVisitor — Contract for Computing Statistic Estimates and Query Hints of Logical Plan

LogicalPlanVisitor is the contract that uses the visitor design pattern to scan a logical query plan and compute estimates of plan statistics and query hints.

Tip
Read about the visitor design pattern in Wikipedia.

LogicalPlanVisitor defines visit method that dispatches computing the statistics of a logical plan to the corresponding handler methods.

Note
T stands for the type of a result to be computed (while visiting the query plan tree) and is currently always Statistics only.
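
An abridged sketch of the contract (only a few of the handlers listed in Table 2 are shown):

```scala
// Abridged sketch of the LogicalPlanVisitor contract
trait LogicalPlanVisitor[T] {

  // Dispatches to the handler that matches the logical operator
  def visit(p: LogicalPlan): T = p match {
    case p: Aggregate   => visitAggregate(p)
    case p: Filter      => visitFilter(p)
    case p: Join        => visitJoin(p)
    case p: Project     => visitProject(p)
    // ...the remaining operators from Table 2 are elided here
    case p: LogicalPlan => default(p)
  }

  def default(p: LogicalPlan): T
  def visitAggregate(p: Aggregate): T
  def visitFilter(p: Filter): T
  def visitJoin(p: Join): T
  def visitProject(p: Project): T
  // ...one handler per logical operator in Table 2
}
```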

The concrete LogicalPlanVisitor is chosen based on the spark.sql.cbo.enabled configuration property. When turned on (i.e. true), LogicalPlanStats uses BasicStatsPlanVisitor; otherwise, it uses SizeInBytesOnlyStatsPlanVisitor.

Note
spark.sql.cbo.enabled configuration property is off, i.e. false by default.

Table 1. LogicalPlanVisitors

| LogicalPlanVisitor | Description |
| --- | --- |
| BasicStatsPlanVisitor | Computes the statistics of a logical query plan for cost-based optimization |
| SizeInBytesOnlyStatsPlanVisitor | Computes a single dimension for plan statistics, i.e. the total size (in bytes) |

Table 2. LogicalPlanVisitor’s Logical Operators and Their Handlers

| Logical Operator | Handler |
| --- | --- |
| Aggregate | visitAggregate |
| Distinct | visitDistinct |
| Except | visitExcept |
| Expand | visitExpand |
| Filter | visitFilter |
| Generate | visitGenerate |
| GlobalLimit | visitGlobalLimit |
| Intersect | visitIntersect |
| Join | visitJoin |
| LocalLimit | visitLocalLimit |
| Pivot | visitPivot |
| Project | visitProject |
| Repartition | visitRepartition |
| RepartitionByExpression | visitRepartitionByExpr |
| ResolvedHint | visitHint |
| Sample | visitSample |
| ScriptTransformation | visitScriptTransform |
| Union | visitUnion |
| Window | visitWindow |
| Other logical operators | default |

HintInfo

HintInfo takes a single broadcast flag when created.

HintInfo is created when:

  1. The broadcast standard function (org.apache.spark.sql.functions.broadcast) is used on a Dataset

  2. ResolveBroadcastHints logical resolution rule is executed (and resolves UnresolvedHint logical operators)

  3. ResolvedHint and Statistics are created

  4. InMemoryRelation is requested for computeStats (when sizeInBytesStats is 0)

  5. HintInfo is requested to resetForJoin

broadcast is used to…​FIXME

broadcast is off (i.e. false) by default.
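
For example, marking one side of a join for broadcast plants a ResolvedHint with HintInfo(broadcast = true) in the logical plan. largeDF and smallDF below stand for any two DataFrames that share an id column:

```scala
import org.apache.spark.sql.functions.broadcast

// smallDF is wrapped in a ResolvedHint with HintInfo(broadcast = true)
val joined = largeDF.join(broadcast(smallDF), "id")
```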

resetForJoin Method

resetForJoin…​FIXME

Note
resetForJoin is used when SizeInBytesOnlyStatsPlanVisitor is requested to visitIntersect and visitJoin.

Statistics — Estimates of Plan Statistics and Query Hints

Statistics holds the statistics estimates and query hints of a logical operator:

  • Total (output) size (in bytes)

  • Estimated number of rows (aka row count)

  • Column attribute statistics (aka column (equi-height) histograms)

  • Query hints

Note
Cost statistics, plan statistics or query statistics are all synonyms and used interchangeably.

You can access statistics and query hints of a logical plan using stats property.
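
For example:

```scala
val q = spark.range(5).join(spark.range(10), "id")

// stats is available on any logical operator, e.g. the optimized plan's root
val stats = q.queryExecution.optimizedPlan.stats
println(stats.simpleString)
```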

Note
Use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table.
Note
Use Dataset.hint or SELECT SQL statement with hints to specify query hints.

Statistics is created when:

Note
row count estimate is used in CostBasedJoinReorder logical optimization when cost-based optimization is enabled.
Note

CatalogStatistics is a “subset” of all possible Statistics (as there are no concepts of attributes and query hints in metastore).

CatalogStatistics are statistics stored in an external catalog (usually a Hive metastore) and are often referred to as Hive statistics, while Statistics represents the Spark statistics.


Statistics comes with a simpleString method that is used for the readable text representation (i.e. toString with a Statistics prefix).

LogicalPlanStats — Statistics Estimates and Query Hints of Logical Operator

LogicalPlanStats adds statistics support to logical operators and is used for query planning (with or without cost-based optimization, e.g. CostBasedJoinReorder or JoinSelection, respectively).

With LogicalPlanStats every logical operator has statistics that are computed only once when requested and are cached until invalidated and requested again.

Depending on whether cost-based optimization is enabled, stats computes the statistics with BasicStatsPlanVisitor or SizeInBytesOnlyStatsPlanVisitor, respectively.

Note
Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true, and is disabled by default.

Use EXPLAIN COST SQL command to explain a query with the statistics.

You can also access the statistics of a logical plan directly using stats method or indirectly requesting QueryExecution for text representation with statistics.
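
Both approaches in action:

```scala
// EXPLAIN COST prints the optimized logical plan annotated with statistics
spark.sql("EXPLAIN COST SELECT * FROM range(10)").show(truncate = false)

// The programmatic equivalent on QueryExecution
val q = spark.range(10)
println(q.queryExecution.stringWithStats)
```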

Note
The statistics of a Dataset are unaffected by caching it.
Note
LogicalPlanStats is a Scala trait with self: LogicalPlan as part of its definition. It is a very useful feature of Scala that restricts the set of classes that the trait could be used with (as well as makes the target subtype known at compile time).

Computing (and Caching) Statistics and Query Hints — stats Method

stats gets the statistics from statsCache if already computed. Otherwise, stats branches off per whether cost-based optimization is enabled or not.

Note

Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true, and is disabled by default.


Use SQLConf.cboEnabled to access the current value of spark.sql.cbo.enabled property.

With cost-based optimization disabled stats requests SizeInBytesOnlyStatsPlanVisitor to compute the statistics.

With cost-based optimization enabled stats requests BasicStatsPlanVisitor to compute the statistics.

In the end, statsCache caches the statistics for later use.
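
A paraphrased sketch of stats, following the description above:

```scala
// Paraphrased sketch of LogicalPlanStats.stats
def stats: Statistics = statsCache.getOrElse {
  statsCache = if (conf.cboEnabled) {
    Option(BasicStatsPlanVisitor.visit(self))
  } else {
    Option(SizeInBytesOnlyStatsPlanVisitor.visit(self))
  }
  statsCache.get
}
```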

Note

stats is used when:

Invalidating Statistics Cache (of All Operators in Logical Plan) — invalidateStatsCache Method

invalidateStatsCache clears the statsCache of the current logical operator and then requests the child logical operators to do the same.
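
A paraphrased sketch of invalidateStatsCache:

```scala
// Paraphrased sketch of LogicalPlanStats.invalidateStatsCache
final def invalidateStatsCache(): Unit = {
  statsCache = None                          // drop the cached statistics
  children.foreach(_.invalidateStatsCache()) // and recurse into children
}
```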

SparkStrategies — Container of Execution Planning Strategies

SparkStrategies is an abstract Catalyst query planner that merely serves as a “container” (or a namespace) of the concrete execution planning strategies (for SparkPlanner).

SparkStrategies has a single lazily-instantiated singleRowRdd value that is an RDD of internal binary rows that BasicOperators execution planning strategy uses when resolving OneRowRelation (to RDDScanExec leaf physical operator).

Note
OneRowRelation logical operator represents SQL’s SELECT clause without FROM clause or EXPLAIN DESCRIBE TABLE.

SparkStrategy — Base for Execution Planning Strategies

SparkStrategy is a Catalyst GenericStrategy that converts a logical plan into zero or more physical plans.

SparkStrategy marks logical plans (i.e. LogicalPlan) to be planned later (by some other SparkStrategy or after other SparkStrategy strategies have finished) using PlanLater physical operator.

Note

SparkStrategy is used as the Strategy type alias (aka type synonym) in Spark’s code base, defined in the org.apache.spark.sql package object:
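
```scala
// From the org.apache.spark.sql package object
type Strategy = SparkStrategy
```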


PlanLater Physical Operator

Caution
FIXME

SparkPlanner — Spark Query Planner

SparkPlanner is a concrete Catalyst Query Planner that converts a logical plan to one or more physical plans using execution planning strategies with support for extra strategies (by means of ExperimentalMethods) and extraPlanningStrategies.

Note
SparkPlanner is expected to plan (aka generate) at least one physical plan per logical plan.

SparkPlanner is available as planner of a SessionState.

Table 1. SparkPlanner’s Execution Planning Strategies (in execution order)

| SparkStrategy | Description |
| --- | --- |
| ExperimentalMethods’s extraStrategies | |
| extraPlanningStrategies | Extension point for extra planning strategies |
| DataSourceV2Strategy | |
| FileSourceStrategy | |
| DataSourceStrategy | |
| SpecialLimits | |
| Aggregation | |
| JoinSelection | |
| InMemoryScans | |
| BasicOperators | |

Note
SparkPlanner extends SparkStrategies abstract class.

Creating SparkPlanner Instance

SparkPlanner takes the following when created:

Note

SparkPlanner is created in:

Extension Point for Extra Planning Strategies — extraPlanningStrategies Method

extraPlanningStrategies is an extension point to register extra planning strategies with the query planner.

Note
extraPlanningStrategies are executed after extraStrategies.
Note

extraPlanningStrategies is used when SparkPlanner is requested for planning strategies.

extraPlanningStrategies is overridden in the SessionState builders, i.e. BaseSessionStateBuilder and HiveSessionStateBuilder.
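
User code typically registers extra strategies through ExperimentalMethods rather than by overriding extraPlanningStrategies. A minimal sketch with a hypothetical MyStrategy:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy that plans nothing and defers to the built-in strategies
object MyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

// Registered strategies end up among SparkPlanner's extraStrategies
spark.experimental.extraStrategies = MyStrategy :: Nil
```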

Collecting PlanLater Physical Operators — collectPlaceholders Method

collectPlaceholders collects all PlanLater physical operators in the input plan physical plan.

Note
collectPlaceholders is part of QueryPlanner Contract.

Pruning “Bad” Physical Plans — prunePlans Method

prunePlans simply returns the input plans physical plans unchanged.

Note
prunePlans is part of QueryPlanner Contract to remove somehow “bad” plans.

Creating Physical Operator (Possibly Under FilterExec and ProjectExec Operators) — pruneFilterProject Method

Note
pruneFilterProject is almost like DataSourceStrategy.pruneFilterProjectRaw.

pruneFilterProject branches off per whether it is possible to use column pruning alone (to get the right projection) and whether the input projectList columns of this projection are enough to evaluate all of the input filterPredicates filter conditions.

If so, pruneFilterProject does the following:

  1. Applies the input scanBuilder function to the input projectList columns that creates a new physical operator

  2. If there are Catalyst predicate expressions in the input prunePushedDownFilters that cannot be pushed down, pruneFilterProject creates a FilterExec unary physical operator (with the unhandled predicate expressions)

  3. Otherwise, pruneFilterProject simply returns the physical operator

Note
In this case no extra ProjectExec unary physical operator is created.

If not (i.e. it is neither possible to use a column pruning only nor evaluate filter conditions), pruneFilterProject does the following:

  1. Applies the input scanBuilder function to the projection and filtering columns that creates a new physical operator

  2. Creates a FilterExec unary physical operator (with the unhandled predicate expressions if available)

  3. Creates a ProjectExec unary physical operator with the optional FilterExec operator (with the scan physical operator) or simply the scan physical operator alone

Note

pruneFilterProject is used when:

Catalyst Optimizer — Generic Logical Query Plan Optimizer

Optimizer (aka Catalyst Optimizer) is the base of logical query plan optimizers that defines the rule batches of logical optimizations, i.e. the rules that transform the query plan of a structured query to produce the optimized logical plan.

Note
SparkOptimizer is the one and only direct implementation of the Optimizer Contract in Spark SQL.

Optimizer is a RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]).

Optimizer is available as the optimizer property of a session-specific SessionState.

You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or SQL’s EXPLAIN EXTENDED command.

Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good “debugging” tool).
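
For example:

```scala
import spark.implicits._

val q = spark.range(10).where($"id" > 5)

// Parsed, analyzed, optimized logical plans plus the physical plan
q.explain(extended = true)

// The same from SQL
spark.sql("EXPLAIN EXTENDED SELECT * FROM range(10) WHERE id > 5").show(truncate = false)
```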

Optimizer defines the default rule batches that are considered the base rule batches that can be further refined (extended or with some rules excluded).

Table 1. Optimizer’s Default Optimization Rule Batches (in the order of execution)

| Batch Name | Strategy | Rules | Description |
| --- | --- | --- | --- |
| Eliminate Distinct | Once | EliminateDistinct | |
| Finish Analysis | Once | EliminateSubqueryAliases | Removes (eliminates) SubqueryAlias unary logical operators from a logical plan |
| | | EliminateView | Removes (eliminates) View unary logical operators from a logical plan and replaces them with their child logical operator |
| | | ReplaceExpressions | Replaces RuntimeReplaceable expressions with their single child expression |
| | | ComputeCurrentTime | |
| | | GetCurrentDatabase | |
| | | RewriteDistinctAggregates | |
| | | ReplaceDeduplicateWithAggregate | |
| Union | Once | CombineUnions | |
| LocalRelation early | FixedPoint | ConvertToLocalRelation | |
| | | PropagateEmptyRelation | |
| Pullup Correlated Expressions | Once | PullupCorrelatedPredicates | |
| Subquery | Once | OptimizeSubqueries | |
| Replace Operators | FixedPoint | RewriteExceptAll | |
| | | RewriteIntersectAll | |
| | | ReplaceIntersectWithSemiJoin | |
| | | ReplaceExceptWithFilter | |
| | | ReplaceExceptWithAntiJoin | |
| | | ReplaceDistinctWithAggregate | |
| Aggregate | FixedPoint | RemoveLiteralFromGroupExpressions | |
| | | RemoveRepetitionFromGroupExpressions | |
| operatorOptimizationBatch | | | See Table 3 below |
| Join Reorder | Once | CostBasedJoinReorder | Reorders Join logical operators |
| Remove Redundant Sorts | Once | RemoveRedundantSorts | |
| Decimal Optimizations | FixedPoint | DecimalAggregates | |
| Object Expressions Optimization | FixedPoint | EliminateMapObjects | |
| | | CombineTypedFilters | |
| LocalRelation | FixedPoint | ConvertToLocalRelation | |
| | | PropagateEmptyRelation | |
| Extract PythonUDF From JoinCondition | Once | PullOutPythonUDFInJoinCondition | |
| Check Cartesian Products | Once | CheckCartesianProducts | |
| RewriteSubquery | Once | RewritePredicateSubquery | |
| | | ColumnPruning | |
| | | CollapseProject | |
| | | RemoveRedundantProject | |
| UpdateAttributeReferences | Once | UpdateNullabilityInAttributeReferences | |

Tip
Consult the sources of the Optimizer class for the up-to-date list of the default optimization rule batches.

Optimizer defines the operator optimization rules with the extendedOperatorOptimizationRules extension point for additional optimizations in the Operator Optimization batch.

Table 2. Optimizer’s Operator Optimization Rules (in the order of execution)

| Rule Name | Description |
| --- | --- |
| PushProjectionThroughUnion | |
| ReorderJoin | |
| EliminateOuterJoin | |
| PushPredicateThroughJoin | |
| PushDownPredicate | |
| LimitPushDown | |
| ColumnPruning | |
| CollapseRepartition | |
| CollapseProject | |
| CollapseWindow | Collapses two adjacent Window logical operators |
| CombineFilters | |
| CombineLimits | |
| CombineUnions | |
| NullPropagation | |
| ConstantPropagation | |
| FoldablePropagation | |
| OptimizeIn | |
| ConstantFolding | |
| ReorderAssociativeOperator | |
| LikeSimplification | |
| BooleanSimplification | |
| SimplifyConditionals | |
| RemoveDispensableExpressions | |
| SimplifyBinaryComparison | |
| PruneFilters | |
| EliminateSorts | |
| SimplifyCasts | |
| SimplifyCaseConversionExpressions | |
| RewriteCorrelatedScalarSubquery | |
| EliminateSerialization | |
| RemoveRedundantAliases | |
| RemoveRedundantProject | |
| SimplifyExtractValueOps | |
| CombineConcats | |

Optimizer defines the Operator Optimization Batch, which is simply a collection of rule batches with the operator optimization rules before and after the InferFiltersFromConstraints logical rule.

Table 3. Optimizer’s Operator Optimization Batch (in the order of execution)

| Batch Name | Strategy | Rules |
| --- | --- | --- |
| Operator Optimization before Inferring Filters | FixedPoint | Operator optimization rules |
| Infer Filters | Once | InferFiltersFromConstraints |
| Operator Optimization after Inferring Filters | FixedPoint | Operator optimization rules |

Optimizer uses spark.sql.optimizer.excludedRules configuration property to control what optimization rules in the defaultBatches should be excluded (default: none).
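
For example, to exclude one of the (excludable) operator optimization rules by its fully-qualified name:

```scala
// CombineFilters is not among the non-excludable rules, so it can be disabled
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.CombineFilters")
```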

Optimizer takes a SessionCatalog when created.

Note
Optimizer is a Scala abstract class and cannot be created directly. It is created indirectly when the concrete Optimizers are.

Optimizer defines the non-excludable optimization rules that are considered critical for query optimization and will never be excluded (even if they are specified in spark.sql.optimizer.excludedRules configuration property).

Table 4. Optimizer’s Non-Excludable Optimization Rules

| Rule Name |
| --- |
| PushProjectionThroughUnion |
| EliminateDistinct |
| EliminateSubqueryAliases |
| EliminateView |
| ReplaceExpressions |
| ComputeCurrentTime |
| GetCurrentDatabase |
| RewriteDistinctAggregates |
| ReplaceDeduplicateWithAggregate |
| ReplaceIntersectWithSemiJoin |
| ReplaceExceptWithFilter |
| ReplaceExceptWithAntiJoin |
| RewriteExceptAll |
| RewriteIntersectAll |
| ReplaceDistinctWithAggregate |
| PullupCorrelatedPredicates |
| RewriteCorrelatedScalarSubquery |
| RewritePredicateSubquery |
| PullOutPythonUDFInJoinCondition |

Table 5. Optimizer’s Internal Registries and Counters

| Name | Initial Value | Description |
| --- | --- | --- |
| fixedPoint | FixedPoint with the number of iterations as defined by spark.sql.optimizer.maxIterations | Used in the Replace Operators, Aggregate, Operator Optimizations, Decimal Optimizations, Typed Filter Optimization and LocalRelation batches (and also indirectly in the User Provided Optimizers rule batch in SparkOptimizer) |

Additional Operator Optimization Rules — extendedOperatorOptimizationRules Extension Point

extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization batch.

Note
extendedOperatorOptimizationRules rules are executed right after the built-in operator optimization rules, in both the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.
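
A minimal sketch of a custom Optimizer that contributes one (hypothetical, no-op) rule through the extension point:

```scala
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.catalyst.optimizer.Optimizer
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that does nothing, just to show the wiring
object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

class MyOptimizer(catalog: SessionCatalog) extends Optimizer(catalog) {
  override def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] =
    MyRule :: Nil
}
```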

batches Final Method

Note
batches is part of the RuleExecutor Contract to define the rule batches to use when executed.

batches…​FIXME
