Catalyst Optimizer — Generic Logical Query Plan Optimizer
Optimizer (aka Catalyst Optimizer) is the base of logical query plan optimizers. It defines the rule batches of logical optimizations, i.e. the rules that transform the logical plan of a structured query to produce the optimized logical plan.
Note: SparkOptimizer is the one and only direct implementation of the Optimizer Contract in Spark SQL.
Optimizer is a RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]).
```text
Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan
```
Optimizer is available as the optimizer property of a session-specific SessionState.
```scala
scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer
```
You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.
```scala
// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)
```
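The EXPLAIN EXTENDED SQL command prints out the same kinds of plans. A minimal sketch (assuming the inventory query above is registered as a temporary view named inventory):

```scala
// Register the query as a temporary view so it can be referenced from SQL
inventory.createOrReplaceTempView("inventory")

// EXPLAIN EXTENDED returns the parsed, analyzed, optimized and physical plans
spark.sql("EXPLAIN EXTENDED SELECT * FROM inventory").show(false)
```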
Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good “debugging” tool).
```scala
val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
```
Optimizer defines the default rule batches, the base rule batches that can be further refined (extended with additional rules or with some rules excluded). They include the following (consult the Tip below for the complete, up-to-date list):
| Batch Name | Strategy | Rules | Description |
|---|---|---|---|
| Finish Analysis | Once | EliminateSubqueryAliases | Removes (eliminates) SubqueryAlias unary logical operators from a logical plan |
| | | EliminateView | Removes (eliminates) View unary logical operators from a logical plan and replaces them with their child logical operator |
| | | ReplaceExpressions | Replaces RuntimeReplaceable expressions with their single child expression |
| Aggregate | FixedPoint | RemoveLiteralFromGroupExpressions | |
| | | RemoveRepetitionFromGroupExpressions | |
| Join Reorder | Once | CostBasedJoinReorder | Reorders Join logical operators |
| Object Expressions Optimization | FixedPoint | EliminateMapObjects | |
| LocalRelation | FixedPoint | ConvertToLocalRelation | |
| Extract PythonUDF From JoinCondition | Once | PullOutPythonUDFInJoinCondition | |
| Check Cartesian Products | Once | CheckCartesianProducts | |
| UpdateAttributeReferences | Once | UpdateNullabilityInAttributeReferences | |
Tip: Consult the sources of the Optimizer class for the up-to-date list of the default optimization rule batches.
Optimizer defines the operator optimization rules of the Operator Optimization batch (with the extendedOperatorOptimizationRules extension point for additional optimizations):
- PushProjectionThroughUnion
- ReorderJoin
- EliminateOuterJoin
- PushPredicateThroughJoin
- PushDownPredicate
- LimitPushDown
- ColumnPruning
- CollapseRepartition
- CollapseProject
- CombineFilters
- CombineLimits
- CombineUnions
- NullPropagation
- ConstantPropagation
- FoldablePropagation
- OptimizeIn
- ConstantFolding
- ReorderAssociativeOperator
- LikeSimplification
- BooleanSimplification
- SimplifyConditionals
- RemoveDispensableExpressions
- SimplifyBinaryComparison
- PruneFilters
- EliminateSorts
- SimplifyCasts
- SimplifyCaseConversionExpressions
- RewriteCorrelatedScalarSubquery
- EliminateSerialization
- RemoveRedundantAliases
- RemoveRedundantProject
- SimplifyExtractValueOps
- CombineConcats
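For example (a made-up query, not from the original text), ConstantFolding and the other operator optimization rules simplify expressions in the optimized plan:

```scala
import org.apache.spark.sql.functions.{col, lit}

// The foldable subexpression lit(1) + lit(2) is expected to be collapsed into a
// single literal by ConstantFolding, and the unused "extra" column should not
// survive into the optimized plan (CollapseProject / ColumnPruning)
val q = spark.range(3)
  .withColumn("extra", col("id") * 2)
  .select(lit(1) + lit(2) + col("id") as "x")
println(q.queryExecution.optimizedPlan.numberedTreeString)
```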
Optimizer defines the Operator Optimization batch, which is simply a collection of rule batches that runs the operator optimization rules before and after the InferFiltersFromConstraints logical rule (as illustrated after the table below).
| Batch Name | Strategy | Rules |
|---|---|---|
| Operator Optimization before Inferring Filters | FixedPoint | The operator optimization rules (with extendedOperatorOptimizationRules) |
| Infer Filters | Once | InferFiltersFromConstraints |
| Operator Optimization after Inferring Filters | FixedPoint | The operator optimization rules (with extendedOperatorOptimizationRules) |
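A small illustration (a made-up query, not from the original text): for an inner equi-join on a nullable column, the Infer Filters batch infers IsNotNull predicates on the join keys, which later batches can push further down:

```scala
import spark.implicits._

// Two tiny datasets with a nullable (string) join key -- example data only
val left = Seq(("a", 1), ("b", 2)).toDF("id", "v1")
val right = Seq(("a", 10), ("c", 30)).toDF("id", "v2")
val joined = left.join(right, "id")

// Compare the analyzed and optimized plans: InferFiltersFromConstraints adds
// isnotnull(id) predicates derived from the equi-join condition (later rules
// may push them down or evaluate them against the local relations)
joined.explain(extended = true)
```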
Optimizer uses the spark.sql.optimizer.excludedRules configuration property to control which optimization rules of the defaultBatches should be excluded (default: none).
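A minimal sketch of excluding a single rule:

```scala
// Exclude ConvertToLocalRelation from the optimization (rules are identified
// by their fully-qualified names; multiple rules are comma-separated)
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
```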
Optimizer takes a SessionCatalog when created.
Note: Optimizer is a Scala abstract class and cannot be created directly. It is created indirectly through its concrete optimizers (e.g. SparkOptimizer).
Optimizer defines the non-excludable optimization rules that are considered critical for query optimization and will never be excluded (even if they are specified in the spark.sql.optimizer.excludedRules configuration property):
- PushProjectionThroughUnion
- EliminateDistinct
- EliminateSubqueryAliases
- EliminateView
- ReplaceExpressions
- ComputeCurrentTime
- GetCurrentDatabase
- RewriteDistinctAggregates
- ReplaceDeduplicateWithAggregate
- ReplaceIntersectWithSemiJoin
- ReplaceExceptWithFilter
- ReplaceExceptWithAntiJoin
- RewriteExceptAll
- RewriteIntersectAll
- ReplaceDistinctWithAggregate
- PullupCorrelatedPredicates
- RewriteCorrelatedScalarSubquery
- RewritePredicateSubquery
- PullOutPythonUDFInJoinCondition
Optimizer defines the following internal property:

| Name | Initial Value | Description |
|---|---|---|
| fixedPoint | FixedPoint with the number of iterations as defined by the spark.sql.optimizer.maxIterations configuration property | Used in the Replace Operators, Aggregate, Operator Optimizations, Decimal Optimizations, Typed Filter Optimization and LocalRelation batches (and also indirectly in the User Provided Optimizers rule batch in SparkOptimizer) |
Additional Operator Optimization Rules — extendedOperatorOptimizationRules Extension Point
```scala
extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]]
```
extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization batch.
Note: extendedOperatorOptimizationRules rules are executed right after the default operator optimization rules (in both the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches).
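A minimal sketch of a custom Optimizer that contributes an extra operator optimization rule through this extension point (MyOptimizer and MyNoopRule are hypothetical names used for illustration only):

```scala
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.catalyst.optimizer.Optimizer
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A hypothetical no-op rule, just to show the shape of an operator optimization rule
object MyNoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// A hypothetical Optimizer that appends the rule to the Operator Optimization batch
class MyOptimizer(catalog: SessionCatalog) extends Optimizer(catalog) {
  override def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] =
    super.extendedOperatorOptimizationRules :+ MyNoopRule
}
```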
batches Final Method
```scala
batches: Seq[Batch]
```
Note: batches is part of the RuleExecutor Contract to define the rule batches to use when executed.
batches takes the defaultBatches and removes the optimization rules listed in the spark.sql.optimizer.excludedRules configuration property (the non-excludable rules are always kept).
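When the Optimizer is executed (as a RuleExecutor), these batches are applied to an analyzed logical plan. A minimal sketch (reusing the inventory query from above):

```scala
// Run the optimizer batches directly over the analyzed logical plan
val analyzed = inventory.queryExecution.analyzed
val optimized = spark.sessionState.optimizer.execute(analyzed)
println(optimized.numberedTreeString)
```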