Catalyst Optimizer — Generic Logical Query Plan Optimizer
Optimizer
(aka Catalyst Optimizer) is the base of logical query plan optimizers that defines the rule batches of logical optimizations (i.e. logical optimizations that are the rules that transform the query plan of a structured query to produce the optimized logical plan).
Note
|
SparkOptimizer is the one and only direct implementation of the Optimizer Contract in Spark SQL.
|
Optimizer
is a RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]
).
1 2 3 4 5 |
Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan |
Optimizer
is available as the optimizer property of a session-specific SessionState
.
1 2 3 4 5 6 7 8 9 |
scala> :type spark org.apache.spark.sql.SparkSession scala> :type spark.sessionState.optimizer org.apache.spark.sql.catalyst.optimizer.Optimizer |
You can access the optimized logical plan of a structured query (as a Dataset) using Dataset.explain basic action (with extended
flag enabled) or SQL’s EXPLAIN EXTENDED
SQL command.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
// sample structured query val inventory = spark .range(5) .withColumn("new_column", 'id + 5 as "plus5") // Using explain operator (with extended flag enabled) scala> inventory.explain(extended = true) == Parsed Logical Plan == 'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3] +- AnalysisBarrier +- Range (0, 5, step=1, splits=Some(8)) == Analyzed Logical Plan == id: bigint, new_column: bigint Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L] +- Range (0, 5, step=1, splits=Some(8)) == Optimized Logical Plan == Project [id#0L, (id#0L + 5) AS new_column#3L] +- Range (0, 5, step=1, splits=Some(8)) == Physical Plan == *(1) Project [id#0L, (id#0L + 5) AS new_column#3L] +- *(1) Range (0, 5, step=1, splits=8) |
Alternatively, you can access the analyzed logical plan using QueryExecution
and its optimizedPlan property (that together with numberedTreeString
method is a very good “debugging” tool).
1 2 3 4 5 6 7 8 |
val optimizedPlan = inventory.queryExecution.optimizedPlan scala> println(optimizedPlan.numberedTreeString) 00 Project [id#0L, (id#0L + 5) AS new_column#3L] 01 +- Range (0, 5, step=1, splits=Some(8)) |
Optimizer
defines the default rule batches that are considered the base rule batches that can be further refined (extended or with some rules excluded).
Batch Name | Strategy | Rules | Description |
---|---|---|---|
|
|||
|
Removes (eliminates) SubqueryAlias unary logical operators from a logical plan |
||
Removes (eliminates) View unary logical operators from a logical plan and replaces them with their child logical operator |
|||
Replaces RuntimeReplaceable expressions with their single child expression |
|||
|
|||
|
|||
|
|||
RemoveLiteralFromGroupExpressions |
|||
RemoveRepetitionFromGroupExpressions |
|||
|
Reorders Join logical operators |
||
|
|||
EliminateMapObjects |
|||
ConvertToLocalRelation |
|||
|
PullOutPythonUDFInJoinCondition |
||
|
CheckCartesianProducts |
||
|
|||
|
UpdateNullabilityInAttributeReferences |
Tip
|
Consult the sources of the Optimizer class for the up-to-date list of the default optimization rule batches.
|
Optimizer
defines the operator optimization rules with the extendedOperatorOptimizationRules extension point for additional optimizations in the Operator Optimization batch.
Rule Name | Description |
---|---|
PushProjectionThroughUnion |
|
ReorderJoin |
|
EliminateOuterJoin |
|
PushPredicateThroughJoin |
|
PushDownPredicate |
|
LimitPushDown |
|
ColumnPruning |
|
CollapseRepartition |
|
CollapseProject |
|
CombineFilters |
|
CombineLimits |
|
CombineUnions |
|
NullPropagation |
|
ConstantPropagation |
|
FoldablePropagation |
|
OptimizeIn |
|
ConstantFolding |
|
ReorderAssociativeOperator |
|
LikeSimplification |
|
BooleanSimplification |
|
SimplifyConditionals |
|
RemoveDispensableExpressions |
|
SimplifyBinaryComparison |
|
PruneFilters |
|
EliminateSorts |
|
SimplifyCasts |
|
SimplifyCaseConversionExpressions |
|
RewriteCorrelatedScalarSubquery |
|
EliminateSerialization |
|
RemoveRedundantAliases |
|
RemoveRedundantProject |
|
SimplifyExtractValueOps |
|
CombineConcats |
Optimizer
defines Operator Optimization Batch that is simply a collection of rule batches with the operator optimization rules before and after InferFiltersFromConstraints
logical rule.
Batch Name | Strategy | Rules |
---|---|---|
Operator Optimization before Inferring Filters |
||
Infer Filters |
|
InferFiltersFromConstraints |
Operator Optimization after Inferring Filters |
Optimizer
uses spark.sql.optimizer.excludedRules configuration property to control what optimization rules in the defaultBatches should be excluded (default: none).
Optimizer
takes a SessionCatalog when created.
Note
|
Optimizer is a Scala abstract class and cannot be created directly. It is created indirectly when the concrete Optimizers are.
|
Optimizer
defines the non-excludable optimization rules that are considered critical for query optimization and will never be excluded (even if they are specified in spark.sql.optimizer.excludedRules configuration property).
Rule Name | Description |
---|---|
PushProjectionThroughUnion |
|
EliminateDistinct |
|
EliminateSubqueryAliases |
|
EliminateView |
|
ReplaceExpressions |
|
ComputeCurrentTime |
|
GetCurrentDatabase |
|
RewriteDistinctAggregates |
|
ReplaceDeduplicateWithAggregate |
|
ReplaceIntersectWithSemiJoin |
|
ReplaceExceptWithFilter |
|
ReplaceExceptWithAntiJoin |
|
RewriteExceptAll |
|
RewriteIntersectAll |
|
ReplaceDistinctWithAggregate |
|
PullupCorrelatedPredicates |
|
RewriteCorrelatedScalarSubquery |
|
RewritePredicateSubquery |
|
PullOutPythonUDFInJoinCondition |
Name | Initial Value | Description |
---|---|---|
|
Used in Replace Operators, Aggregate, Operator Optimizations, Decimal Optimizations, Typed Filter Optimization and LocalRelation batches (and also indirectly in the User Provided Optimizers rule batch in SparkOptimizer). |
Additional Operator Optimization Rules — extendedOperatorOptimizationRules
Extension Point
1 2 3 4 5 |
extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] |
extendedOperatorOptimizationRules
extension point defines additional rules for the Operator Optimization batch.
Note
|
extendedOperatorOptimizationRules rules are executed right after Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters.
|
batches
Final Method
1 2 3 4 5 |
batches: Seq[Batch] |
Note
|
batches is part of the RuleExecutor Contract to define the rule batches to use when executed.
|
batches
…FIXME