SparkOptimizer — Logical Query Plan Optimizer
SparkOptimizer is a concrete logical query plan optimizer with additional optimization rules (that extend the base logical optimization rules).
SparkOptimizer gives three extension points for additional optimization rules:

- preOptimizationBatches
- postHocOptimizationBatches
- User Provided Optimizers batch (for ExperimentalMethods.extraOptimizations)
SparkOptimizer is created when SessionState is requested for the Logical Optimizer for the first time (through BaseSessionStateBuilder).
SparkOptimizer is available as the optimizer property of a session-specific SessionState.
```
scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer

// It is a SparkOptimizer really.
// Let's check that out with a type check
import org.apache.spark.sql.execution.SparkOptimizer

scala> spark.sessionState.optimizer.isInstanceOf[SparkOptimizer]
res1: Boolean = true
```
You can access the optimized logical plan of a structured query through the QueryExecution as optimizedPlan.
```
// Applying two filters in sequence on purpose
// We want to kick the CombineTypedFilters optimizer in
val dataset = spark.range(10).filter(_ % 2 == 0).filter(_ == 0)

// optimizedPlan is a lazy value
// Only the first time you access it do you trigger the optimizations
// Subsequent calls get the cached, already-optimized result
// Use explain to trigger the optimizations again
scala> dataset.queryExecution.optimizedPlan
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
TypedFilter <function1>, class java.lang.Long, [StructField(value,LongType,true)], newInstance(class java.lang.Long)
+- Range (0, 10, step=1, splits=Some(8))
```
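To see the effect of CombineTypedFilters, you can also compare the plans before and after optimization with Dataset.explain (output omitted; it varies across Spark versions):

```scala
// Prints the parsed, analyzed, optimized and physical plans.
// The analyzed plan still shows two TypedFilter operators, while the
// optimized plan has them merged into one by CombineTypedFilters.
dataset.explain(extended = true)
```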
SparkOptimizer defines the following custom default rule batches:
Batch Name | Strategy | Rules | Description
---|---|---|---
Optimize Metadata Only Query | Once | OptimizeMetadataOnlyQuery | 
Extract Python UDF from Aggregate | Once | ExtractPythonUDFFromAggregate | 
Prune File Source Table Partitions | Once | PruneFileSourcePartitions | 
Push down operators to data source scan | Once | PushDownOperatorsToDataSource | Pushes down operators to underlying data sources (i.e. DataSourceV2Relations)
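For reference, the batch composition looks roughly as follows (an abridged sketch based on the Spark 2.4-line sources; exact method and batch names vary across versions):

```scala
// Simplified sketch of how SparkOptimizer composes its batches:
// preOptimizationBatches come first, then the base Optimizer batches,
// then the four SparkOptimizer-specific batches from the table above,
// then postHocOptimizationBatches and the User Provided Optimizers batch.
override def defaultBatches: Seq[Batch] =
  (preOptimizationBatches ++ super.defaultBatches :+
    Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+
    Batch("Extract Python UDF from Aggregate", Once, ExtractPythonUDFFromAggregate) :+
    Batch("Prune File Source Table Partitions", Once, PruneFileSourcePartitions) :+
    Batch("Push down operators to data source scan", Once, PushDownOperatorsToDataSource)) ++
  postHocOptimizationBatches :+
  Batch("User Provided Optimizers", fixedPoint, experimentalMethods.extraOptimizations: _*)
```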
SparkOptimizer considers the ExtractPythonUDFFromAggregate optimization rule as non-excludable.
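Non-excludable rules survive exclusion requests made through the spark.sql.optimizer.excludedRules configuration property (available since Spark 2.4). A minimal sketch (the fully-qualified rule name is taken from the Spark sources):

```scala
// Request that ExtractPythonUDFFromAggregate be excluded...
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.execution.python.ExtractPythonUDFFromAggregate")

// ...the optimizer keeps the rule anyway, since it is non-excludable
// (the ignored exclusion is only reported in a warning message).
```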
Tip: Enable ALL logging level for the org.apache.spark.sql.execution.SparkOptimizer logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=ALL

Refer to Logging.
Extension Point for Additional Pre-Optimization Batches — preOptimizationBatches Method
```scala
preOptimizationBatches: Seq[Batch]
```
preOptimizationBatches are the additional pre-optimization batches that are executed right before the regular optimization batches.
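A custom optimizer could inject a batch through this extension point by subclassing SparkOptimizer. A minimal sketch (MyRule, MyOptimizer and the two-argument constructor are assumptions based on the Spark 2.4-line sources):

```scala
import org.apache.spark.sql.ExperimentalMethods
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkOptimizer

// A do-nothing rule, just to demonstrate the wiring (hypothetical)
object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Hypothetical custom optimizer with an extra pre-optimization batch
// (Batch and Once are inherited from RuleExecutor)
class MyOptimizer(catalog: SessionCatalog, experimentalMethods: ExperimentalMethods)
  extends SparkOptimizer(catalog, experimentalMethods) {

  override def preOptimizationBatches: Seq[Batch] =
    Batch("My Pre-Optimization Batch", Once, MyRule) +: super.preOptimizationBatches
}
```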
Extension Point for Additional Post-Hoc Optimization Batches — postHocOptimizationBatches Method
```scala
postHocOptimizationBatches: Seq[Batch] = Nil
```
postHocOptimizationBatches are the additional post-optimization batches that are executed right after the regular optimization batches (before the User Provided Optimizers batch).
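The third extension point, the User Provided Optimizers batch, is populated from ExperimentalMethods.extraOptimizations and so does not require subclassing SparkOptimizer. A minimal sketch (NoopRule is a hypothetical rule):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A rule that leaves the plan unchanged (hypothetical)
object NoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Registered rules end up in the "User Provided Optimizers" batch
spark.experimental.extraOptimizations = Seq(NoopRule)
```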