SparkOptimizer — Logical Query Plan Optimizer
SparkOptimizer is a concrete logical query plan optimizer with additional optimization rules (that extend the base logical optimization rules).
SparkOptimizer gives three extension points for additional optimization rules:

1. Pre-Optimization Batches
2. Post-Hoc Optimization Batches
3. User Provided Optimizers (i.e. extraOptimizations of ExperimentalMethods)
SparkOptimizer is created when SessionState is requested for the logical query plan optimizer for the first time (through BaseSessionStateBuilder).
SparkOptimizer is available as the optimizer property of a session-specific SessionState.
```scala
scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer

// It is a SparkOptimizer really.
// Let's check that out with a type check
import org.apache.spark.sql.execution.SparkOptimizer

scala> spark.sessionState.optimizer.isInstanceOf[SparkOptimizer]
res1: Boolean = true
```
You can access the optimized logical plan of a structured query through the QueryExecution as optimizedPlan.
```scala
// Applying two filters in sequence on purpose
// We want to kick in the CombineTypedFilters optimization
val dataset = spark.range(10).filter(_ % 2 == 0).filter(_ == 0)

// optimizedPlan is a lazy value
// Optimizations are triggered only the first time you access it
// Subsequent calls give the cached, already-optimized result
// Use explain to trigger optimizations again
scala> dataset.queryExecution.optimizedPlan
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
TypedFilter <function1>, class java.lang.Long, [StructField(value,LongType,true)], newInstance(class java.lang.Long)
+- Range (0, 10, step=1, splits=Some(8))
```
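The caching behavior noted in the comments above comes from `optimizedPlan` being a Scala `lazy val`: the initializer runs once, on first access, and every later access reuses the stored result. A minimal, Spark-free sketch of that semantics (the names here are illustrative only):

```scala
object LazyValDemo {
  var evaluations = 0 // counts how many times the initializer runs

  // Mimics QueryExecution.optimizedPlan: computed on first access only
  lazy val optimized: String = {
    evaluations += 1
    "optimized-plan"
  }

  def main(args: Array[String]): Unit = {
    println(optimized)   // first access triggers the computation
    println(optimized)   // cached result; the initializer is not re-run
    println(evaluations) // prints 1
  }
}
```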
SparkOptimizer defines the custom default rule batches.
| Batch Name | Strategy | Rules | Description |
|---|---|---|---|
| Optimize Metadata Only Query | Once | OptimizeMetadataOnlyQuery | |
| Extract Python UDF from Aggregate | Once | ExtractPythonUDFFromAggregate | |
| Prune File Source Table Partitions | Once | PruneFileSourcePartitions | |
| Push down operators to data source scan | Once | PushDownOperatorsToDataSource | Pushes down operators to underlying data sources (i.e. DataSourceV2Relations) |
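Each batch pairs a name and a set of rules with an execution strategy: a `Once` batch applies its rules a single time, while a `FixedPoint` batch keeps reapplying them until the plan stops changing (or an iteration limit is hit). A simplified, Spark-free model of that batch execution loop (the names follow Catalyst's `RuleExecutor`, but the types below are illustrative only, with plans reduced to `Int`s):

```scala
object BatchDemo {
  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations = 1 }
  final case class FixedPoint(maxIterations: Int) extends Strategy

  // A "rule" is just a plan-to-plan function in this sketch
  final case class Batch(name: String, strategy: Strategy, rules: Seq[Int => Int])

  // Applies every batch in order, re-running a batch's rules until the
  // plan no longer changes or the strategy's maxIterations is reached
  def execute(plan: Int, batches: Seq[Batch]): Int =
    batches.foldLeft(plan) { (p, batch) =>
      var current = p
      var changed = true
      var iteration = 0
      while (changed && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(current)((acc, rule) => rule(acc))
        changed = next != current // fixed point reached when the plan is stable
        current = next
        iteration += 1
      }
      current
    }

  // A toy rule: halve even "plans", leave odd ones alone
  val halve: Int => Int = n => if (n % 2 == 0) n / 2 else n

  val demoBatches = Seq(
    Batch("Halve Once", Once, Seq(halve)),
    Batch("Halve To Fixed Point", FixedPoint(100), Seq(halve))
  )

  def main(args: Array[String]): Unit =
    println(execute(16, demoBatches)) // 16 -> 8 (Once), then 8 -> 4 -> 2 -> 1 (FixedPoint)
}
```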
SparkOptimizer considers the ExtractPythonUDFFromAggregate optimization rule non-excludable.
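Optimization rules can normally be disabled through the spark.sql.optimizer.excludedRules configuration property, but a request to exclude a non-excludable rule is ignored and the rule runs regardless. A configuration sketch (the exclusion below has no effect, precisely because the rule is non-excludable):

```
spark.sql.optimizer.excludedRules=org.apache.spark.sql.execution.python.ExtractPythonUDFFromAggregate
```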
Tip: Enable logging for the `org.apache.spark.sql.execution.SparkOptimizer` logger to see what happens inside. Add the appropriate `log4j.logger` line to your log4j configuration. Refer to Logging.
Extension Point for Additional Pre-Optimization Batches — preOptimizationBatches Method
```scala
preOptimizationBatches: Seq[Batch]
```
preOptimizationBatches are the additional pre-optimization batches that are executed right before the regular optimization batches.
Extension Point for Additional Post-Hoc Optimization Batches — postHocOptimizationBatches Method
```scala
postHocOptimizationBatches: Seq[Batch] = Nil
```
postHocOptimizationBatches are the additional post-optimization batches that are executed right after the regular optimization batches (before User Provided Optimizers).
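Putting the extension points together: conceptually, the final batch sequence is the pre-optimization batches, then the default batches, then the post-hoc batches, then the User Provided Optimizers. A sketch of that composition (the `allBatches` helper is hypothetical and batches are reduced to their names; the real optimizer assembles `Batch` objects internally):

```scala
object BatchOrderDemo {
  // Batches are reduced to names for this ordering sketch
  type Batch = String

  // Hypothetical helper showing the order the batch groups are combined in
  def allBatches(
      preOptimizationBatches: Seq[Batch],
      defaultBatches: Seq[Batch],
      postHocOptimizationBatches: Seq[Batch],
      userProvidedBatches: Seq[Batch]): Seq[Batch] =
    preOptimizationBatches ++ defaultBatches ++
      postHocOptimizationBatches ++ userProvidedBatches

  def main(args: Array[String]): Unit = {
    val batches = allBatches(
      preOptimizationBatches = Seq("My Pre-Optimization Batch"),
      defaultBatches = Seq(
        "Optimize Metadata Only Query",
        "Extract Python UDF from Aggregate"),
      postHocOptimizationBatches = Seq("My Post-Hoc Batch"),
      userProvidedBatches = Seq("User Provided Optimizers"))
    batches.foreach(println) // printed in execution order
  }
}
```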