
SparkOptimizer — Logical Query Plan Optimizer

SparkOptimizer is a concrete logical query plan optimizer with additional optimization rules (that extend the base logical optimization rules).

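SparkOptimizer gives three extension points for additional optimization rules: pre-optimization batches (preOptimizationBatches), post-hoc optimization batches (postHocOptimizationBatches), and user-provided optimizers (extraOptimizations of the ExperimentalMethods).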

SparkOptimizer is created when SessionState is requested for the Logical Optimizer the first time (through BaseSessionStateBuilder).

Figure 1. Creating SparkOptimizer

SparkOptimizer is available as the optimizer property of a session-specific SessionState.

You can access the optimized logical plan of a structured query through its QueryExecution as optimizedPlan.
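
The two access paths above can be tried in spark-shell or in a script with Spark SQL on the classpath. The following is a minimal sketch, assuming a local session; the application name, master URL and the sample query are arbitrary choices made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkOptimizer

// Local session for illustration only (in spark-shell, getOrCreate reuses the existing session).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("optimized-plan-demo")
  .getOrCreate()

// The session-specific logical optimizer, available through SessionState.
val optimizer = spark.sessionState.optimizer
println(optimizer.isInstanceOf[SparkOptimizer])

// A simple structured query and its logical plan after optimization.
val query = spark.range(10).filter("id > 5").select("id")
val optimizedPlan = query.queryExecution.optimizedPlan
println(optimizedPlan.numberedTreeString)
```

Note that sessionState is marked as an unstable API in Spark, so this is best treated as an exploration aid rather than something to rely on in production code.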

SparkOptimizer defines the custom default rule batches.

Table 1. SparkOptimizer’s Default Optimization Batch Rules (in the order of execution)

| Batch Name | Strategy | Rules | Description |
|---|---|---|---|
| preOptimizationBatches | | | |
| Base Logical Optimization Batches | | | |
| Optimize Metadata Only Query | Once | OptimizeMetadataOnlyQuery | |
| Extract Python UDF from Aggregate | Once | ExtractPythonUDFFromAggregate | |
| Prune File Source Table Partitions | Once | PruneFileSourcePartitions | |
| Push down operators to data source scan | Once | PushDownOperatorsToDataSource | Pushes down operators to underlying data sources (i.e. DataSourceV2Relations) |
| postHocOptimizationBatches | | | |
| User Provided Optimizers | FixedPoint | extraOptimizations of the ExperimentalMethods | |
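
The last batch, User Provided Optimizers, is filled from extraOptimizations of the ExperimentalMethods. The following is a minimal sketch of registering a rule there, assuming a local session. CountingNoopRule is a hypothetical rule written for this example: it leaves the plan unchanged and only counts how often it is applied.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule: counts invocations and returns the plan as-is.
object CountingNoopRule extends Rule[LogicalPlan] {
  val invocations = new AtomicInteger(0)

  override def apply(plan: LogicalPlan): LogicalPlan = {
    invocations.incrementAndGet()
    plan
  }
}

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extra-optimizations-demo")
  .getOrCreate()

// Register the rule as a user-provided optimization.
spark.experimental.extraOptimizations = Seq(CountingNoopRule)

// Requesting the optimized plan runs the optimizer, including the user-provided batch.
val plan = spark.range(10).filter("id > 5").queryExecution.optimizedPlan
println(s"CountingNoopRule applied ${CountingNoopRule.invocations.get} time(s)")
```

Because the batch uses the FixedPoint strategy, the rule may be applied more than once per query, so a real rule should be idempotent.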

SparkOptimizer considers the ExtractPythonUDFFromAggregate optimization rule non-excludable.

Tip

Enable DEBUG or TRACE logging levels for org.apache.spark.sql.execution.SparkOptimizer logger to see what happens inside.

Add the following line to conf/log4j.properties (shown here for the TRACE level):

log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=TRACE

Refer to Logging.

Creating SparkOptimizer Instance

SparkOptimizer takes the following when created:

Extension Point for Additional Pre-Optimization Batches — preOptimizationBatches Method

preOptimizationBatches are the additional pre-optimization batches that are executed right before the regular optimization batches.

Extension Point for Additional Post-Hoc Optimization Batches — postHocOptimizationBatches Method

postHocOptimizationBatches are the additional post-optimization batches that are executed right after the regular optimization batches (before User Provided Optimizers).
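
The ordering described by the two extension points and Table 1 can be illustrated with a small model in plain Scala. This is a sketch only and does not use Spark's classes: ModelOptimizer, CustomOptimizer and the two custom batch names are hypothetical, and batches are represented by their names alone.

```scala
// Simplified stand-in for an optimizer with two overridable extension points.
// Not Spark's actual API: batches are modeled as plain names.
class ModelOptimizer {
  // Extension points: empty by default, overridden by subclasses.
  def preOptimizationBatches: Seq[String] = Nil
  def postHocOptimizationBatches: Seq[String] = Nil

  // Batch names in the order of execution, following Table 1.
  final def batchNames: Seq[String] = {
    val regular = Seq(
      "Base Logical Optimization Batches",
      "Optimize Metadata Only Query",
      "Extract Python UDF from Aggregate",
      "Prune File Source Table Partitions",
      "Push down operators to data source scan")
    (preOptimizationBatches ++ regular ++ postHocOptimizationBatches) :+
      "User Provided Optimizers"
  }
}

// Hypothetical optimizer that plugs a batch into each extension point.
object CustomOptimizer extends ModelOptimizer {
  override def preOptimizationBatches: Seq[String] = Seq("My Pre-Optimization Batch")
  override def postHocOptimizationBatches: Seq[String] = Seq("My Post-Hoc Batch")
}

CustomOptimizer.batchNames.foreach(println)
```

In this model the pre-optimization batch comes first, the post-hoc batch follows the regular batches, and User Provided Optimizers always runs last.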
