

Cost-Based Optimization (CBO) of Logical Query Plan

Cost-Based Optimization (aka Cost-Based Query Optimization or CBO Optimizer) is an optimization technique in Spark SQL that uses table statistics to determine the most efficient query execution plan of a structured query (given the logical query plan).

Cost-based optimization is disabled by default. Spark SQL uses spark.sql.cbo.enabled configuration property to control whether CBO should be enabled and used for query optimization.

Cost-Based Optimization uses logical optimization rules (e.g. CostBasedJoinReorder) to optimize the logical plan of a structured query based on statistics.

You first use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute table statistics, and then DESCRIBE EXTENDED SQL command to inspect them.
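A minimal end-to-end sketch in spark-shell (the t1 table name is hypothetical):

  // Compute table-level statistics for the (hypothetical) t1 table
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
  // Inspect them (look for the Statistics row)
  spark.sql("DESC EXTENDED t1").show(50, truncate = false)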

Logical operators have statistics support that is used for query planning.

There is also support for equi-height column histograms.

Table Statistics

The table statistics can be computed for tables, partitions and columns and are as follows:

  1. Total size (in bytes) of a table or table partitions

  2. Row count of a table or table partitions

  3. Column statistics, i.e. min, max, num_nulls, distinct_count, avg_col_len, max_col_len, histogram

spark.sql.cbo.enabled Spark SQL Configuration Property

Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true.

Note
spark.sql.cbo.enabled configuration property is turned off, i.e. false, by default.
Tip
Use SQLConf.cboEnabled to access the current value of spark.sql.cbo.enabled property.
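For example (a sketch; note that SQLConf is an internal Spark SQL API):

  import org.apache.spark.sql.internal.SQLConf
  // Enable cost-based optimization for the current session
  spark.conf.set(SQLConf.CBO_ENABLED.key, true)
  // Both should now report the property as enabled
  assert(spark.conf.get("spark.sql.cbo.enabled") == "true")
  assert(spark.sessionState.conf.cboEnabled)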

Note
CBO is disabled explicitly in Spark Structured Streaming.

ANALYZE TABLE COMPUTE STATISTICS SQL Command

Cost-Based Optimization uses the statistics that are computed and stored in a metastore (aka external catalog) using ANALYZE TABLE SQL command.

Depending on the variant, ANALYZE TABLE computes different statistics, i.e. of a table, partitions or columns.

  1. ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause

  2. ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause)

  3. ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification)
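In Scala (via spark.sql; table t1 with partition column p1 and columns id and name is hypothetical):

  // 1. Table-level statistics (total size and row count)
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
  // 2. Partition-level statistics
  spark.sql("ANALYZE TABLE t1 PARTITION (p1 = '2018-01-01') COMPUTE STATISTICS")
  // 3. Column-level statistics
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, name")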

Tip

Use spark.sql.statistics.histogram.enabled configuration property to enable column (equi-height) histograms that can provide better estimation accuracy (but cause an extra table scan).

spark.sql.statistics.histogram.enabled is off by default.

Note

ANALYZE TABLE with PARTITION specification and FOR COLUMNS clause is incorrect.

In such a case, SparkSqlAstBuilder reports a WARN message to the logs and simply ignores the partition specification.

When executed, the above ANALYZE TABLE variants are translated to the following logical commands (in a logical query plan), respectively:

  1. AnalyzeTableCommand

  2. AnalyzePartitionCommand

  3. AnalyzeColumnCommand

DESCRIBE EXTENDED SQL Command

You can view the statistics of a table, partitions or a column (stored in a metastore) using DESCRIBE EXTENDED SQL command.

Table-level statistics are in the Statistics row while partition-level statistics are in the Partition Statistics row.

Tip
Use DESC EXTENDED tableName for table-level statistics and DESC EXTENDED tableName PARTITION (p1, p2, …​) for partition-level statistics only.

You can view the statistics of a single column using DESC EXTENDED tableName columnName, which gives a Dataset with two columns, i.e. info_name and info_value.
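For example (continuing with the hypothetical t1 table from above):

  // Table-level statistics in the Statistics row
  spark.sql("DESC EXTENDED t1").show(50, truncate = false)
  // Partition-level statistics in the Partition Statistics row
  spark.sql("DESC EXTENDED t1 PARTITION (p1 = '2018-01-01')").show(50, truncate = false)
  // Column-level statistics as info_name/info_value rows
  spark.sql("DESC EXTENDED t1 id").show(truncate = false)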

Cost-Based Optimizations

The Spark Optimizer uses heuristics (rules) that are applied to a logical query plan for cost-based optimization.

Among the optimization rules are the following:

  1. CostBasedJoinReorder logical optimization rule for join reordering with 2 or more consecutive inner or cross joins (possibly separated by Project operators) when spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled configuration properties are both enabled.
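A sketch that turns on both properties for the current session:

  // CostBasedJoinReorder requires both properties enabled
  spark.conf.set("spark.sql.cbo.enabled", true)
  spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)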

Logical Commands for Altering Table Statistics

The following are the logical commands that alter table statistics in a metastore (aka external catalog):

  1. AnalyzeTableCommand

  2. AnalyzeColumnCommand

  3. AlterTableAddPartitionCommand

  4. AlterTableDropPartitionCommand

  5. AlterTableSetLocationCommand

  6. TruncateTableCommand

  7. InsertIntoHiveTable

  8. InsertIntoHadoopFsRelationCommand

  9. LoadDataCommand

EXPLAIN COST SQL Command

Caution
FIXME See LogicalPlanStats

LogicalPlanStats — Statistics Estimates of Logical Operator

LogicalPlanStats adds statistics support to logical operators and is used for query planning (with or without cost-based optimization, e.g. CostBasedJoinReorder or JoinSelection, respectively).
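You can access the statistics estimates of a structured query through its optimized logical plan (a sketch over the hypothetical t1 table):

  import org.apache.spark.sql.functions.col
  val q = spark.table("t1").where(col("id") > 0)
  // Statistics estimates: sizeInBytes always, rowCount with CBO enabled
  val stats = q.queryExecution.optimizedPlan.stats
  println(stats.simpleString)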

Equi-Height Histograms for Columns

Equi-height histograms are effective in cardinality estimation, and are more accurate than basic column statistics (min, max, ndv, etc.), especially for skewed data distributions.

In an equi-height histogram, all bins (intervals) have the same height (frequency). The default number of bins is 254.

Spark SQL uses a two-step method to generate an equi-height histogram:

  1. use ApproximatePercentile (percentile_approx) to get the percentiles p(0), p(1/n), p(2/n) … p((n-1)/n), p(1) that serve as the endpoints of the equi-height bin intervals;

  2. construct the range values of the buckets, e.g. [p(0), p(1/n)], [p(1/n), p(2/n)] … [p((n-1)/n), p(1)], and use ApproxCountDistinctForIntervals to count the number of distinct values (ndv) in each bucket. Each bucket is of the form (lowerBound, higherBound, ndv).

Note that this method takes two table scans. In the future, other algorithms that need only one table scan may be provided.
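The following sketch mimics the two-step method with public APIs. approxQuantile stands in for ApproximatePercentile, and a per-bucket approx_count_distinct stands in for ApproxCountDistinctForIntervals (which counts all intervals in a single scan); table t1 and column c are hypothetical:

  import org.apache.spark.sql.functions.{approx_count_distinct, col}

  val df = spark.table("t1")
  val numBins = 4 // Spark SQL's default is 254
  // Step 1: percentiles p(0), p(1/n), ..., p(1) = the bin endpoints
  val probabilities = (0 to numBins).map(_.toDouble / numBins).toArray
  val endpoints = df.stat.approxQuantile("c", probabilities, 0.001)
  // Step 2: (lowerBound, higherBound, ndv) for every bucket
  val buckets = endpoints.sliding(2).toSeq.map { case Array(lo, hi) =>
    val ndv = df.where(col("c") >= lo && col("c") <= hi)
      .agg(approx_count_distinct(col("c")))
      .head().getLong(0)
    (lo, hi, ndv)
  }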

Spark SQL uses column statistics that may optionally hold the histogram of values (which is empty by default). With spark.sql.statistics.histogram.enabled configuration property turned on, ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.

Note
spark.sql.statistics.histogram.enabled is off by default.

You can inspect the column statistics using DESCRIBE EXTENDED SQL command.
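For example (continuing the hypothetical t1/c sketch):

  // Generate equi-height histograms while computing column statistics
  spark.conf.set("spark.sql.statistics.histogram.enabled", true)
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS c")
  // The histogram appears among the column statistics
  spark.sql("DESC EXTENDED t1 c").show(truncate = false)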
