JoinEstimation

JoinEstimation is created exclusively for BasicStatsPlanVisitor to estimate statistics of a Join logical operator.

Note
BasicStatsPlanVisitor is used only when cost-based optimization is enabled.

JoinEstimation takes a Join logical operator when created.

When created, JoinEstimation immediately takes the estimated statistics and query hints of the left and right sides of the Join logical operator.

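JoinEstimation can estimate statistics and query hints of a Join logical operator with the following join types:

  • Inner, Cross, LeftOuter, RightOuter, FullOuter, LeftSemi and LeftAnti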

For the other join types (e.g. ExistenceJoin), JoinEstimation prints out a DEBUG message to the logs and returns None (to “announce” that no statistics could be computed).

Tip

Enable DEBUG logging level for org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation=DEBUG

Refer to Logging.

estimateInnerOuterJoin Internal Method

estimateInnerOuterJoin destructures the Join logical operator into a join type with the left and right keys.

estimateInnerOuterJoin simply returns None (i.e. nothing) when either side of the Join logical operator has no row count statistic.

Note
estimateInnerOuterJoin is used exclusively when JoinEstimation is requested to estimate statistics and query hints of a Join logical operator for Inner, Cross, LeftOuter, RightOuter and FullOuter joins.

computeByNdv Internal Method

computeByNdv…​FIXME

Note
computeByNdv is used exclusively when JoinEstimation is requested for computeCardinalityAndStats.

computeCardinalityAndStats Internal Method

computeCardinalityAndStats…​FIXME

Note
computeCardinalityAndStats is used exclusively when JoinEstimation is requested for estimateInnerOuterJoin.

Computing Join Cardinality Using Equi-Height Histograms — computeByHistogram Internal Method

computeByHistogram…​FIXME

Note
computeByHistogram is used exclusively when JoinEstimation is requested for computeCardinalityAndStats (and the histograms of both column attributes used in a join are available).

Estimating Statistics for Left Semi and Left Anti Joins — estimateLeftSemiAntiJoin Internal Method

estimateLeftSemiAntiJoin estimates statistics of the Join logical operator only when the estimated row count statistic is available. Otherwise, estimateLeftSemiAntiJoin simply returns None (i.e. no statistics estimated).

Note
The row count statistic of a table is available only after running the ANALYZE TABLE COMPUTE STATISTICS SQL command.

If available, estimateLeftSemiAntiJoin takes the estimated row count statistic of the left side of the Join operator.

Note
Use the ANALYZE TABLE COMPUTE STATISTICS SQL command on the left logical plan to compute row count statistics.

Note
Use the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command on the left logical plan to generate column (equi-height) histograms for more accurate estimations.
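The two notes above correspond to SQL statements like the ones built below. This is only a sketch in Python: the helper analyze_table_sql and the table and column names are hypothetical and exist purely for illustration, while the ANALYZE TABLE ... COMPUTE STATISTICS [FOR COLUMNS ...] syntax is Spark SQL's own.

```python
from typing import Optional, Sequence


def analyze_table_sql(table: str, columns: Optional[Sequence[str]] = None) -> str:
    """Build an ANALYZE TABLE statement (hypothetical helper, illustration only).

    Without columns, the statement computes table-level statistics such as
    the row count. With columns, it computes column-level statistics.
    """
    statement = f"ANALYZE TABLE {table} COMPUTE STATISTICS"
    if columns:
        statement += " FOR COLUMNS " + ", ".join(columns)
    return statement


# Hypothetical table and column names.
table_stats_sql = analyze_table_sql("orders")
column_stats_sql = analyze_table_sql("orders", ["customer_id", "amount"])

print(table_stats_sql)
print(column_stats_sql)
```

With a SparkSession available as spark, each string would be passed to spark.sql(...). Note that, as far as I know, equi-height histograms are only generated by the FOR COLUMNS variant when spark.sql.statistics.histogram.enabled is set to true.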

In the end, estimateLeftSemiAntiJoin creates a new Statistics with the following estimates:

  1. Total size (in bytes) is the output size computed from the output schema of the join, the row count statistic (aka output rows) and the column histograms.

  2. Row count is exactly the row count of the left side.

  3. Column histograms are exactly the column histograms of the left side.
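The three estimates above can be modelled in a few lines of Python. This is a simplified sketch, not Spark's code: the Statistics dataclass is a stand-in for the real class, column statistics are reduced to an average width in bytes per column, and output_size is an assumed size model (rows times summed column widths) rather than Spark's actual formula.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Statistics:
    """Simplified stand-in for Spark's Statistics (not the real class)."""
    size_in_bytes: int
    row_count: Optional[int] = None
    # Column name -> average width in bytes; a stand-in for per-column statistics.
    column_stats: Dict[str, int] = field(default_factory=dict)


def output_size(row_count: int, column_stats: Dict[str, int]) -> int:
    """Assumed size model: rows times the summed average column widths."""
    return max(1, row_count * sum(column_stats.values()))


def estimate_left_semi_anti(left: Statistics) -> Optional[Statistics]:
    """Estimate a left semi/anti join, or return None without a left row count."""
    if left.row_count is None:
        return None
    # The left side's row count and column statistics are propagated unchanged;
    # only the total size is recomputed from them.
    return Statistics(
        size_in_bytes=output_size(left.row_count, left.column_stats),
        row_count=left.row_count,
        column_stats=dict(left.column_stats),
    )


# Hypothetical left side: 100 rows with two columns.
left_stats = Statistics(size_in_bytes=0, row_count=100,
                        column_stats={"id": 8, "name": 20})
estimated = estimate_left_semi_anti(left_stats)
```

Propagating the left row count gives an upper bound, since a left semi or left anti join can only keep or drop rows of the left side.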

Note
estimateLeftSemiAntiJoin is used exclusively when JoinEstimation is requested to estimate statistics and query hints for LeftSemi and LeftAnti joins.

Estimating Statistics and Query Hints of Join Logical Operator — estimate Method

estimate estimates statistics and query hints of the Join logical operator per join type:

  • Inner, Cross, LeftOuter, RightOuter and FullOuter: estimateInnerOuterJoin

  • LeftSemi and LeftAnti: estimateLeftSemiAntiJoin

For other join types, estimate prints out a DEBUG message to the logs and returns None (to “announce” that no statistics could be computed).
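The per-join-type dispatch can be sketched as a small Python model. It assumes the mapping stated in the notes of this page (inner and outer joins go to estimateInnerOuterJoin, left semi and left anti joins go to estimateLeftSemiAntiJoin). The JoinType enum, the handler callables and the wording of the DEBUG message are all hypothetical stand-ins, not Spark's implementation.

```python
import logging
from enum import Enum, auto
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("JoinEstimation")  # hypothetical logger name


class JoinType(Enum):
    INNER = auto()
    CROSS = auto()
    LEFT_OUTER = auto()
    RIGHT_OUTER = auto()
    FULL_OUTER = auto()
    LEFT_SEMI = auto()
    LEFT_ANTI = auto()
    EXISTENCE = auto()  # example of an unsupported join type


INNER_OUTER = {JoinType.INNER, JoinType.CROSS, JoinType.LEFT_OUTER,
               JoinType.RIGHT_OUTER, JoinType.FULL_OUTER}
LEFT_SEMI_ANTI = {JoinType.LEFT_SEMI, JoinType.LEFT_ANTI}


def estimate(join_type: JoinType,
             estimate_inner_outer: Callable[[], Optional[T]],
             estimate_left_semi_anti: Callable[[], Optional[T]]) -> Optional[T]:
    """Pick the estimation routine for the join type, or give up with None."""
    if join_type in INNER_OUTER:
        return estimate_inner_outer()
    if join_type in LEFT_SEMI_ANTI:
        return estimate_left_semi_anti()
    # The message text is an assumption made for this sketch.
    logger.debug("Unsupported join type: %s", join_type.name)
    return None


# The handlers are stubs that only report which branch was taken.
result = estimate(JoinType.LEFT_SEMI, lambda: "inner/outer", lambda: "semi/anti")
```

Returning None rather than raising lets the caller fall back to a different way of computing statistics when a join type is unsupported.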

Note
estimate is used exclusively when BasicStatsPlanVisitor is requested to estimate statistics and query hints of a Join logical operator.