CatalogStatistics — Table Statistics in Metastore (External Catalog)


CatalogStatistics are table statistics that are stored in an external catalog (aka metastore):

  • Physical total size (in bytes)

  • Estimated number of rows (aka row count)

  • Column statistics (i.e. column names and their statistics)
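The three pieces of information above map naturally onto a small case class. The following is a minimal, self-contained sketch rather than Spark's actual definition: `TableStats` and `ColumnStatSketch` are hypothetical stand-ins, the field names are chosen to mirror the list, and the per-column fields are illustrative assumptions.

```scala
object CatalogStatisticsSketch {
  // Hypothetical, simplified per-column statistics (not Spark's own class).
  final case class ColumnStatSketch(distinctCount: BigInt, nullCount: BigInt)

  // Hypothetical stand-in for the table-level statistics kept in a metastore.
  final case class TableStats(
      sizeInBytes: BigInt,                                  // physical total size (in bytes)
      rowCount: Option[BigInt] = None,                      // estimated number of rows, if known
      colStats: Map[String, ColumnStatSketch] = Map.empty)  // column name -> statistics

  // A table whose size is known but whose rows were never counted.
  val sizeOnly: TableStats = TableStats(sizeInBytes = BigInt(4096))

  // The same table after the row count and one column's statistics were recorded.
  val analyzed: TableStats = sizeOnly.copy(
    rowCount = Some(BigInt(100)),
    colStats = Map(
      "id" -> ColumnStatSketch(distinctCount = BigInt(100), nullCount = BigInt(0))))
}
```

Modelling the row count as an `Option` reflects that it is an estimate that may be absent, while the size is always present.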

Note

CatalogStatistics is a “subset” of the statistics in Statistics (as there are no concepts of attributes and broadcast hint in metastore).

CatalogStatistics are often stored in a Hive metastore and are referred to as Hive statistics, while Statistics are the Spark statistics.

CatalogStatistics can be converted to Spark statistics using the toPlanStats method.

CatalogStatistics is created when:

CatalogStatistics has a text representation.
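As an illustration of what such a text representation could look like, the sketch below renders the size and, when present, the row count. The exact format (`<size> bytes, <rows> rows`) and the method name `simpleString` are assumptions made for this example, and `StatsSketch` is a hypothetical stand-in for CatalogStatistics.

```scala
object TextRepresentationSketch {
  // Hypothetical stand-in carrying only the fields needed for the text form.
  final case class StatsSketch(sizeInBytes: BigInt, rowCount: Option[BigInt] = None) {
    // Assumed format: the size always, the row count only when it is defined.
    def simpleString: String = {
      val rows = rowCount.map(n => s", $n rows").getOrElse("")
      s"$sizeInBytes bytes$rows"
    }
  }
}
```

With these assumptions, `StatsSketch(BigInt(1024), Some(BigInt(10))).simpleString` gives `1024 bytes, 10 rows`, and a value without a row count gives just `1024 bytes`.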

Converting Metastore Statistics to Spark Statistics — toPlanStats Method

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.

Note
Cost-based optimization is enabled when the spark.sql.cbo.enabled configuration property is set to true. It is disabled by default.

Otherwise, when cost-based optimization is disabled or no row count statistics are available, toPlanStats creates a Statistics with just the mandatory sizeInBytes.
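The two branches of toPlanStats can be sketched as follows. This is a simplified model under stated assumptions, not Spark's implementation: `CatalogStats` and `PlanStats` are hypothetical stand-ins, column statistics are plain strings, and the size estimate (row count times a caller-supplied `rowSizeInBytes`) is an assumed placeholder for whatever estimation Spark actually performs.

```scala
object ToPlanStatsSketch {
  // Hypothetical stand-in for the Spark-side statistics of a logical plan.
  final case class PlanStats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt] = None,
      attributeStats: Map[String, String] = Map.empty)

  // Hypothetical stand-in for the metastore-side statistics.
  final case class CatalogStats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt] = None,
      colStats: Map[String, String] = Map.empty) {

    // planOutput: names of the columns the plan produces.
    // rowSizeInBytes: assumed per-row size used for the estimate (illustrative only).
    def toPlanStats(
        planOutput: Seq[String],
        cboEnabled: Boolean,
        rowSizeInBytes: Int): PlanStats =
      rowCount match {
        // CBO on and row count known: estimated size, row count and column statistics.
        case Some(rows) if cboEnabled =>
          val attrStats = colStats.filter { case (name, _) => planOutput.contains(name) }
          PlanStats(rows * BigInt(rowSizeInBytes), rowCount, attrStats)
        // CBO off or no row count: only the mandatory size is propagated.
        case _ =>
          PlanStats(sizeInBytes)
      }
  }
}
```

In this sketch the first branch replaces the stored physical size with an estimate derived from the row count, while the fallback passes the stored size through unchanged.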

Caution
FIXME Why does toPlanStats compute sizeInBytes differently per CBO?

Note

toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

Note
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.