CatalogStatistics — Table Statistics From External Catalog (Metastore)
CatalogStatistics
are table statistics that are stored in an external catalog (aka metastore):
-
Column statistics (i.e. column names and their statistics)
Note
|
|
CatalogStatistics
can be converted to Spark statistics using toPlanStats method.
CatalogStatistics
is created when:
-
AnalyzeColumnCommand,
AlterTableAddPartitionCommand
andTruncateTableCommand
commands are executed (and store statistics in ExternalCatalog) -
CommandUtils
is requested for updating existing table statistics, the current statistics (if changed) -
HiveExternalCatalog
is requested for restoring Spark statistics from properties (from a Hive Metastore) -
DetermineTableStats and PruneFileSourcePartitions logical optimizations are executed (i.e. applied to a logical plan)
-
HiveClientImpl
is requested for a table or partition statistics from Hive’s parameters
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
scala> :type spark.sessionState.catalog org.apache.spark.sql.catalyst.catalog.SessionCatalog // Using higher-level interface to access CatalogStatistics // Make sure that you ran ANALYZE TABLE (as described above) val db = spark.catalog.currentDatabase val tableName = "t1" val metadata = spark.sharedState.externalCatalog.getTable(db, tableName) val stats = metadata.stats scala> :type stats Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics] // Using low-level internal SessionCatalog interface to access CatalogTables val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName) val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid) val stats = metadata.stats scala> :type stats Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics] |
CatalogStatistics
has a text representation.
1 2 3 4 5 6 7 8 9 |
scala> :type stats Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics] scala> stats.map(_.simpleString).foreach(println) 714 bytes, 2 rows |
Converting Metastore Statistics to Spark Statistics — toPlanStats
Method
1 2 3 4 5 |
toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics |
toPlanStats
converts the table statistics (from an external metastore) to Spark statistics.
With cost-based optimization enabled and row count statistics available, toPlanStats
creates a Statistics with the estimated total (output) size, row count and column statistics.
Note
|
Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true , and is disabled by default.
|
Otherwise, when cost-based optimization is disabled, toPlanStats
creates a Statistics with just the mandatory sizeInBytes.
Caution
|
FIXME Why does toPlanStats compute sizeInBytes differently per CBO?
|
Note
|
|
Note
|
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.
|