ColumnStat — Column Statistics

`ColumnStat` holds the statistics of a table column (as part of the table statistics in a metastore).
Name | Description |
---|---|
`distinctCount` | Number of distinct values |
`min` | Minimum value |
`max` | Maximum value |
`nullCount` | Number of `null` values |
`avgLen` | Average length of the values |
`maxLen` | Maximum length of the values |
`histogram` | Histogram of values (as `Option[Histogram]`, empty by default) |
`ColumnStat` is computed (and created from the result row) using the `ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS` SQL command (that `SparkSqlAstBuilder` translates to an `AnalyzeColumnCommand` logical command).
```scala
val cols = "id, p1, p2"
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)
```
`ColumnStat` may optionally hold the histogram of values (which is empty by default). With the `spark.sql.statistics.histogram.enabled` configuration property turned on, the `ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS` SQL command generates column (equi-height) histograms.
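For example (a minimal sketch, assuming the `t1` table used throughout this page), you could turn the property on in SQL and re-run the command to have histograms computed:

```scala
// Turn on (equi-height) histograms for ANALYZE TABLE...FOR COLUMNS
spark.sql("SET spark.sql.statistics.histogram.enabled=true")

// Re-compute the column statistics, this time with histograms
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1, p2")
```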
Note: `spark.sql.statistics.histogram.enabled` is off by default.
You can inspect the column statistics using the `DESCRIBE EXTENDED` SQL command.
```scala
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+
```
`ColumnStat` is part of the statistics of a table.
```scala
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get

scala> :type stats
org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val colStats = stats.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
```
`ColumnStat` is converted to properties (serialized) while persisting the table (statistics) to a metastore.
```scala
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

val colName = "p1"
val p1stats = colStats(colName)

scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat

import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)
```
`ColumnStat` is re-created from properties (deserialized) when `HiveExternalCatalog` is requested for restoring table statistics from properties (from a Hive metastore).
```scala
scala> :type props
Map[String,String]

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)

import org.apache.spark.sql.types.StructField
val p1 = $"p1".double

import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)

scala> :type colStatsOpt
Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
```
`ColumnStat` is also created when `JoinEstimation` is requested to `estimateInnerOuterJoin` for `Inner`, `Cross`, `LeftOuter`, `RightOuter` and `FullOuter` joins.
```scala
val tableName = "t1"

// Make the example reproducible
import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
  toDF("id", "p1", "p2").
  write.
  saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition files
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics
// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Fetch the table metadata (with column statistics) from a metastore
val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)

// The column statistics are part of the table statistics
val colStats = tableMeta.stats.get.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)
// the output may vary
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)
```
Note: `ColumnStat` does not support minimum and maximum metrics for binary (i.e. `Array[Byte]`) and string types.
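A quick way to see this (a sketch, assuming the `t1` table from the example above, where `p2` is a string column): ask for the column statistics of `p2` and the `min` and `max` rows should come back as `NULL`.

```scala
// p2 is a string column, so min and max are not computed
scala> sql("DESC EXTENDED t1 p2").where($"info_name" isin ("min", "max")).show
```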
Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method
```scala
toExternalString(v: Any, colName: String, dataType: DataType): String
```
`toExternalString`…FIXME
Note: `toExternalString` is used exclusively when `ColumnStat` is requested for statistic properties.
supportsHistogram Method
```scala
supportsHistogram(dataType: DataType): Boolean
```
`supportsHistogram`…FIXME
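Going by the signature above, a rough sketch of probing the method from a Spark shell (this assumes `supportsHistogram` is accessible on the `ColumnStat` companion object):

```scala
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.{DoubleType, StringType}

ColumnStat.supportsHistogram(DoubleType) // numeric types are expected to support histograms
ColumnStat.supportsHistogram(StringType) // strings are expected not to
```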
Note: `supportsHistogram` is used when…FIXME
Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method
```scala
toMap(colName: String, dataType: DataType): Map[String, String]
```
`toMap` converts a `ColumnStat` to the following properties.
Key | Value |
---|---|
`version` | `1` |
`distinctCount` | Number of distinct values |
`nullCount` | Number of `null` values |
`avgLen` | Average length of the values |
`maxLen` | Maximum length of the values |
`min` | Minimum value (converted to its external representation) |
`max` | Maximum value (converted to its external representation) |
`histogram` | Serialized version of Histogram (using `HistogramSerializer.serialize`) |
Note: `toMap` adds `min`, `max` and `histogram` entries only if they are available.
Note: Interestingly, the `colName` and `dataType` input parameters bring no value to `toMap` itself, but merely allow for more user-friendly error reporting when converting the `min` and `max` column statistics.
Note: `toMap` is used exclusively when `HiveExternalCatalog` is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).
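To tie serialization together with deserialization (`fromMap`, covered next), here is a minimal round-trip sketch using a hand-built `ColumnStat` rather than one fetched from a metastore (assumes a Spark shell):

```scala
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.{IntegerType, StructField}

// A hand-built ColumnStat (values match the t1.id statistics shown earlier)
val cs = ColumnStat(distinctCount = 2, min = Some(0), max = Some(1),
  nullCount = 0, avgLen = 4, maxLen = 4)

// Serialize the column statistics to properties...
val props = cs.toMap(colName = "id", dataType = IntegerType)

// ...and re-create the ColumnStat from them
val restored = ColumnStat.fromMap(table = "t1",
  field = StructField("id", IntegerType), map = props)
// restored should be Some(cs), i.e. a lossless round trip
```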
Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method
```scala
fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]
```
`fromMap` creates a `ColumnStat` by fetching properties of every column statistic from the input `map`.

`fromMap` returns `None` when recovering column statistics fails for whatever reason, printing out the following WARN message to the logs:
```
WARN Failed to parse column statistics for column [fieldName] in table [table]
```
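A quick way to trigger this behavior (a sketch, assuming a Spark shell): feed `fromMap` a map with an unparseable value and it should come back as `None`, with the WARN message above in the logs.

```scala
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.{IntegerType, StructField}

// distinctCount is expected to be a number, so parsing should fail
val corrupt = Map("distinctCount" -> "not-a-number")
val restored = ColumnStat.fromMap(table = "t1",
  field = StructField("id", IntegerType), map = corrupt)
// restored: Option[ColumnStat] = None
```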
Note: Interestingly, the `table` input parameter brings no value to `fromMap` itself, but merely allows for more user-friendly error reporting when parsing column statistics fails.
Note: `fromMap` is used exclusively when `HiveExternalCatalog` is requested for restoring table statistics from properties (from a Hive metastore).
Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method
```scala
rowToColumnStat(
  row: InternalRow,
  attr: Attribute,
  rowCount: Long,
  percentiles: Option[ArrayData]): ColumnStat
```
`rowToColumnStat` creates a `ColumnStat` from the input `row` and the following field positions:

Position | Column Statistic |
---|---|
`0` | `distinctCount` |
`1` | `min` |
`2` | `max` |
`3` | `nullCount` |
`4` | `avgLen` |
`5` | `maxLen` |

If the 6th field is not empty, `rowToColumnStat` uses it to create the histogram.
Note: `rowToColumnStat` is used exclusively when `AnalyzeColumnCommand` is executed (to compute the statistics for specified columns).