ColumnStat — Column Statistics

`ColumnStat` holds the statistics of a table column (as part of the table statistics in a metastore).
Name | Description |
---|---|
`distinctCount` | Number of distinct values |
`min` | Minimum value |
`max` | Maximum value |
`nullCount` | Number of `null` values |
`avgLen` | Average length of the values |
`maxLen` | Maximum length of the values |
`histogram` | Histogram of values (as `Option[Histogram]`, empty by default) |
`ColumnStat` is computed (and created from the result row) using the `ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS` SQL command (that `SparkSqlAstBuilder` translates to an `AnalyzeColumnCommand` logical command).
```scala
val cols = "id, p1, p2"
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)
```
`ColumnStat` may optionally hold the histogram of values (which is empty by default). With the `spark.sql.statistics.histogram.enabled` configuration property turned on, the `ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS` SQL command generates column (equi-height) histograms.
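For example (a minimal sketch, assuming the `t1` table used throughout this page), you could turn the property on in SQL and re-run the command to have histograms computed:

```scala
// Turn on (equi-height) histograms for ANALYZE TABLE...FOR COLUMNS
spark.sql("SET spark.sql.statistics.histogram.enabled=true")

// Re-compute the column statistics, this time with histograms
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1, p2")
```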
Note: `spark.sql.statistics.histogram.enabled` is off by default.
You can inspect the column statistics using the `DESCRIBE EXTENDED` SQL command.
```scala
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+
```
`ColumnStat` is part of the statistics of a table.
```scala
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get

scala> :type stats
org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val colStats = stats.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
```
`ColumnStat` is converted to properties (serialized) while persisting the table (statistics) to a metastore.
```scala
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

val colName = "p1"
val p1stats = colStats(colName)

scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat

import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)
```
`ColumnStat` is re-created from properties (deserialized) when `HiveExternalCatalog` is requested for restoring table statistics from properties (from a Hive metastore).
```scala
scala> :type props
Map[String,String]

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)

import org.apache.spark.sql.types.StructField
val p1 = $"p1".double

import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)

scala> :type colStatsOpt
Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
```
`ColumnStat` is also created when `JoinEstimation` is requested to `estimateInnerOuterJoin` for `Inner`, `Cross`, `LeftOuter`, `RightOuter` and `FullOuter` joins.
```scala
val tableName = "t1"

// Make the example reproducible
import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
  toDF("id", "p1", "p2").
  write.
  saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition files
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics
// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Fetch the table metadata (with column statistics) from a metastore
val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)

// The column statistics are part of the table statistics
val colStats = tableMeta.stats.get.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)
// the output may vary
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)
```
Note: `ColumnStat` does not support minimum and maximum metrics for binary (i.e. `Array[Byte]`) and string types.
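A quick way to see this (a sketch, assuming the `t1` table from the example above, where `p2` is a string column): ask for the column statistics of `p2` and the `min` and `max` rows should come back as `NULL`.

```scala
// p2 is a string column, so min and max are not computed
scala> sql("DESC EXTENDED t1 p2").where($"info_name" isin ("min", "max")).show
```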
Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method
```scala
toExternalString(v: Any, colName: String, dataType: DataType): String
```
`toExternalString`…FIXME
Note: `toExternalString` is used exclusively when `ColumnStat` is requested for statistic properties.
supportsHistogram Method
```scala
supportsHistogram(dataType: DataType): Boolean
```
`supportsHistogram`…FIXME
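Going by the signature above, a rough sketch of probing the method from a Spark shell (this assumes `supportsHistogram` is accessible on the `ColumnStat` companion object):

```scala
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.{DoubleType, StringType}

ColumnStat.supportsHistogram(DoubleType) // numeric types are expected to support histograms
ColumnStat.supportsHistogram(StringType) // strings are expected not to
```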
Note: `supportsHistogram` is used when…FIXME
Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method
```scala
toMap(colName: String, dataType: DataType): Map[String, String]
```
`toMap` converts a `ColumnStat` to the following properties.
Key | Value |
---|---|
`version` | `1` |
`distinctCount` | Number of distinct values |
`nullCount` | Number of `null` values |
`avgLen` | Average length of the values |
`maxLen` | Maximum length of the values |
`min` | Minimum value (converted to its external representation) |
`max` | Maximum value (converted to its external representation) |
`histogram` | Serialized version of Histogram (using `HistogramSerializer.serialize`) |
Note: `toMap` adds `min`, `max` and `histogram` entries only if they are available.
Note: Interestingly, the `colName` and `dataType` input parameters bring no value to `toMap` itself, but merely allow for more user-friendly error reporting when converting the `min` and `max` column statistics.
Note: `toMap` is used exclusively when `HiveExternalCatalog` is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).
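To tie serialization together with deserialization (`fromMap`, covered next), here is a minimal round-trip sketch using a hand-built `ColumnStat` rather than one fetched from a metastore (assumes a Spark shell):

```scala
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.{IntegerType, StructField}

// A hand-built ColumnStat (values match the t1.id statistics shown earlier)
val cs = ColumnStat(distinctCount = 2, min = Some(0), max = Some(1),
  nullCount = 0, avgLen = 4, maxLen = 4)

// Serialize the column statistics to properties...
val props = cs.toMap(colName = "id", dataType = IntegerType)

// ...and re-create the ColumnStat from them
val restored = ColumnStat.fromMap(table = "t1",
  field = StructField("id", IntegerType), map = props)
// restored should be Some(cs), i.e. a lossless round trip
```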
Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method
```scala
fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]
```
`fromMap` creates a `ColumnStat` by fetching properties of every column statistic from the input `map`.

`fromMap` returns `None` when recovering column statistics fails for whatever reason, printing out the following WARN message to the logs:
```
WARN Failed to parse column statistics for column [fieldName] in table [table]
```
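A quick way to trigger this behavior (a sketch, assuming a Spark shell): feed `fromMap` a map with an unparseable value and it should come back as `None`, with the WARN message above in the logs.

```scala
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.{IntegerType, StructField}

// distinctCount is expected to be a number, so parsing should fail
val corrupt = Map("distinctCount" -> "not-a-number")
val restored = ColumnStat.fromMap(table = "t1",
  field = StructField("id", IntegerType), map = corrupt)
// restored: Option[ColumnStat] = None
```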
Note: Interestingly, the `table` input parameter brings no value to `fromMap` itself, but merely allows for more user-friendly error reporting when parsing column statistics fails.
Note: `fromMap` is used exclusively when `HiveExternalCatalog` is requested for restoring table statistics from properties (from a Hive metastore).
Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method
```scala
rowToColumnStat(
  row: InternalRow,
  attr: Attribute,
  rowCount: Long,
  percentiles: Option[ArrayData]): ColumnStat
```
`rowToColumnStat` creates a `ColumnStat` from the input `row` and the following field positions:

Position | Column Statistic |
---|---|
`0` | `distinctCount` |
`1` | `min` |
`2` | `max` |
`3` | `nullCount` |
`4` | `avgLen` |
`5` | `maxLen` |

If the 6th field is not empty, `rowToColumnStat` uses it to create the histogram.
Note: `rowToColumnStat` is used exclusively when `AnalyzeColumnCommand` is executed (to compute the statistics for specified columns).