ColumnStat — Column Statistics

ColumnStat holds the statistics of a table column (as part of the table statistics in a metastore).

Table 1. Column Statistics

Name           Description
distinctCount  Number of distinct values
min            Minimum value
max            Maximum value
nullCount      Number of null values
avgLen         Average length of the values
maxLen         Maximum length of the values
histogram      Histogram of values (as Histogram, which is empty by default)

ColumnStat is computed (and created from the result row) using the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command (which SparkSqlAstBuilder translates to the AnalyzeColumnCommand logical command).

ColumnStat may optionally hold the histogram of values, which is empty by default. With the spark.sql.statistics.histogram.enabled configuration property turned on, the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.

Note
spark.sql.statistics.histogram.enabled is off by default.
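
To illustrate what "equi-height" means, here is a minimal sketch that picks bin boundaries so that every bin covers roughly the same number of values. It is not Spark's implementation (Spark's own computation may differ, e.g. by using approximate percentiles); the EquiHeight object and its boundaries method are hypothetical names used only for this illustration.

```scala
// Illustration only: EquiHeight is a hypothetical helper, not part of Spark.
object EquiHeight {
  // Returns numBins + 1 boundaries such that each bin holds
  // roughly the same number of the input values.
  def boundaries(values: Seq[Double], numBins: Int): Seq[Double] = {
    require(values.nonEmpty && numBins > 0, "need values and at least one bin")
    val sorted = values.sorted.toIndexedSeq
    (0 to numBins).map { i =>
      // Position of the i-th boundary, clamped to the last element.
      val idx = math.min(sorted.length - 1, i * sorted.length / numBins)
      sorted(idx)
    }
  }
}
```

With eight evenly spread values and four bins, every bin ends up with two values, whatever the width of the bin is.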

You can inspect the column statistics using DESCRIBE EXTENDED SQL command.
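
The workflow above can be sketched as a handful of SQL statements. The helper object ColumnStatsSql, the table name t1 and the columns id and name are hypothetical and serve only as an example; in spark-shell the resulting strings would be passed to spark.sql.

```scala
// Sketch only: ColumnStatsSql and the table/column names are hypothetical.
object ColumnStatsSql {
  // Turns on histogram generation (off by default).
  val enableHistograms: String =
    "SET spark.sql.statistics.histogram.enabled=true"

  // Computes column statistics for the given columns of a table.
  def analyzeColumns(table: String, columns: Seq[String]): String = {
    val cols = columns.mkString(", ")
    s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS $cols"
  }

  // Displays the statistics of a single column.
  def describeColumn(table: String, column: String): String =
    s"DESCRIBE EXTENDED $table $column"
}

// In spark-shell (where `spark` is the SparkSession):
//   spark.sql(ColumnStatsSql.enableHistograms)
//   spark.sql(ColumnStatsSql.analyzeColumns("t1", Seq("id", "name")))
//   spark.sql(ColumnStatsSql.describeColumn("t1", "id")).show(truncate = false)
```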

ColumnStat is part of the statistics of a table.

ColumnStat is converted to properties (serialized) while persisting the table (statistics) to a metastore.

ColumnStat is re-created from properties (deserialized) when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).

ColumnStat is also created when JoinEstimation is requested to estimateInnerOuterJoin for Inner, Cross, LeftOuter, RightOuter and FullOuter joins.

Note
ColumnStat does not support minimum and maximum metrics for binary (i.e. Array[Byte]) and string types.

Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method

toExternalString…​FIXME

Note
toExternalString is used exclusively when ColumnStat is requested for statistic properties.

supportsHistogram Method

supportsHistogram…​FIXME

Note
supportsHistogram is used when…​FIXME

Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method

toMap converts ColumnStat to the properties.

Table 2. ColumnStat.toMap’s Properties

Key            Value
version        1
distinctCount  distinctCount
nullCount      nullCount
avgLen         avgLen
maxLen         maxLen
min            External/Java representation of min
max            External/Java representation of max
histogram      Serialized version of Histogram (using HistogramSerializer.serialize)

Note
toMap adds the min, max and histogram entries only if they are available.

Note
Interestingly, the colName and dataType input parameters bring no value to toMap itself, but merely allow for more user-friendly error reporting when converting the min and max column statistics.

Note
toMap is used exclusively when HiveExternalCatalog is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).
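
The serialization described above can be approximated with a small, self-contained model. ColumnStatModel is a hypothetical stand-in rather than Spark's ColumnStat: it assumes min and max are already in their external string form and that the histogram is already serialized, so the conversion steps (and the colName and dataType parameters) are left out.

```scala
// Simplified stand-in for ColumnStat; not Spark's actual class.
// min, max and histogram are assumed to be pre-converted strings.
final case class ColumnStatModel(
    distinctCount: BigInt,
    min: Option[String],
    max: Option[String],
    nullCount: BigInt,
    avgLen: Long,
    maxLen: Long,
    histogram: Option[String] = None) {

  def toMap: Map[String, String] = {
    // Entries that are always written.
    val required = Map(
      "version" -> "1",
      "distinctCount" -> distinctCount.toString,
      "nullCount" -> nullCount.toString,
      "avgLen" -> avgLen.toString,
      "maxLen" -> maxLen.toString)
    // Entries written only when the statistic is available.
    val optional = Seq("min" -> min, "max" -> max, "histogram" -> histogram)
      .collect { case (key, Some(value)) => key -> value }
    required ++ optional
  }
}
```

A statistic without min, max and histogram therefore produces exactly the five always-present keys.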

Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method

fromMap creates a ColumnStat by fetching properties of every column statistic from the input map.

fromMap returns None when recovering column statistics fails for whatever reason.

Note
Interestingly, the table input parameter brings no value to fromMap itself, but merely allows for more user-friendly error reporting when parsing column statistics fails.

Note
fromMap is used exclusively when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).
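
The "None on any failure" behaviour can be sketched with scala.util.Try. The names ColumnStatModel and ColumnStatProps are hypothetical, and the model is deliberately simplified: it keeps min, max and histogram as raw strings and treats every numeric property as mandatory, which is an assumption of this sketch and not necessarily how Spark handles missing entries.

```scala
import scala.util.Try

// Simplified stand-in for ColumnStat; not Spark's actual class.
final case class ColumnStatModel(
    distinctCount: BigInt,
    min: Option[String],
    max: Option[String],
    nullCount: BigInt,
    avgLen: Long,
    maxLen: Long,
    histogram: Option[String] = None)

object ColumnStatProps {
  // Any missing or malformed mandatory property makes the whole
  // recovery fail, which is reported as None instead of an exception.
  def fromMap(map: Map[String, String]): Option[ColumnStatModel] =
    Try {
      ColumnStatModel(
        distinctCount = BigInt(map("distinctCount")),
        min = map.get("min"),
        max = map.get("max"),
        nullCount = BigInt(map("nullCount")),
        avgLen = map("avgLen").toLong,
        maxLen = map("maxLen").toLong,
        histogram = map.get("histogram"))
    }.toOption
}
```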

Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method

rowToColumnStat creates a ColumnStat from the input row, reading each statistic from a fixed position in the row.

If the 6th field is not empty, rowToColumnStat uses it to create histogram.

Note
rowToColumnStat is used exclusively when AnalyzeColumnCommand is executed (to compute the statistics for specified columns).
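
A position-based read of a result row could look as follows. This is a sketch under a stated assumption: the row is taken to hold distinctCount, min, max, nullCount, avgLen and maxLen at zero-based positions 0 to 5, with histogram-related data (if any) at position 6. RowStats and RowFields are hypothetical names, and a plain sequence with null for empty fields stands in for InternalRow.

```scala
// Hypothetical model of the statistics read from a result row.
final case class RowStats(
    distinctCount: BigInt,
    min: Option[Any],
    max: Option[Any],
    nullCount: BigInt,
    avgLen: Long,
    maxLen: Long,
    hasHistogramData: Boolean)

object RowFields {
  // ASSUMPTION: the field layout below (positions 0 to 6) is
  // illustrative; an empty field is represented by null.
  def fromRow(row: IndexedSeq[Any]): RowStats = RowStats(
    distinctCount = BigInt(row(0).asInstanceOf[Long]),
    min = Option(row(1)), // None for types without min/max support
    max = Option(row(2)),
    nullCount = BigInt(row(3).asInstanceOf[Long]),
    avgLen = row(4).asInstanceOf[Long],
    maxLen = row(5).asInstanceOf[Long],
    hasHistogramData = row(6) != null)
}
```

For a string column, the min and max fields would be empty, so the model ends up with None for both.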

statExprs Method

statExprs…​FIXME

Note
statExprs is used when…​FIXME