AnalyzeColumnCommand Logical Command for ANALYZE TABLE…COMPUTE STATISTICS FOR COLUMNS SQL Command

AnalyzeColumnCommand is a logical command for the ANALYZE TABLE SQL command with a FOR COLUMNS clause (and no PARTITION specification).
```sql
ANALYZE TABLE tableName COMPUTE STATISTICS FOR COLUMNS columnNames
```
```scala
// Make the example reproducible
val tableName = "t1"
import org.apache.spark.sql.catalyst.TableIdentifier
val tableId = TableIdentifier(tableName)

val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tableId, ignoreIfNotExists = true, purge = true)

val df = Seq((0, 0.0, "zero"), (1, 1.4, "one")).toDF("id", "p1", "p2")
df.write.saveAsTable("t1")

// AnalyzeColumnCommand represents ANALYZE TABLE...FOR COLUMNS SQL command
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS $allCols"
val plan = spark.sql(analyzeTableSQL).queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
val cmd = plan.asInstanceOf[AnalyzeColumnCommand]
scala> println(cmd)
AnalyzeColumnCommand `t1`, [id, p1, p2]

spark.sql(analyzeTableSQL)
val stats = sessionCatalog.getTableMetadata(tableId).stats.get
scala> println(stats.simpleString)
1421 bytes, 2 rows

scala> stats.colStats.map { case (c, ss) => s"$c: $ss" }.foreach(println)
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0.0),Some(1.4),0,8,8,None)
p2: ColumnStat(2,None,None,0,4,4,None)

// Use DESC EXTENDED for friendlier output
scala> sql(s"DESC EXTENDED $tableName id").show
+--------------+----------+
|     info_name|info_value|
+--------------+----------+
|      col_name|        id|
|     data_type|       int|
|       comment|      NULL|
|           min|         0|
|           max|         1|
|     num_nulls|         0|
|distinct_count|         2|
|   avg_col_len|         4|
|   max_col_len|         4|
|     histogram|      NULL|
+--------------+----------+
```
AnalyzeColumnCommand can generate column histograms when the spark.sql.statistics.histogram.enabled configuration property is enabled (it is off by default). AnalyzeColumnCommand supports column histograms for the following data types:
- IntegralType
- DecimalType
- DoubleType
- FloatType
- DateType
- TimestampType
Note: Histograms can provide better estimation accuracy. Spark currently supports equi-height histograms only. Collecting histograms incurs an extra cost: collecting column statistics usually takes only one table scan, but generating an equi-height histogram requires an additional one.
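The property can also be set on a running session before issuing the ANALYZE statement; for example (reusing tableName from the example above):

```scala
// Enable equi-height histograms for subsequent ANALYZE runs
spark.conf.set("spark.sql.statistics.histogram.enabled", true)
// Optionally override the default number of bins (254)
spark.conf.set("spark.sql.statistics.histogram.numBins", 10)
spark.sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS p1")
```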
```scala
// ./bin/spark-shell --conf spark.sql.statistics.histogram.enabled=true

// Use the above example to set up the environment
// Make sure that ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS was run with histogram enabled

// There are 254 bins by default
// Use spark.sql.statistics.histogram.numBins to control the bins
val descExtSQL = s"DESC EXTENDED $tableName p1"
scala> spark.sql(descExtSQL).show(truncate = false)
+--------------+-----------------------------------------------------+
|info_name     |info_value                                           |
+--------------+-----------------------------------------------------+
|col_name      |p1                                                   |
|data_type     |double                                               |
|comment       |NULL                                                 |
|min           |0.0                                                  |
|max           |1.4                                                  |
|num_nulls     |0                                                    |
|distinct_count|2                                                    |
|avg_col_len   |8                                                    |
|max_col_len   |8                                                    |
|histogram     |height: 0.007874015748031496, num_of_bins: 254       |
|bin_0         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_1         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_2         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_3         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_4         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_5         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_6         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_7         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_8         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_9         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
+--------------+-----------------------------------------------------+
only showing top 20 rows
```
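Besides DESC EXTENDED, the histogram is also reachable programmatically through the table's catalog statistics; a minimal sketch, assuming the tableId from the first example, a histogram-enabled ANALYZE run, and that ColumnStat exposes the histogram as an Option of the Catalyst Histogram class:

```scala
// Inspect the equi-height histogram recorded for column p1
// (assumes tableId from the first example and a histogram-enabled ANALYZE run)
val p1Stats = spark.sessionState.catalog.getTableMetadata(tableId).stats.get.colStats("p1")
p1Stats.histogram.foreach { h =>
  println(s"height: ${h.height}, num_of_bins: ${h.bins.length}")
  // each bin carries a lower bound, an upper bound and a distinct count
  h.bins.take(3).foreach(b => println(s"lo: ${b.lo}, hi: ${b.hi}, ndv: ${b.ndv}"))
}
```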
Note: AnalyzeColumnCommand is described by the analyze labeled alternative in the statement expression in SqlBase.g4 and parsed using SparkSqlAstBuilder.
Note: AnalyzeColumnCommand is not supported on views.
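For example (a sketch; v1 is a throwaway view and the exact error message may differ between Spark versions):

```scala
// ANALYZE ... FOR COLUMNS fails when the target is a view
spark.sql(s"CREATE OR REPLACE VIEW v1 AS SELECT * FROM $tableName")
scala> spark.sql("ANALYZE TABLE v1 COMPUTE STATISTICS FOR COLUMNS id")
org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.;
```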
Executing Logical Command — run Method
```scala
run(sparkSession: SparkSession): Seq[Row]
```
Note: run is part of the RunnableCommand contract to execute (run) a logical command.
run calculates the following statistics:

- sizeInBytes
- stats for each column
Caution: FIXME
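Pending the FIXME above, the effect of run is observable through the session catalog; a small sanity check (reusing tableName, tableId and sessionCatalog from the first example):

```scala
// After execution, run has persisted both kinds of statistics to the catalog
spark.sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS id")
val stats = sessionCatalog.getTableMetadata(tableId).stats.get
assert(stats.sizeInBytes > 0)          // sizeInBytes
assert(stats.colStats.contains("id"))  // stats for the analyzed column
```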
Computing Statistics for Specified Columns — computeColumnStats Internal Method
```scala
computeColumnStats(
  sparkSession: SparkSession,
  tableIdent: TableIdentifier,
  columnNames: Seq[String]): (Long, Map[String, ColumnStat])
```
computeColumnStats…FIXME
Note: computeColumnStats is used exclusively when AnalyzeColumnCommand is executed.
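While the internals are still a FIXME here, the return type hints at the approach: a single scan yields the row count (the Long) and a ColumnStat per column. The sketch below only illustrates that single-scan idea with public DataFrame aggregates; statsInOneScan is a hypothetical helper, not part of Spark:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical illustration of the single-scan idea: one Aggregate over
// the table computes the row count and per-column aggregates together
def statsInOneScan(spark: SparkSession, table: String, cols: Seq[String]): Unit = {
  val aggs = count(lit(1)).as("rowCount") +: cols.flatMap { c =>
    Seq(min(col(c)).as(s"${c}_min"), max(col(c)).as(s"${c}_max"),
        approx_count_distinct(col(c)).as(s"${c}_distinct"))
  }
  spark.table(table).agg(aggs.head, aggs.tail: _*).show()
}

statsInOneScan(spark, "t1", Seq("id", "p1"))
```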
computePercentiles Internal Method
```scala
computePercentiles(
  attributesToAnalyze: Seq[Attribute],
  sparkSession: SparkSession,
  relation: LogicalPlan): AttributeMap[ArrayData]
```
computePercentiles…FIXME
Note: computePercentiles is used exclusively when AnalyzeColumnCommand is executed (and computes column statistics).
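Given the signature, computePercentiles produces a per-attribute array of percentiles, which is what equi-height histogram bin boundaries are derived from. The following is only a self-contained illustration of that idea using the percentile_approx SQL function against the t1 table from the first example, not Spark's implementation:

```scala
// Hypothetical illustration: equi-height bin boundaries for column p1 are
// (approximately) the i/numBins-th percentiles of the column values
val numBins = 4
val percentages = (0 to numBins).map(_.toDouble / numBins).mkString(", ")
spark.sql(s"SELECT percentile_approx(p1, array($percentages)) AS boundaries FROM t1")
  .show(truncate = false)
```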