AnalyzeColumnCommand Logical Command for ANALYZE TABLE…COMPUTE STATISTICS FOR COLUMNS SQL Command

AnalyzeColumnCommand is a logical command for the ANALYZE TABLE SQL command with a FOR COLUMNS clause (and no PARTITION specification).
```sql
ANALYZE TABLE tableName COMPUTE STATISTICS FOR COLUMNS columnNames
```
```scala
// Make the example reproducible
val tableName = "t1"
import org.apache.spark.sql.catalyst.TableIdentifier
val tableId = TableIdentifier(tableName)

val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tableId, ignoreIfNotExists = true, purge = true)

val df = Seq((0, 0.0, "zero"), (1, 1.4, "one")).toDF("id", "p1", "p2")
df.write.saveAsTable("t1")

// AnalyzeColumnCommand represents ANALYZE TABLE...FOR COLUMNS SQL command
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS $allCols"
val plan = spark.sql(analyzeTableSQL).queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
val cmd = plan.asInstanceOf[AnalyzeColumnCommand]
scala> println(cmd)
AnalyzeColumnCommand `t1`, [id, p1, p2]

spark.sql(analyzeTableSQL)
val stats = sessionCatalog.getTableMetadata(tableId).stats.get
scala> println(stats.simpleString)
1421 bytes, 2 rows

scala> stats.colStats.map { case (c, ss) => s"$c: $ss" }.foreach(println)
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0.0),Some(1.4),0,8,8,None)
p2: ColumnStat(2,None,None,0,4,4,None)

// Use DESC EXTENDED for friendlier output
scala> sql(s"DESC EXTENDED $tableName id").show
+--------------+----------+
|     info_name|info_value|
+--------------+----------+
|      col_name|        id|
|     data_type|       int|
|       comment|      NULL|
|           min|         0|
|           max|         1|
|     num_nulls|         0|
|distinct_count|         2|
|   avg_col_len|         4|
|   max_col_len|         4|
|     histogram|      NULL|
+--------------+----------+
```
AnalyzeColumnCommand can generate column histograms when the spark.sql.statistics.histogram.enabled configuration property is enabled (it is off by default). AnalyzeColumnCommand supports column histograms for the following data types:
- IntegralType
- DecimalType
- DoubleType
- FloatType
- DateType
- TimestampType
Note: Histograms can provide better estimation accuracy. Spark currently supports equi-height histograms only. Collecting histograms incurs an extra cost: collecting column statistics usually takes only one table scan, but generating an equi-height histogram requires an additional one.
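The property can also be set on a running session before issuing the ANALYZE statement; for example (reusing tableName from the example above):

```scala
// Enable equi-height histograms for subsequent ANALYZE runs
spark.conf.set("spark.sql.statistics.histogram.enabled", true)
// Optionally override the default number of bins (254)
spark.conf.set("spark.sql.statistics.histogram.numBins", 10)
spark.sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS p1")
```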
```scala
// ./bin/spark-shell --conf spark.sql.statistics.histogram.enabled=true

// Use the above example to set up the environment
// Make sure that ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS was run with histogram enabled

// There are 254 bins by default
// Use spark.sql.statistics.histogram.numBins to control the bins
val descExtSQL = s"DESC EXTENDED $tableName p1"
scala> spark.sql(descExtSQL).show(truncate = false)
+--------------+-----------------------------------------------------+
|info_name     |info_value                                           |
+--------------+-----------------------------------------------------+
|col_name      |p1                                                   |
|data_type     |double                                               |
|comment       |NULL                                                 |
|min           |0.0                                                  |
|max           |1.4                                                  |
|num_nulls     |0                                                    |
|distinct_count|2                                                    |
|avg_col_len   |8                                                    |
|max_col_len   |8                                                    |
|histogram     |height: 0.007874015748031496, num_of_bins: 254       |
|bin_0         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_1         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_2         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_3         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_4         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_5         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_6         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_7         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_8         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_9         |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
+--------------+-----------------------------------------------------+
only showing top 20 rows
```
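Besides DESC EXTENDED, the histogram is also reachable programmatically through the table's catalog statistics; a minimal sketch, assuming the tableId from the first example, a histogram-enabled ANALYZE run, and that ColumnStat exposes the histogram as an Option of the Catalyst Histogram class:

```scala
// Inspect the equi-height histogram recorded for column p1
// (assumes tableId from the first example and a histogram-enabled ANALYZE run)
val p1Stats = spark.sessionState.catalog.getTableMetadata(tableId).stats.get.colStats("p1")
p1Stats.histogram.foreach { h =>
  println(s"height: ${h.height}, num_of_bins: ${h.bins.length}")
  // each bin carries a lower bound, an upper bound and a distinct count
  h.bins.take(3).foreach(b => println(s"lo: ${b.lo}, hi: ${b.hi}, ndv: ${b.ndv}"))
}
```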
Note: AnalyzeColumnCommand is described by the analyze labeled alternative in the statement expression in SqlBase.g4 and parsed using SparkSqlAstBuilder.
Note: AnalyzeColumnCommand is not supported on views.
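For example (a sketch; v1 is a throwaway view and the exact error message may differ between Spark versions):

```scala
// ANALYZE ... FOR COLUMNS fails when the target is a view
spark.sql(s"CREATE OR REPLACE VIEW v1 AS SELECT * FROM $tableName")
scala> spark.sql("ANALYZE TABLE v1 COMPUTE STATISTICS FOR COLUMNS id")
org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.;
```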
Executing Logical Command — run Method
```scala
run(sparkSession: SparkSession): Seq[Row]
```
Note: run is part of the RunnableCommand contract to execute (run) a logical command.
run calculates the following statistics:

- sizeInBytes
- stats for each column
Caution: FIXME
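Pending the FIXME above, the effect of run is observable through the session catalog; a small sanity check (reusing tableName, tableId and sessionCatalog from the first example):

```scala
// After execution, run has persisted both kinds of statistics to the catalog
spark.sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS id")
val stats = sessionCatalog.getTableMetadata(tableId).stats.get
assert(stats.sizeInBytes > 0)          // sizeInBytes
assert(stats.colStats.contains("id"))  // stats for the analyzed column
```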
Computing Statistics for Specified Columns — computeColumnStats Internal Method
```scala
computeColumnStats(
  sparkSession: SparkSession,
  tableIdent: TableIdentifier,
  columnNames: Seq[String]): (Long, Map[String, ColumnStat])
```
computeColumnStats…FIXME
Note: computeColumnStats is used exclusively when AnalyzeColumnCommand is executed.
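While the internals are still a FIXME here, the return type hints at the approach: a single scan yields the row count (the Long) and a ColumnStat per column. The sketch below only illustrates that single-scan idea with public DataFrame aggregates; statsInOneScan is a hypothetical helper, not part of Spark:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical illustration of the single-scan idea: one Aggregate over
// the table computes the row count and per-column aggregates together
def statsInOneScan(spark: SparkSession, table: String, cols: Seq[String]): Unit = {
  val aggs = count(lit(1)).as("rowCount") +: cols.flatMap { c =>
    Seq(min(col(c)).as(s"${c}_min"), max(col(c)).as(s"${c}_max"),
        approx_count_distinct(col(c)).as(s"${c}_distinct"))
  }
  spark.table(table).agg(aggs.head, aggs.tail: _*).show()
}

statsInOneScan(spark, "t1", Seq("id", "p1"))
```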
computePercentiles Internal Method
```scala
computePercentiles(
  attributesToAnalyze: Seq[Attribute],
  sparkSession: SparkSession,
  relation: LogicalPlan): AttributeMap[ArrayData]
```
computePercentiles…FIXME
Note: computePercentiles is used exclusively when AnalyzeColumnCommand is executed (and computes column statistics).
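Given the signature, computePercentiles produces a per-attribute array of percentiles, which is what equi-height histogram bin boundaries are derived from. The following is only a self-contained illustration of that idea using the percentile_approx SQL function against the t1 table from the first example, not Spark's implementation:

```scala
// Hypothetical illustration: equi-height bin boundaries for column p1 are
// (approximately) the i/numBins-th percentiles of the column values
val numBins = 4
val percentages = (0 to numBins).map(_.toDouble / numBins).mkString(", ")
spark.sql(s"SELECT percentile_approx(p1, array($percentages)) AS boundaries FROM t1")
  .show(truncate = false)
```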