BucketSpec — Bucketing Specification of a Table
BucketSpec is the bucketing specification of a table, i.e. the metadata describing how a table is bucketed.
BucketSpec includes the following: the number of buckets, the bucket column names, and the (optional) sort column names.
The number of buckets has to be between 0 and 100000 (exclusive), or an AnalysisException is thrown.
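The validation above can be sketched in plain Scala. Note that `SimpleBucketSpec` is a hypothetical stand-in used for illustration, not Spark's actual class (which reads the upper bound from configuration and throws AnalysisException):

```scala
// A minimal sketch (not Spark's source) of the bucket-count validation
// that BucketSpec performs at construction time.
case class SimpleBucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  // The real BucketSpec throws AnalysisException; require is used here
  // to keep the sketch self-contained.
  require(numBuckets > 0 && numBuckets < 100000,
    s"Number of buckets should be greater than 0 but less than 100000; got `$numBuckets`")
}
```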
BucketSpec is created when:

- `DataFrameWriter` is requested to saveAsTable (and does getBucketSpec)
- `HiveExternalCatalog` is requested to getBucketSpecFromTableProperties and tableMetaToTableProps
- `HiveClientImpl` is requested to retrieve a table metadata
- `SparkSqlAstBuilder` is requested to visitBucketSpec (for a `CREATE TABLE` SQL statement with `CLUSTERED BY` and `INTO n BUCKETS` with optional `SORTED BY` clauses)
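As an illustration of the last case, a `CREATE TABLE` statement of the following shape (the table and column names are hypothetical) carries the clauses that visitBucketSpec parses into a BucketSpec:

```sql
-- The CLUSTERED BY ... INTO n BUCKETS clause (with an optional SORTED BY)
-- is what SparkSqlAstBuilder turns into a BucketSpec.
CREATE TABLE bucketed_table (col1 INT, col2 STRING)
USING parquet
CLUSTERED BY (col1)
SORTED BY (col2)
INTO 8 BUCKETS
```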
BucketSpec uses the following text representation (i.e. toString):

```text
[numBuckets] buckets, bucket columns: [[bucketColumnNames]], sort columns: [[sortColumnNames]]
```

```scala
import org.apache.spark.sql.catalyst.catalog.BucketSpec

val bucketSpec = BucketSpec(
  numBuckets = 8,
  bucketColumnNames = Seq("col1"),
  sortColumnNames = Seq("col2"))

scala> println(bucketSpec)
8 buckets, bucket columns: [col1], sort columns: [col2]
```
Converting Bucketing Specification to LinkedHashMap — toLinkedHashMap Method
```scala
toLinkedHashMap: mutable.LinkedHashMap[String, String]
```
toLinkedHashMap converts the bucketing specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:
- Num Buckets with the numBuckets
- Bucket Columns with the bucketColumnNames
- Sort Columns with the sortColumnNames
toLinkedHashMap quotes the column names.
```scala
scala> println(bucketSpec.toLinkedHashMap)
Map(Num Buckets -> 8, Bucket Columns -> [`col1`], Sort Columns -> [`col2`])
```
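The behavior above can be sketched in plain Scala. This is an illustrative reimplementation that mirrors the documented output, not Spark's source:

```scala
import scala.collection.mutable

// A sketch of how toLinkedHashMap could assemble its ordered
// key-value pairs, preserving insertion order via LinkedHashMap.
def toLinkedHashMap(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]): mutable.LinkedHashMap[String, String] = {
  // Column names are backtick-quoted, as in the example output above.
  def quote(names: Seq[String]): String =
    names.map(n => s"`$n`").mkString("[", ", ", "]")
  mutable.LinkedHashMap(
    "Num Buckets" -> numBuckets.toString,
    "Bucket Columns" -> quote(bucketColumnNames),
    "Sort Columns" -> quote(sortColumnNames))
}
```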