CatalogTable — Table Specification (Native Table Metadata)

CatalogTable is the table specification, i.e. the metadata of a table that is stored in a session-scoped catalog of relational entities (i.e. SessionCatalog).

CatalogTable is created when:

The readable text representation of a CatalogTable (aka simpleString) is…​FIXME

Note
simpleString is used exclusively when ShowTablesCommand logical command is executed (with a partition specification).

CatalogTable uses the following text representation (i.e. toString)…​FIXME

CatalogTable is created with the optional bucketing specification that is used for the following:

Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)

You manage table metadata using the catalog interface (aka metastore). One of the management tasks is getting the statistics of a table, which are used for cost-based query optimization.

Note
CatalogStatistics are optional when a CatalogTable is created.

Caution
FIXME When are stats specified? What if there are none?

Unless CatalogStatistics are available in the table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size given by the spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue). That size is used for query planning of joins, and possibly to decide whether to auto broadcast the table.
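The effect of the default size on broadcasting can be sketched in plain Scala. This is a simplified model and not Spark's code: the object and method names are made up for illustration, and the 10 MB threshold assumes the default value of spark.sql.autoBroadcastJoinThreshold.

```scala
// Simplified model: stand-ins for illustration, NOT Spark's classes.
object DefaultSizeSketch {
  // Default of spark.sql.defaultSizeInBytes, as stated above.
  val DefaultSizeInBytes: Long = Long.MaxValue
  // Assumption: default of spark.sql.autoBroadcastJoinThreshold (10 MB).
  val AutoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  // Size used for planning: the recorded statistic if present, else the default.
  def planningSize(statsSizeInBytes: Option[BigInt]): BigInt =
    statsSizeInBytes.getOrElse(BigInt(DefaultSizeInBytes))

  // A relation is a broadcast candidate only if its size fits under the threshold.
  def canBroadcast(statsSizeInBytes: Option[BigInt]): Boolean = {
    val size = planningSize(statsSizeInBytes)
    size >= 0 && size <= AutoBroadcastJoinThreshold
  }

  def main(args: Array[String]): Unit = {
    println(canBroadcast(Some(BigInt(4096)))) // statistics recorded, small table
    println(canBroadcast(None))               // no statistics, default size applies
  }
}
```

Without statistics the assumed size is Long.MaxValue, which can never fit under the threshold, so the table is not considered for broadcasting.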

Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.

Unless CatalogStatistics are available in the table metadata (in a catalog) for a HiveTableRelation (and the hive provider), the DetermineTableStats logical resolution rule can compute the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).
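The order of fallbacks described above can be written down as a small function. This is a sketch under stated assumptions, not the DetermineTableStats implementation: the names are hypothetical, the file-system lookup is passed in as a function, and error handling is left out.

```scala
// Simplified model of the fallback order; NOT the actual DetermineTableStats rule.
object HiveSizeFallbackSketch {
  // Default of spark.sql.defaultSizeInBytes, as stated above.
  val DefaultSizeInBytes: BigInt = BigInt(Long.MaxValue)

  def tableSize(
      catalogStats: Option[BigInt],      // size recorded in the catalog, if any
      fallBackToHdfs: Boolean,           // spark.sql.statistics.fallBackToHdfs
      sizeOnFileSystem: () => Long       // hypothetical lookup of the table's size on HDFS
  ): BigInt =
    catalogStats.getOrElse {
      // The file system is consulted only when there are no catalog statistics.
      if (fallBackToHdfs) BigInt(sizeOnFileSystem()) else DefaultSizeInBytes
    }

  def main(args: Array[String]): Unit = {
    println(tableSize(Some(BigInt(2048)), fallBackToHdfs = true, () => 999L))
    println(tableSize(None, fallBackToHdfs = true, () => 999L))
    println(tableSize(None, fallBackToHdfs = false, () => 999L))
  }
}
```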

You can use AnalyzeColumnCommand, AnalyzePartitionCommand, AnalyzeTableCommand commands to record statistics in a catalog.
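As an illustration, the commands are usually triggered through ANALYZE TABLE SQL statements. The sketch below assumes spark-sql on the classpath and a local session; the table name t1 and the object name are made up, and sessionState is an unstable developer API that may change between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

object AnalyzeTableSketch {
  // Writes a small table, analyzes it, and returns the row count recorded in the catalog.
  def recordedRowCount(spark: SparkSession, table: String): Option[BigInt] = {
    spark.range(100).write.mode("overwrite").saveAsTable(table)

    // Table-level statistics (size in bytes and row count).
    spark.sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
    // Column-level statistics for the id column.
    spark.sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS id")

    // Read the table metadata back from the session catalog.
    val metadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    metadata.stats.flatMap(_.rowCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analyze-table-sketch")
      .getOrCreate()
    try println(recordedRowCount(spark, "t1"))
    finally spark.stop()
  }
}
```

The stats field of the returned CatalogTable is an Option, which reflects that statistics are optional in the table metadata.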

The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when spark.sql.statistics.size.autoUpdate.enabled property is turned on.

You can use the DESCRIBE EXTENDED SQL command on a column to show the histogram of that column, if one is stored in a catalog.

dataSchema Method

dataSchema…​FIXME

Note
dataSchema is used when…​FIXME

partitionSchema Method

partitionSchema…​FIXME

Note
partitionSchema is used when…​FIXME

Converting Table Specification to LinkedHashMap — toLinkedHashMap Method

toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

Note

toLinkedHashMap is used when:

Creating CatalogTable Instance

CatalogTable takes the following when created:

  • TableIdentifier

  • CatalogTableType (i.e. EXTERNAL, MANAGED or VIEW)

  • CatalogStorageFormat

  • Schema

  • Name of the table provider (optional)

  • Partition column names

  • Optional Bucketing specification (default: None)

  • Owner

  • Create time

  • Last access time

  • Create version

  • Properties

  • Optional table statistics

  • Optional view text

  • Optional comment

  • Unsupported features

  • tracksPartitionsInCatalog flag

  • schemaPreservesCase flag

  • Ignored properties
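A minimal sketch of creating an instance by hand, assuming spark-catalyst on the classpath. The table name, database and schema are made up for illustration; only the identifier, table type, storage format and schema are given, and the remaining arguments are left at their defaults.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object CreateCatalogTableSketch {
  // Only the first four arguments are given; the rest fall back to their defaults.
  val table: CatalogTable = CatalogTable(
    identifier = TableIdentifier("t1", Some("default")),
    tableType = CatalogTableType.MANAGED,
    storage = CatalogStorageFormat.empty,
    schema = new StructType().add("id", LongType).add("name", StringType))

  def main(args: Array[String]): Unit = {
    println(table.identifier)
    println(table.bucketSpec) // optional bucketing specification
    println(table.stats)      // optional table statistics
  }
}
```

In regular use you rarely build a CatalogTable yourself; it is typically produced by the catalog when a table is created or looked up.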

database Method

database simply returns the database (of the TableIdentifier) or throws an AnalysisException if the database is not defined.

Note
database is used when…​FIXME
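The two outcomes of database can be observed with a pair of table identifiers, one with a database and one without. This is a sketch that assumes spark-catalyst on the classpath; the helper names and the empty schema are made up for illustration.

```scala
import scala.util.Try
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.StructType

object DatabaseSketch {
  // Hypothetical helper: a bare-bones CatalogTable for the given identifier.
  private def tableFor(id: TableIdentifier): CatalogTable =
    CatalogTable(id, CatalogTableType.MANAGED, CatalogStorageFormat.empty, new StructType())

  // Wraps the call in Try so that the exception becomes a Failure value.
  def databaseOf(id: TableIdentifier): Try[String] = Try(tableFor(id).database)

  def main(args: Array[String]): Unit = {
    println(databaseOf(TableIdentifier("t1", Some("default")))) // database is defined
    println(databaseOf(TableIdentifier("t1")))                  // no database given
  }
}
```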