关注 spark技术分享,
撸spark源码 玩spark最佳实践

CatalogImpl

CatalogImpl

CatalogImpl is the Catalog in Spark SQL that…​FIXME

spark sql CatalogImpl.png
Figure 1. CatalogImpl uses SessionCatalog (through SparkSession)
Note
CatalogImpl is in org.apache.spark.sql.internal package.

Creating Table — createTable Method

Note
createTable is part of Catalog Contract to…​FIXME.

createTable…​FIXME

getTable Method

Note
getTable is part of Catalog Contract to…​FIXME.

getTable…​FIXME

functionExists Method

Caution
FIXME

Caching Table or View In-Memory — cacheTable Method

Internally, cacheTable first creates a DataFrame for the table followed by requesting CacheManager to cache it.

Note
cacheTable uses the session-scoped SharedState to access the CacheManager.
Note
cacheTable is part of Catalog contract.

Removing All Cached Tables From In-Memory Cache — clearCache Method

clearCache requests CacheManager to remove all cached tables from in-memory cache.

Note
clearCache is part of Catalog contract.

Creating External Table From Path — createExternalTable Method

createExternalTable creates an external table tableName from the given path and returns the corresponding DataFrame.

The source input parameter is the name of the data source provider for the table, e.g. parquet, json, text. If not specified, createExternalTable uses spark.sql.sources.default setting to know the data source format.

Note
source input parameter must not be hive as it leads to a AnalysisException.

createExternalTable sets the mandatory path option when specified explicitly in the input parameter list.

createExternalTable parses tableName into TableIdentifier (using SparkSqlParser). It creates a CatalogTable and then executes (by toRDD) a CreateTable logical plan. The result DataFrame is a Dataset[Row] with the QueryExecution after executing SubqueryAlias logical plan and RowEncoder.

spark sql CatalogImpl createExternalTable.png
Figure 2. CatalogImpl.createExternalTable
Note
createExternalTable is part of Catalog contract.

Listing Tables in Database (as Dataset) — listTables Method

Note
listTables is part of Catalog Contract to get a list of tables in the specified database.

Internally, listTables requests SessionCatalog to list all tables in the specified dbName database and converts them to Tables.

In the end, listTables creates a Dataset with the tables.

Listing Columns of Table (as Dataset) — listColumns Method

Note
listColumns is part of Catalog Contract to…​FIXME.

listColumns requests SessionCatalog for the table metadata.

listColumns takes the schema from the table metadata and creates a Column for every field (with the optional comment as the description).

In the end, listColumns creates a Dataset with the columns.

Converting TableIdentifier to Table — makeTable Internal Method

makeTable creates a Table using the input TableIdentifier and the table metadata (from the current SessionCatalog) if available.

Note
makeTable uses SparkSession to access SessionState that is then used to access SessionCatalog.
Note
makeTable is used when CatalogImpl is requested to listTables or getTable.

Creating Dataset from DefinedByConstructorParams Data — makeDataset Method

makeDataset creates an ExpressionEncoder (from DefinedByConstructorParams) and encodes elements of the input data to internal binary rows.

makeDataset then creates a LocalRelation logical operator. makeDataset requests SessionState to execute the plan and creates the result Dataset.

Note
makeDataset is used when CatalogImpl is requested to list databases, tables, functions and columns

Refreshing Analyzed Logical Plan of Table Query and Re-Caching It — refreshTable Method

Note
refreshTable is part of Catalog Contract to…​FIXME.

refreshTable requests SessionState for the SQL parser to parse a TableIdentifier given the table name.

Note
refreshTable uses SparkSession to access the SessionState.

refreshTable requests SessionCatalog for the table metadata.

For a temporary or persistent VIEW table, refreshTable requests the analyzed logical plan of the DataFrame (for the table) to refresh itself.

For other types of table, refreshTable requests SessionCatalog for refreshing the table metadata (i.e. invalidating the table).

If the table has been cached, refreshTable requests CacheManager to uncache and cache the table DataFrame again.

Note
refreshTable uses SparkSession to access the SharedState that is used to access CacheManager.

refreshByPath Method

Note
refreshByPath is part of Catalog Contract to…​FIXME.

refreshByPath…​FIXME

listColumns Internal Method

listColumns…​FIXME

Note
listColumns is used exclusively when CatalogImpl is requested to listColumns.
赞(0) 打赏
未经允许不得转载:spark技术分享 » CatalogImpl
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏