# CatalogImpl

CatalogImpl is the Catalog in Spark SQL that…FIXME
NOTE: CatalogImpl is in the org.apache.spark.sql.internal package.
## Creating Table — createTable Method

```scala
createTable(
  tableName: String,
  source: String,
  schema: StructType,
  options: Map[String, String]): DataFrame
```
NOTE: createTable is part of the Catalog contract to…FIXME.
createTable…FIXME
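As a usage sketch, createTable is available through the high-level Catalog interface. The table name and schema below are made up for illustration; an active SparkSession named spark is assumed:

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// Creates a managed "people" table (hypothetical name) backed by the json data source
val people = spark.catalog.createTable(
  tableName = "people",
  source = "json",
  schema = new StructType().add("name", StringType),
  options = Map.empty[String, String])
```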
## getTable Method

```scala
getTable(tableName: String): Table
getTable(dbName: String, tableName: String): Table
```
NOTE: getTable is part of the Catalog contract to…FIXME.
getTable…FIXME
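A minimal usage sketch (the table name is hypothetical; an active SparkSession named spark is assumed):

```scala
// Look up a table's metadata through the Catalog interface
val table = spark.catalog.getTable("people")  // hypothetical table name
println(s"${table.name} (${table.tableType}) in database ${table.database}")
```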
## Caching Table or View In-Memory — cacheTable Method

```scala
cacheTable(tableName: String): Unit
```
Internally, cacheTable first creates a DataFrame for the table and then requests CacheManager to cache it.
NOTE: cacheTable uses the session-scoped SharedState to access the CacheManager.
NOTE: cacheTable is part of the Catalog contract.
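A short usage sketch (the temp view name is made up; an active SparkSession named spark is assumed):

```scala
// Register a small temp view and cache it through the Catalog interface
spark.range(10).createOrReplaceTempView("nums")  // hypothetical temp view
spark.catalog.cacheTable("nums")
assert(spark.catalog.isCached("nums"))
```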
## Removing All Cached Tables From In-Memory Cache — clearCache Method

```scala
clearCache(): Unit
```
clearCache requests CacheManager to remove all cached tables from the in-memory cache.
NOTE: clearCache is part of the Catalog contract.
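A short usage sketch (the temp view name is made up; an active SparkSession named spark is assumed):

```scala
// Cache a table, then drop every cached table at once
spark.range(5).createOrReplaceTempView("t")  // hypothetical temp view
spark.catalog.cacheTable("t")
spark.catalog.clearCache()
assert(!spark.catalog.isCached("t"))
```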
## Creating External Table From Path — createExternalTable Method

```scala
createExternalTable(tableName: String, path: String): DataFrame
createExternalTable(tableName: String, path: String, source: String): DataFrame
createExternalTable(
  tableName: String,
  source: String,
  options: Map[String, String]): DataFrame
createExternalTable(
  tableName: String,
  source: String,
  schema: StructType,
  options: Map[String, String]): DataFrame
```
createExternalTable creates an external table tableName from the given path and returns the corresponding DataFrame.
```scala
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

val readmeTable = spark.catalog.createExternalTable("readme", "README.md", "text")
readmeTable: org.apache.spark.sql.DataFrame = [value: string]

scala> spark.catalog.listTables.filter(_.name == "readme").show
+------+--------+-----------+---------+-----------+
|  name|database|description|tableType|isTemporary|
+------+--------+-----------+---------+-----------+
|readme| default|       null| EXTERNAL|      false|
+------+--------+-----------+---------+-----------+

scala> sql("select count(*) as count from readme").show(false)
+-----+
|count|
+-----+
|99   |
+-----+
```
The source input parameter is the name of the data source provider for the table, e.g. parquet, json, text. If not specified, createExternalTable uses the spark.sql.sources.default setting to know the data source format.
NOTE: The source input parameter must not be hive as it leads to an AnalysisException.
createExternalTable sets the mandatory path option when specified explicitly in the input parameter list.
createExternalTable parses tableName into a TableIdentifier (using SparkSqlParser). It creates a CatalogTable and then executes (by toRDD) a CreateTable logical plan. The result DataFrame is a Dataset[Row] with the QueryExecution after executing a SubqueryAlias logical plan and a RowEncoder.
NOTE: createExternalTable is part of the Catalog contract.
## Listing Tables in Database (as Dataset) — listTables Method

```scala
listTables(): Dataset[Table]
listTables(dbName: String): Dataset[Table]
```
NOTE: listTables is part of the Catalog contract to get a list of tables in the specified database.
Internally, listTables requests SessionCatalog to list all tables in the specified dbName database and converts them to Tables.

In the end, listTables creates a Dataset with the tables.
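A short usage sketch (the temp view name is made up; an active SparkSession named spark is assumed):

```scala
// List tables in the current database and inspect selected Table fields
spark.range(3).createOrReplaceTempView("demo")  // hypothetical temp view
spark.catalog.listTables()
  .select("name", "tableType", "isTemporary")
  .show()
```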
## Listing Columns of Table (as Dataset) — listColumns Method

```scala
listColumns(tableName: String): Dataset[Column]
listColumns(dbName: String, tableName: String): Dataset[Column]
```
NOTE: listColumns is part of the Catalog contract to…FIXME.
listColumns requests SessionCatalog for the table metadata.

listColumns takes the schema from the table metadata and creates a Column for every field (with the optional comment as the description).

In the end, listColumns creates a Dataset with the columns.
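A short usage sketch (the temp view name is made up; an active SparkSession named spark is assumed):

```scala
// Inspect the columns of a table through the Catalog interface
spark.range(3).toDF("id").createOrReplaceTempView("demo")  // hypothetical temp view
spark.catalog.listColumns("demo").show()
```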
## Converting TableIdentifier to Table — makeTable Internal Method

```scala
makeTable(tableIdent: TableIdentifier): Table
```
makeTable creates a Table using the input TableIdentifier and the table metadata (from the current SessionCatalog) if available.
NOTE: makeTable uses SparkSession to access SessionState that is then used to access SessionCatalog.
NOTE: makeTable is used when CatalogImpl is requested to listTables or getTable.
## Creating Dataset from DefinedByConstructorParams Data — makeDataset Method

```scala
makeDataset[T <: DefinedByConstructorParams](
  data: Seq[T],
  sparkSession: SparkSession): Dataset[T]
```
makeDataset creates an ExpressionEncoder (from DefinedByConstructorParams) and encodes elements of the input data to internal binary rows.

makeDataset then creates a LocalRelation logical operator. makeDataset requests SessionState to execute the plan and creates the result Dataset.
NOTE: makeDataset is used when CatalogImpl is requested to list databases, tables, functions and columns.
## Refreshing Analyzed Logical Plan of Table Query and Re-Caching It — refreshTable Method

```scala
refreshTable(tableName: String): Unit
```
NOTE: refreshTable is part of the Catalog contract to…FIXME.
refreshTable requests SessionState for the SQL parser to parse a TableIdentifier given the table name.
NOTE: refreshTable uses SparkSession to access the SessionState.
refreshTable requests SessionCatalog for the table metadata. refreshTable then creates a DataFrame for the table name.

For a temporary or persistent VIEW table, refreshTable requests the analyzed logical plan of the DataFrame (for the table) to refresh itself.

For other types of table, refreshTable requests SessionCatalog for refreshing the table metadata (i.e. invalidating the table).

If the table has been cached, refreshTable requests CacheManager to uncache and cache the table DataFrame again.
NOTE: refreshTable uses SparkSession to access the SharedState that is used to access CacheManager.
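A minimal usage sketch (the table name is hypothetical; an active SparkSession named spark is assumed):

```scala
// After the table's underlying files change outside Spark,
// refresh its cached metadata (and re-cache it if it was cached)
spark.catalog.refreshTable("people")  // hypothetical table name
```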
## refreshByPath Method

```scala
refreshByPath(resourcePath: String): Unit
```
NOTE: refreshByPath is part of the Catalog contract to…FIXME.
refreshByPath…FIXME
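A minimal usage sketch (the path is hypothetical; an active SparkSession named spark is assumed). refreshByPath invalidates and refreshes cached data (and the associated metadata) for any Dataset that contains the given data source path:

```scala
// Refresh everything cached under a data source path
spark.catalog.refreshByPath("/tmp/people")  // hypothetical path
```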
## listColumns Internal Method

```scala
listColumns(tableIdentifier: TableIdentifier): Dataset[Column]
```
listColumns…FIXME
NOTE: listColumns is used exclusively when CatalogImpl is requested to listColumns.