
GlobalTempViewManager — Management Interface of Global Temporary Views


GlobalTempViewManager is the interface to manage global temporary views (that SessionCatalog uses when requested to create, alter or drop global temporary views).

Strictly speaking, GlobalTempViewManager simply manages the names of the global temporary views registered (and the corresponding logical plans) and has no interaction with other services in Spark SQL.

GlobalTempViewManager is available as globalTempViewManager property of a SharedState.
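The following is a quick user-level sketch (assuming a running SparkSession available as spark; the view name is arbitrary) of how global temporary views end up in GlobalTempViewManager through the public Dataset and SparkSession APIs.

// Creating a global temporary view goes through SessionCatalog and, in turn,
// registers the view's logical plan in GlobalTempViewManager.
val df = spark.range(5)
df.createOrReplaceGlobalTempView("ids")

// Global temporary views live in the global temp database
// (spark.sql.globalTempDatabase, global_temp by default).
spark.sql("SELECT * FROM global_temp.ids").show()

// Dropping the view de-registers it from GlobalTempViewManager.
spark.catalog.dropGlobalTempView("ids")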

Figure 1. GlobalTempViewManager and SparkSession

Table 1. GlobalTempViewManager API
Method Description

clear

create

get

listViewNames

remove

rename

update

GlobalTempViewManager is created exclusively when SharedState is requested for one (for the very first time only as it is cached).

GlobalTempViewManager takes the name of the database when created.

Figure 2. Creating GlobalTempViewManager
Table 2. GlobalTempViewManager’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

viewDefinitions

Registry of global temporary view definitions as logical plans per view name.

clear Method

clear simply removes all the entries in the viewDefinitions internal registry.

Note
clear is used when SessionCatalog is requested to reset (that happens to be exclusively in the Spark SQL internal tests).

Creating (Registering) Global Temporary View (Definition) — create Method

create simply registers (adds) the input LogicalPlan under the input name.

create throws an AnalysisException when the input overrideIfExists flag is off and the viewDefinitions internal registry contains the input name.

Note
create is used when SessionCatalog is requested to createGlobalTempView (when CreateViewCommand and CreateTempViewUsing logical commands are executed).

Retrieving Global View Definition Per Name — get Method

get simply returns the LogicalPlan that was registered under the name, if defined.

Note
get is used when SessionCatalog is requested to getGlobalTempView, getTempViewOrPermanentTableMetadata, lookupRelation, isTemporaryTable or refreshTable.

Listing Global Temporary Views For Pattern — listViewNames Method

listViewNames simply gives a list of the global temporary views with names matching the input pattern.

Note
listViewNames is used exclusively when SessionCatalog is requested to listTables

Removing (De-Registering) Global Temporary View — remove Method

remove simply tries to remove the name from the viewDefinitions internal registry and returns true when removed or false otherwise.

Note
remove is used when SessionCatalog is requested to drop a global temporary view or table.

rename Method

rename…​FIXME

Note
rename is used when…​FIXME

update Method

update…​FIXME

Note
update is used exclusively when SessionCatalog is requested to alter a global temporary view.

FunctionRegistry — Contract for Function Registries (Catalogs)


FunctionRegistry is the contract of function registries (catalogs) of native and user-defined functions.

Table 1. FunctionRegistry Contract
Property Description

clear

Used exclusively when SessionCatalog is requested to reset

dropFunction

Used when…​FIXME

listFunction

Used when…​FIXME

lookupFunction

Used when:

lookupFunctionBuilder

Used when…​FIXME

registerFunction

Used when:

Note
The one and only FunctionRegistry available in Spark SQL is SimpleFunctionRegistry.

FunctionRegistry is available through functionRegistry property of a SessionState (that is available as sessionState property of a SparkSession).

Note
You can register a new user-defined function using UDFRegistration.
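For example (an illustrative sketch, assuming a running SparkSession available as spark), a function registered through UDFRegistration ends up in the session-scoped FunctionRegistry and becomes resolvable in SQL statements.

// spark.udf is the UDFRegistration that eventually registers the function
// in the session's FunctionRegistry.
spark.udf.register("plusOne", (n: Int) => n + 1)

spark.sql("SELECT plusOne(41)").show()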
Table 2. FunctionRegistry’s Attributes
Name Description

builtin

SimpleFunctionRegistry with the built-in functions registered.

FunctionRegistry manages a registry of function expressions, mapping Catalyst expressions to the corresponding built-in/native SQL functions (that can be used in SQL statements); see the example after the table below.

Table 3. (Subset of) FunctionRegistry’s Catalyst Expression to SQL Function Mapping
Catalyst Expression | SQL Function
CumeDist | cume_dist
IfNull | ifnull
Left | left
MonotonicallyIncreasingID | monotonically_increasing_id
NullIf | nullif
Nvl | nvl
Nvl2 | nvl2
ParseToDate | to_date
ParseToTimestamp | to_timestamp
Right | right
CreateNamedStruct | struct
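As a quick illustration (a sketch, assuming a running SparkSession available as spark), the SQL functions above can be used directly in SQL statements and are resolved through FunctionRegistry to the corresponding Catalyst expressions.

// nvl, ifnull and monotonically_increasing_id resolve to the Nvl, IfNull and
// MonotonicallyIncreasingID Catalyst expressions, respectively.
spark.sql("""
  SELECT
    nvl(NULL, 'fallback')         AS nvl_result,
    ifnull(NULL, 'fallback')      AS ifnull_result,
    monotonically_increasing_id() AS id
""").show()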

expression Internal Method

expression…​FIXME

Note
expression is used when…​FIXME

SimpleFunctionRegistry

SimpleFunctionRegistry is the default FunctionRegistry that is backed by a hash map (with optional case sensitivity).

createOrReplaceTempFunction Final Method

createOrReplaceTempFunction…​FIXME

Note
createOrReplaceTempFunction is used exclusively when UDFRegistration is requested to register a user-defined function, user-defined aggregate function, user-defined function (as UserDefinedFunction) or registerPython.

functionExists Method

functionExists…​FIXME

Note
functionExists is used when…​FIXME

HiveExternalCatalog — Hive-Aware Metastore of Permanent Relational Entities


HiveExternalCatalog is an external catalog of permanent relational entities (aka metastore) that is used when a SparkSession is created with Hive support enabled.

Figure 1. HiveExternalCatalog and SharedState

HiveExternalCatalog is created exclusively when SharedState is requested for the ExternalCatalog for the first time (and spark.sql.catalogImplementation internal configuration property is hive).

Note
The Hadoop configuration used to create a HiveExternalCatalog is the default Hadoop configuration from Spark Core's SparkContext.hadoopConfiguration (which includes the Spark properties with the spark.hadoop prefix).

HiveExternalCatalog uses the internal HiveClient to retrieve metadata from a Hive metastore.

Note

spark.sql.catalogImplementation configuration property is in-memory by default.

Use Builder.enableHiveSupport to enable Hive support (that sets spark.sql.catalogImplementation internal configuration property to hive when the Hive classes are available).
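For example (a sketch that assumes the Hive classes are on the classpath; the application name is arbitrary):

import org.apache.spark.sql.SparkSession

// Enabling Hive support sets spark.sql.catalogImplementation to hive,
// which makes SharedState create a HiveExternalCatalog.
val spark = SparkSession.builder()
  .appName("hive-catalog-demo")
  .enableHiveSupport()
  .getOrCreate()

spark.conf.get("spark.sql.catalogImplementation")  // hive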

Tip

Use spark.sql.warehouse.dir Spark property to change the location of Hive’s hive.metastore.warehouse.dir property, i.e. the location of the Hive local/embedded metastore database (using Derby).

Refer to SharedState to learn about (the low-level details of) Spark SQL support for Apache Hive.

See also the official Hive Metastore Administration document.

Table 1. HiveExternalCatalog’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

client

HiveClient for retrieving metadata from a Hive metastore

Created by requesting HiveUtils for a new HiveClientImpl (with the current SparkConf and Hadoop Configuration)

getRawTable Method

getRawTable…​FIXME

Note
getRawTable is used when…​FIXME

doAlterTableStats Method

Note
doAlterTableStats is part of ExternalCatalog Contract to alter the statistics of a table.

doAlterTableStats…​FIXME

Converting Table Statistics to Properties — statsToProperties Internal Method

statsToProperties converts the table statistics to properties (i.e. key-value pairs that will be persisted as properties in the table metadata to a Hive metastore using the Hive client).

statsToProperties adds the following properties to the properties:

statsToProperties takes the column statistics and for every column (field) in schema converts the column statistics to properties and adds the properties (as column statistic property) to the properties.

Note

statsToProperties is used when HiveExternalCatalog is requested for:

Restoring Table Statistics from Properties (from Hive Metastore) — statsFromProperties Internal Method

statsFromProperties collects statistics-related properties, i.e. the properties with their keys with spark.sql.statistics prefix.

statsFromProperties returns None if there are no keys with the spark.sql.statistics prefix in properties.

If there are keys with spark.sql.statistics prefix, statsFromProperties creates a ColumnStat that is the column statistics for every column in schema.

For every column name in schema, statsFromProperties collects all the keys that start with the spark.sql.statistics.colStats.[name] prefix (after checking that the spark.sql.statistics.colStats.[name].version key exists, which marks that column statistics are available in the statistics properties) and converts them to a ColumnStat (for the column name).

In the end, statsFromProperties creates a CatalogStatistics with the following properties:

  • sizeInBytes as spark.sql.statistics.totalSize property

  • rowCount as spark.sql.statistics.numRows property

  • colStats as the collection of the column names and their ColumnStat (calculated above)

Note
statsFromProperties is used when HiveExternalCatalog is requested for restoring table and partition metadata.

listPartitionsByFilter Method

Note
listPartitionsByFilter is part of ExternalCatalog Contract to…​FIXME.

listPartitionsByFilter…​FIXME

alterPartitions Method

Note
alterPartitions is part of ExternalCatalog Contract to…​FIXME.

alterPartitions…​FIXME

getTable Method

Note
getTable is part of ExternalCatalog Contract to…​FIXME.

getTable…​FIXME

doAlterTable Method

Note
doAlterTable is part of ExternalCatalog Contract to alter a table.

doAlterTable…​FIXME

restorePartitionMetadata Internal Method

restorePartitionMetadata…​FIXME

Note

restorePartitionMetadata is used when HiveExternalCatalog is requested for:

getPartition Method

Note
getPartition is part of ExternalCatalog Contract to…​FIXME.

getPartition…​FIXME

getPartitionOption Method

Note
getPartitionOption is part of ExternalCatalog Contract to…​FIXME.

getPartitionOption…​FIXME

Creating HiveExternalCatalog Instance

HiveExternalCatalog takes the following when created:

Building Property Name for Column and Statistic Key — columnStatKeyPropName Internal Method

columnStatKeyPropName builds a property name of the form spark.sql.statistics.colStats.[columnName].[statKey] for the input columnName and statKey.

Note
columnStatKeyPropName is used when HiveExternalCatalog is requested to statsToProperties and statsFromProperties.
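The key format can be illustrated with a tiny sketch (a hypothetical re-implementation for illustration only, not the actual Spark code):

// Hypothetical helper mirroring the documented key format
// spark.sql.statistics.colStats.[columnName].[statKey]
def columnStatKeyPropName(columnName: String, statKey: String): String =
  s"spark.sql.statistics.colStats.$columnName.$statKey"

columnStatKeyPropName("id", "max")
// res: String = spark.sql.statistics.colStats.id.max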

getBucketSpecFromTableProperties Internal Method

getBucketSpecFromTableProperties…​FIXME

Note
getBucketSpecFromTableProperties is used when HiveExternalCatalog is requested to restoreHiveSerdeTable or restoreDataSourceTable.

Restoring Hive Serde Table — restoreHiveSerdeTable Internal Method

restoreHiveSerdeTable…​FIXME

Note
restoreHiveSerdeTable is used exclusively when HiveExternalCatalog is requested to restoreTableMetadata (when there is no provider specified in table properties, which means this is a Hive serde table).

Restoring Data Source Table — restoreDataSourceTable Internal Method

restoreDataSourceTable…​FIXME

Note
restoreDataSourceTable is used exclusively when HiveExternalCatalog is requested to restoreTableMetadata (for regular data source table with provider specified in table properties).

restoreTableMetadata Internal Method

restoreTableMetadata…​FIXME

Note

restoreTableMetadata is used when HiveExternalCatalog is requested for:

Retrieving CatalogTablePartition of Table — listPartitions Method

Note
listPartitions is part of the ExternalCatalog Contract to list partitions of a table.

listPartitions…​FIXME

doCreateTable Method

Note
doCreateTable is part of the ExternalCatalog Contract to…​FIXME.

doCreateTable…​FIXME

tableMetaToTableProps Internal Method

tableMetaToTableProps…​FIXME

Note
tableMetaToTableProps is used when HiveExternalCatalog is requested to doAlterTableDataSchema and doCreateTable (and createDataSourceTable).

doAlterTableDataSchema Method

Note
doAlterTableDataSchema is part of the ExternalCatalog Contract to…​FIXME.

doAlterTableDataSchema…​FIXME

createDataSourceTable Internal Method

createDataSourceTable…​FIXME

Note
createDataSourceTable is used exclusively when HiveExternalCatalog is requested to doCreateTable (for non-hive providers).

InMemoryCatalog


InMemoryCatalog is…​FIXME

listPartitionsByFilter Method

Note
listPartitionsByFilter is part of ExternalCatalog Contract to…​FIXME.

listPartitionsByFilter…​FIXME

ExternalCatalog Contract — External Catalog (Metastore) of Permanent Relational Entities


ExternalCatalog is the contract of an external system catalog (aka metadata registry or metastore) of permanent relational entities, i.e. databases, tables, partitions, and functions.

Table 1. ExternalCatalog’s Features per Relational Entity
Feature | Database | Function | Partition | Table
Alter | alterDatabase | alterFunction | alterPartitions | alterTable, alterTableDataSchema, alterTableStats
Create | createDatabase | createFunction | createPartitions | createTable
Drop | dropDatabase | dropFunction | dropPartitions | dropTable
Get | getDatabase | getFunction | getPartition, getPartitionOption | getTable
List | listDatabases | listFunctions | listPartitionNames, listPartitions, listPartitionsByFilter | listTables
Load | - | - | loadDynamicPartitions, loadPartition | loadTable
Rename | - | renameFunction | renamePartitions | renameTable
Check Existence | databaseExists | functionExists | - | tableExists
Set | setCurrentDatabase | - | - | -

Table 2. ExternalCatalog Contract (incl. Protected Methods)
Method Description

alterPartitions

createPartitions

databaseExists

doAlterDatabase

doAlterFunction

doAlterTable

doAlterTableDataSchema

doAlterTableStats

doCreateDatabase

doCreateFunction

doCreateTable

doDropDatabase

doDropFunction

doDropTable

doRenameFunction

doRenameTable

dropPartitions

functionExists

getDatabase

getFunction

getPartition

getPartitionOption

getTable

listDatabases

listFunctions

listPartitionNames

listPartitions

listPartitionsByFilter

listTables

loadDynamicPartitions

loadPartition

loadTable

renamePartitions

setCurrentDatabase

tableExists

ExternalCatalog is available as externalCatalog of SharedState (in SparkSession).

ExternalCatalog comes in two flavors: ephemeral in-memory and persistent Hive-aware.

Table 3. ExternalCatalogs
ExternalCatalog | Alias | Description
HiveExternalCatalog | hive | A persistent system catalog using a Hive metastore.
InMemoryCatalog | in-memory | An in-memory (ephemeral) system catalog that does not require setting up external systems (like a Hive metastore). It is intended for testing or exploration purposes only and therefore should not be used in production.

The concrete ExternalCatalog is chosen using Builder.enableHiveSupport that enables the Hive support (and sets spark.sql.catalogImplementation configuration property to hive when the Hive classes are available).

Tip

Set spark.sql.catalogImplementation to in-memory when starting spark-shell to use InMemoryCatalog external catalog.

Important

You cannot change the ExternalCatalog implementation after SparkSession has been created, since spark.sql.catalogImplementation is a static configuration property.

ExternalCatalog is a ListenerBus of ExternalCatalogEventListener listeners that handle ExternalCatalogEvent events.

Tip

Use addListener and removeListener to register and de-register ExternalCatalogEventListener listeners, accordingly.

Read ListenerBus Event Bus Contract in Mastering Apache Spark 2 gitbook to learn more about Spark Core’s ListenerBus interface.
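A hedged sketch of registering such a listener follows; it relies on internal/unstable APIs (sharedState and externalCatalog), so the exact types and accessibility may differ across Spark versions.

import org.apache.spark.sql.catalyst.catalog.{ExternalCatalogEvent, ExternalCatalogEventListener}

// Print every external catalog event, e.g. database or table creation.
spark.sharedState.externalCatalog.addListener(new ExternalCatalogEventListener {
  override def onEvent(event: ExternalCatalogEvent): Unit =
    println(s"External catalog event: $event")
})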

Altering Table Statistics — alterTableStats Method

alterTableStats…​FIXME

Note
alterTableStats is used exclusively when SessionCatalog is requested for altering the statistics of a table in a metastore (that can happen when any logical command is executed that could change the table statistics).

Altering Table — alterTable Method

alterTable…​FIXME

Note
alterTable is used exclusively when SessionCatalog is requested to alter a table in a metastore.

createTable Method

createTable…​FIXME

Note
createTable is used when…​FIXME

alterTableDataSchema Method

alterTableDataSchema…​FIXME

Note
alterTableDataSchema is used exclusively when SessionCatalog is requested to alterTableDataSchema.

ExperimentalMethods


ExperimentalMethods holds extra optimizations and strategies that are used in SparkOptimizer and SparkPlanner, respectively.

Table 1. ExperimentalMethods’ Attributes
Name Description

extraOptimizations

Collection of rules to optimize LogicalPlans (i.e. Rule[LogicalPlan] objects)

Used when SparkOptimizer is requested for the User Provided Optimizers

extraStrategies

Collection of SparkStrategies

Used when SessionState is requested for the SparkPlanner

ExperimentalMethods is available as the experimental property of a SparkSession.

Example
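A minimal sketch of such an example (the names are arbitrary): a do-nothing optimization rule registered through the experimental property of a SparkSession.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing logical optimization rule, for illustration only.
object NoopOptimization extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register it as a user-provided optimization (assuming a SparkSession `spark`).
spark.experimental.extraOptimizations = Seq(NoopOptimization)

// SparkOptimizer now includes it among the User Provided Optimizers.
spark.range(1).queryExecution.optimizedPlan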

ExecutionListenerManager — Management Interface of QueryExecutionListeners


ExecutionListenerManager is the management interface for QueryExecutionListeners that listen for execution metrics:

  • Name of the action (that triggered a query execution)

  • QueryExecution

  • Execution time of this query (in nanoseconds)

ExecutionListenerManager is available as listenerManager property of SparkSession (and listenerManager property of SessionState).

ExecutionListenerManager takes a single SparkConf when created.

While being created, ExecutionListenerManager reads the spark.sql.queryExecutionListeners configuration property, i.e. the list of QueryExecutionListeners that should be automatically added to newly created sessions, and registers them.

Table 1. ExecutionListenerManager’s Public Methods
Method Description

register

unregister

clear

ExecutionListenerManager is created exclusively when BaseSessionStateBuilder is requested for ExecutionListenerManager (while SessionState is built).

ExecutionListenerManager uses listeners internal registry for registered QueryExecutionListeners.

onSuccess Internal Method

onSuccess…​FIXME

Note

onSuccess is used when:

onFailure Internal Method

onFailure…​FIXME

Note

onFailure is used when:

withErrorHandling Internal Method

withErrorHandling…​FIXME

Note
withErrorHandling is used when ExecutionListenerManager is requested to onSuccess and onFailure.

Registering QueryExecutionListener — register Method

Internally, register simply registers (adds) the input QueryExecutionListener to the listeners internal registry.
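For example (a sketch, assuming a SparkSession available as spark and the Spark 2.x listener signatures), a QueryExecutionListener that prints the action name and execution time can be registered as follows.

import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// A listener that reports successful and failed query executions.
val listener = new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName took ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: $exception")
}

spark.listenerManager.register(listener)
spark.range(10).count()  // triggers onSuccess
// spark.listenerManager.unregister(listener)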

CatalogImpl


CatalogImpl is the Catalog in Spark SQL that…​FIXME

Figure 1. CatalogImpl uses SessionCatalog (through SparkSession)
Note
CatalogImpl is in org.apache.spark.sql.internal package.

Creating Table — createTable Method

Note
createTable is part of Catalog Contract to…​FIXME.

createTable…​FIXME

getTable Method

Note
getTable is part of Catalog Contract to…​FIXME.

getTable…​FIXME

functionExists Method

Caution
FIXME

Caching Table or View In-Memory — cacheTable Method

Internally, cacheTable first creates a DataFrame for the table followed by requesting CacheManager to cache it.

Note
cacheTable uses the session-scoped SharedState to access the CacheManager.
Note
cacheTable is part of Catalog contract.
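A quick usage sketch (assuming a SparkSession available as spark; the view name is arbitrary):

// Create a temporary view and cache it through the Catalog interface.
spark.range(10).createOrReplaceTempView("t1")

spark.catalog.cacheTable("t1")   // builds a DataFrame for t1 and asks CacheManager to cache it
spark.catalog.isCached("t1")     // true
spark.catalog.uncacheTable("t1")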

Removing All Cached Tables From In-Memory Cache — clearCache Method

clearCache requests CacheManager to remove all cached tables from in-memory cache.

Note
clearCache is part of Catalog contract.

Creating External Table From Path — createExternalTable Method

createExternalTable creates an external table tableName from the given path and returns the corresponding DataFrame.

The source input parameter is the name of the data source provider for the table, e.g. parquet, json, text. If not specified, createExternalTable uses the spark.sql.sources.default setting to determine the data source format.

Note
The source input parameter must not be hive as it leads to an AnalysisException.

createExternalTable sets the mandatory path option when specified explicitly in the input parameter list.

createExternalTable parses tableName into a TableIdentifier (using SparkSqlParser). It creates a CatalogTable and then executes (by toRDD) a CreateTable logical plan. The result DataFrame is a Dataset[Row] with a QueryExecution (after executing a SubqueryAlias logical plan) and a RowEncoder.

Figure 2. CatalogImpl.createExternalTable
Note
createExternalTable is part of Catalog contract.
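A usage sketch follows (assuming a SparkSession available as spark; the table name and path are arbitrary). Note that createExternalTable has been deprecated in favor of Catalog.createTable since Spark 2.2.

// Create an external (unmanaged) table over existing parquet files.
val users = spark.catalog.createExternalTable(
  tableName = "users",
  path = "/tmp/users.parquet",  // hypothetical path
  source = "parquet")

users.printSchema()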

Listing Tables in Database (as Dataset) — listTables Method

Note
listTables is part of Catalog Contract to get a list of tables in the specified database.

Internally, listTables requests SessionCatalog to list all tables in the specified dbName database and converts them to Tables.

In the end, listTables creates a Dataset with the tables.

Listing Columns of Table (as Dataset) — listColumns Method

Note
listColumns is part of Catalog Contract to…​FIXME.

listColumns requests SessionCatalog for the table metadata.

listColumns takes the schema from the table metadata and creates a Column for every field (with the optional comment as the description).

In the end, listColumns creates a Dataset with the columns.

Converting TableIdentifier to Table — makeTable Internal Method

makeTable creates a Table using the input TableIdentifier and the table metadata (from the current SessionCatalog) if available.

Note
makeTable uses SparkSession to access SessionState that is then used to access SessionCatalog.
Note
makeTable is used when CatalogImpl is requested to listTables or getTable.

Creating Dataset from DefinedByConstructorParams Data — makeDataset Method

makeDataset creates an ExpressionEncoder (from DefinedByConstructorParams) and encodes elements of the input data to internal binary rows.

makeDataset then creates a LocalRelation logical operator. makeDataset requests SessionState to execute the plan and creates the result Dataset.

Note
makeDataset is used when CatalogImpl is requested to list databases, tables, functions and columns

Refreshing Analyzed Logical Plan of Table Query and Re-Caching It — refreshTable Method

Note
refreshTable is part of Catalog Contract to…​FIXME.

refreshTable requests SessionState for the SQL parser to parse a TableIdentifier given the table name.

Note
refreshTable uses SparkSession to access the SessionState.

refreshTable requests SessionCatalog for the table metadata.

For a temporary or persistent VIEW table, refreshTable requests the analyzed logical plan of the DataFrame (for the table) to refresh itself.

For other types of table, refreshTable requests SessionCatalog for refreshing the table metadata (i.e. invalidating the table).

If the table has been cached, refreshTable requests CacheManager to uncache and cache the table DataFrame again.

Note
refreshTable uses SparkSession to access the SharedState that is used to access CacheManager.

refreshByPath Method

Note
refreshByPath is part of Catalog Contract to…​FIXME.

refreshByPath…​FIXME

listColumns Internal Method

listColumns…​FIXME

Note
listColumns is used exclusively when CatalogImpl is requested to listColumns.

Catalog — Metastore Management Interface


Catalog is the interface for managing a metastore (aka metadata catalog) of relational entities (e.g. database(s), tables, functions, table columns and temporary views).

Catalog is available using SparkSession.catalog property.
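A quick tour of the interface (a sketch, assuming a SparkSession available as spark):

import org.apache.spark.sql.functions.col

val catalog = spark.catalog
catalog.currentDatabase                  // e.g. "default"
catalog.listDatabases().show()
catalog.listTables().show()
catalog.listFunctions().filter(col("name").contains("count")).show()
catalog.tableExists("t1")                // false unless such a table exists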

Table 1. Catalog Contract
Method Description

cacheTable

Caches the specified table in memory

Used for SQL’s CACHE TABLE and AlterTableRenameCommand command.

clearCache

createTable

currentDatabase

databaseExists

dropGlobalTempView

dropTempView

functionExists

getDatabase

getFunction

getTable

isCached

listColumns

listDatabases

listFunctions

listTables

recoverPartitions

refreshByPath

refreshTable

setCurrentDatabase

tableExists

uncacheTable

Note
CatalogImpl is the one and only known implementation of the Catalog Contract in Apache Spark.

Configuration Properties


Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.

You can set a configuration property in a SparkSession while creating a new instance using the config method, e.g. to set spark.sql.warehouse.dir for the Spark SQL session.

You can also set a property using SQL SET command.
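A sketch of both approaches (the application name and warehouse location are arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("conf-demo")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // set while creating the session
  .getOrCreate()

// Runtime (non-static) properties can also be changed on an existing session:
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.sql("SET spark.sql.shuffle.partitions=8")  // the equivalent SQL SET command
spark.conf.get("spark.sql.shuffle.partitions")   // "8"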

Table 1. Spark SQL Configuration Properties
Name Default Description

spark.sql.adaptive.enabled

false

Use SQLConf.adaptiveExecutionEnabled method to access the current value.

spark.sql.allowMultipleContexts

true

Controls whether creating multiple SQLContexts/HiveContexts is allowed

spark.sql.autoBroadcastJoinThreshold

10L * 1024 * 1024 (10M)

Maximum size (in bytes) for a table that will be broadcast to all worker nodes when performing a join.

If the size of a table (based on the statistics of its logical plan) is at most this setting, the DataFrame is broadcast for the join.

Negative values or 0 disable broadcasting.

Use SQLConf.autoBroadcastJoinThreshold method to access the current value.

spark.sql.avro.compression.codec

snappy

The compression codec to use when writing Avro data to disk

The supported codecs are:

  • uncompressed

  • deflate

  • snappy

  • bzip2

  • xz

Use SQLConf.avroCompressionCodec method to access the current value.

spark.sql.broadcastTimeout

5 * 60

Timeout in seconds for the broadcast wait time in broadcast joins.

When negative, it is assumed infinite (i.e. Duration.Inf)

Use SQLConf.broadcastTimeout method to access the current value.

spark.sql.caseSensitive

false

(internal) Controls whether the query analyzer should be case sensitive (true) or not (false). It is highly discouraged to turn on case sensitive mode.

Use SQLConf.caseSensitiveAnalysis method to access the current value.

spark.sql.cbo.enabled

false

Enables cost-based optimization (CBO) for estimation of plan statistics when true.

Use SQLConf.cboEnabled method to access the current value.

spark.sql.cbo.joinReorder.enabled

false

Enables join reorder for cost-based optimization (CBO).

Use SQLConf.joinReorderEnabled method to access the current value.

spark.sql.cbo.starSchemaDetection

false

Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization.

Use SQLConf.starSchemaDetection method to access the current value.

spark.sql.codegen.comments

false

Controls whether CodegenContext should register comments (true) or not (false).

spark.sql.codegen.factoryMode

FALLBACK

(internal) Determines the codegen generator fallback behavior

Acceptable values:

  • CODEGEN_ONLY – disable fallback mode

  • FALLBACK – try codegen first and, if any compile error happens, fallback to interpreted mode

  • NO_CODEGEN – skips codegen and always uses interpreted path

Used when CodeGeneratorWithInterpretedFallback is requested to createObject (when UnsafeProjection is requested to create an UnsafeProjection for Catalyst expressions)

spark.sql.codegen.fallback

true

(internal) Whether whole-stage codegen could be temporarily disabled for the part of a query that has failed to compile generated code (true) or not (false).

Use SQLConf.wholeStageFallback method to access the current value.

spark.sql.codegen.hugeMethodLimit

65535

(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen.

The default value 65535 is the largest bytecode size possible for a valid Java method. When running on HotSpot, it may be preferable to set the value to 8000 (which is the value of HugeMethodLimit in the OpenJDK JVM settings)

Use SQLConf.hugeMethodLimit method to access the current value.

spark.sql.codegen.useIdInClassName

true

(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix (true) or not (false)

Use SQLConf.wholeStageUseIdInClassName method to access the current value.

spark.sql.codegen.maxFields

100

(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen.

Use SQLConf.wholeStageMaxNumFields method to access the current value.

spark.sql.codegen.splitConsumeFuncByOperator

true

(internal) Controls whether whole stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid oversized functions that can miss the opportunity of JIT optimization.

Use SQLConf.wholeStageSplitConsumeFuncByOperator method to access the current value.

spark.sql.codegen.wholeStage

true

(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (true) or not (false).

Use SQLConf.wholeStageEnabled method to access the current value.

spark.sql.columnVector.offheap.enabled

false

(internal) Enables OffHeapColumnVector in ColumnarBatch (true) or not (false). When disabled, OnHeapColumnVector is used instead.

Use SQLConf.offHeapColumnVectorEnabled method to access the current value.

spark.sql.columnNameOfCorruptRecord

spark.sql.defaultSizeInBytes

Java’s Long.MaxValue

(internal) Estimated size of a table or relation used in query planning

Set to Java’s Long.MaxValue which is larger than spark.sql.autoBroadcastJoinThreshold to be more conservative. That is to say by default the optimizer will not choose to broadcast a table unless it knows for sure that the table size is small enough.

Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast.

Use SQLConf.defaultSizeInBytes method to access the current value.

spark.sql.dialect

spark.sql.exchange.reuse

true

(internal) When enabled (i.e. true), the Spark planner will find duplicated exchanges and subqueries and re-use them.

Note
When disabled (i.e. false), ReuseSubquery and ReuseExchange physical optimizations (that the Spark planner uses for physical query plan optimization) do nothing.

Use SQLConf.exchangeReuseEnabled method to access the current value.

spark.sql.execution.useObjectHashAggregateExec

true

Enables ObjectHashAggregateExec when Aggregation execution planning strategy is executed.

Use SQLConf.useObjectHashAggregation method to access the current value.

spark.sql.files.ignoreCorruptFiles

false

Controls whether to ignore corrupt files (true) or not (false). If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned.

Use SQLConf.ignoreCorruptFiles method to access the current value.

spark.sql.files.ignoreMissingFiles

false

Controls whether to ignore missing files (true) or not (false). If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned.

Use SQLConf.ignoreMissingFiles method to access the current value.

spark.sql.hive.convertMetastoreOrc

true

(internal) When enabled (i.e. true), the built-in ORC reader and writer are used to process ORC tables created by using the HiveQL syntax (instead of Hive serde).

spark.sql.hive.convertMetastoreParquet

true

Controls whether to use the built-in Parquet reader and writer to process parquet tables created by using the HiveQL syntax (instead of Hive serde).

spark.sql.hive.convertMetastoreParquet.mergeSchema

false

Enables trying to merge possibly different but compatible Parquet schemas in different Parquet data files.

This configuration is only effective when spark.sql.hive.convertMetastoreParquet is enabled.

spark.sql.hive.manageFilesourcePartitions

true

Enables metastore partition management for file source tables. This includes both datasource and converted Hive tables.

When enabled (true), datasource tables store partition metadata in the Hive metastore, and use the metastore to prune partitions during query planning.

Use SQLConf.manageFilesourcePartitions method to access the current value.

spark.sql.hive.metastore.barrierPrefixes

(empty)

Comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with, e.g. Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*)

spark.sql.hive.metastore.jars

builtin

Location of the jars that should be used to create a HiveClientImpl.

Supported locations:

  1. builtin – the jars that were used to load Spark SQL (aka Spark classes). Valid only when using the execution version of Hive, i.e. spark.sql.hive.metastore.version

  2. maven – download the Hive jars from Maven repositories

  3. Classpath in the standard format for both Hive and Hadoop

spark.sql.hive.metastore.sharedPrefixes

"com.mysql.jdbc", "org.postgresql", "com.microsoft.sqlserver", "oracle.jdbc"

Comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive.

An example of classes that should be shared are:

  • JDBC drivers that are needed to talk to the metastore

  • Other classes that interact with classes that are already shared, e.g. custom appenders that are used by log4j

spark.sql.hive.metastore.version

1.2.1

Version of the Hive metastore (and the client classes and jars).

Supported versions range from 0.12.0 up to and including 2.3.2.

spark.sql.inMemoryColumnarStorage.batchSize

10000

(internal) Controls…​FIXME

Use SQLConf.columnBatchSize method to access the current value.

spark.sql.inMemoryColumnarStorage.compressed

true

(internal) Controls…​FIXME

Use SQLConf.useCompression method to access the current value.

spark.sql.inMemoryColumnarStorage.enableVectorizedReader

true

Enables vectorized reader for columnar caching.

Use SQLConf.cacheVectorizedReaderEnabled method to access the current value.

spark.sql.inMemoryColumnarStorage.partitionPruning

true

(internal) Enables partition pruning for in-memory columnar tables

Use SQLConf.inMemoryPartitionPruning method to access the current value.

spark.sql.join.preferSortMergeJoin

true

(internal) Controls whether JoinSelection execution planning strategy prefers sort merge join over shuffled hash join.

Use SQLConf.preferSortMergeJoin method to access the current value.

spark.sql.limit.scaleUpFactor

4

(internal) Minimal increase rate in the number of partitions between attempts when executing take operator on a structured query. Higher values lead to more partitions read. Lower values might lead to longer execution times as more jobs will be run.

Use SQLConf.limitScaleUpFactor method to access the current value.

spark.sql.optimizer.excludedRules

(empty)

Comma-separated list of optimization rule names that should be disabled (excluded) in the optimizer. The optimizer will log the rules that have indeed been excluded.

Note
It is not guaranteed that all the rules in this configuration will eventually be excluded, as some rules are necessary for correctness.

Use SQLConf.optimizerExcludedRules method to access the current value.

spark.sql.optimizer.inSetConversionThreshold

10

(internal) The threshold of set size for InSet conversion.

Use SQLConf.optimizerInSetConversionThreshold method to access the current value.

spark.sql.optimizer.maxIterations

100

Maximum number of iterations for Analyzer and Optimizer.

spark.sql.orc.impl

native

(internal) When native, use the native version of ORC support instead of the ORC library in Hive 1.2.1.

Acceptable values:

  • hive

  • native

spark.sql.parquet.binaryAsString

false

Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

Use SQLConf.isParquetBinaryAsString method to access the current value.

spark.sql.parquet.int96AsTimestamp

true

Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

Use SQLConf.isParquetINT96AsTimestamp method to access the current value.

spark.sql.parquet.enableVectorizedReader

true

Enables vectorized parquet decoding.

Use SQLConf.parquetVectorizedReaderEnabled method to access the current value.

spark.sql.parquet.filterPushdown

true

Controls the filter predicate push-down optimization for data sources using parquet file format

Use SQLConf.parquetFilterPushDown method to access the current value.

spark.sql.parquet.int96TimestampConversion

false

Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala. This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark.

Use SQLConf.isParquetINT96TimestampConversion method to access the current value.

spark.sql.parquet.recordLevelFilter.enabled

false

Enables Parquet’s native record-level filtering using the pushed down filters.

Note
This configuration only has an effect when spark.sql.parquet.filterPushdown is enabled (and it is by default).

Use SQLConf.parquetRecordFilterEnabled method to access the current value.

spark.sql.parser.quotedRegexColumnNames

false

Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions.

Use SQLConf.supportQuotedRegexColumnName method to access the current value.

spark.sql.sort.enableRadixSort

true

(internal) Controls whether to use radix sort (true) or not (false) in ShuffleExchangeExec and SortExec physical operators

Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more).

Use SQLConf.enableRadixSort method to access the current value.

spark.sql.sources.commitProtocolClass

SQLHadoopMapReduceCommitProtocol

(internal)

Use SQLConf.fileCommitProtocolClass method to access the current value.

spark.sql.sources.partitionOverwriteMode

static

When INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):

  • static – Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting

  • dynamic – Spark doesn’t delete partitions ahead of time, and only overwrites those partitions that have data written into them

The default (static) keeps the same behavior as Spark prior to 2.3. Note that this config doesn’t affect Hive serde tables, as they are always overwritten with dynamic mode.

Use SQLConf.partitionOverwriteMode method to access the current value.

spark.sql.pivotMaxValues

10000

Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column)

Use SQLConf.dataFramePivotMaxValues method to access the current value.

spark.sql.redaction.options.regex

(?i)secret|password

Regular expression to find options of a Spark SQL command with sensitive information

The values of the options matched will be redacted in the explain output.

This redaction is applied on top of the global redaction configuration defined by spark.redaction.regex configuration.

Used exclusively when SQLConf is requested to redactOptions.

spark.sql.redaction.string.regex

(undefined)

Regular expression to point at sensitive information in text output

When this regex matches a string part, that string part is replaced by a dummy value (i.e. ***(redacted)). This is currently used to redact the output of SQL explain commands.

Note
When this conf is not set, the value of spark.redaction.string.regex is used instead.

Use SQLConf.stringRedactionPattern method to access the current value.

spark.sql.retainGroupColumns

true

Controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators).

Use SQLConf.dataFrameRetainGroupColumns method to access the current value.

spark.sql.runSQLOnFiles

true

(internal) Controls whether Spark SQL could use datasource.path as a table in a SQL query.

Use SQLConf.runSQLonFile method to access the current value.

spark.sql.selfJoinAutoResolveAmbiguity

true

Controls whether to resolve ambiguity in join conditions for self-joins automatically.

spark.sql.session.timeZone

Java’s TimeZone.getDefault.getID

The ID of session-local timezone, e.g. “GMT”, “America/Los_Angeles”, etc.

Use SQLConf.sessionLocalTimeZone method to access the current value.

spark.sql.shuffle.partitions

200

Number of partitions to use by default when shuffling data for joins or aggregations

Corresponds to Apache Hive’s mapred.reduce.tasks property that Spark considers deprecated.

Use SQLConf.numShufflePartitions method to access the current value.

spark.sql.sources.bucketing.enabled

true

Enables bucketing support. When disabled (i.e. false), bucketed tables are considered regular (non-bucketed) tables.

Use SQLConf.bucketingEnabled method to access the current value.

spark.sql.sources.default

parquet

Defines the default data source to use for DataFrameReader.

Used when:

spark.sql.statistics.fallBackToHdfs

false

Enables automatic calculation of table size statistic by falling back to HDFS if the table statistics are not available from table metadata.

This can be useful in determining if a table is small enough for auto broadcast joins in query planning.

Use SQLConf.fallBackToHdfsForStatsEnabled method to access the current value.

spark.sql.statistics.histogram.enabled

false

Enables generating histograms when computing column statistics

Note
Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.

Use SQLConf.histogramEnabled method to access the current value.

spark.sql.statistics.histogram.numBins

254

(internal) The number of bins when generating histograms.

Note
The number of bins must be greater than 1.

Use SQLConf.histogramNumBins method to access the current value.

spark.sql.statistics.parallelFileListingInStatsComputation.enabled

true

(internal) Enables parallel file listing in SQL commands, e.g. ANALYZE TABLE (as opposed to single thread listing that can be particularly slow with tables with hundreds of partitions)

Use SQLConf.parallelFileListingInStatsComputation method to access the current value.

spark.sql.statistics.size.autoUpdate.enabled

false

Enables automatic update of the table size statistic of a table after the table has changed.

Important
If the total number of files of the table is very large this can be expensive and slow down data change commands.

Use SQLConf.autoSizeUpdateEnabled method to access the current value.

spark.sql.subexpressionElimination.enabled

true

(internal) Enables subexpression elimination

Use subexpressionEliminationEnabled method to access the current value.

spark.sql.TungstenAggregate.testFallbackStartsAt

(empty)

A comma-separated pair of numbers, e.g. 5,10, that HashAggregateExec uses to inform TungstenAggregationIterator to switch to a sort-based aggregation when the hash-based approach is unable to acquire enough memory.

spark.sql.ui.retainedExecutions

1000

The number of SQLExecutionUIData entries to keep in failedExecutions and completedExecutions internal registries.

When a query execution finishes, the execution is removed from the internal activeExecutions registry and stored in failedExecutions or completedExecutions given the end execution status. It is when SQLListener makes sure that the number of SQLExecutionUIData entries does not exceed the spark.sql.ui.retainedExecutions Spark property and removes the excess of entries.

spark.sql.windowExec.buffer.in.memory.threshold

4096

(internal) Threshold for number of rows guaranteed to be held in memory by WindowExec physical operator.

Use windowExecBufferInMemoryThreshold method to access the current value.

spark.sql.windowExec.buffer.spill.threshold

4096

(internal) Threshold for number of rows buffered in a WindowExec physical operator.

Use windowExecBufferSpillThreshold method to access the current value.
