
SessionStateBuilder


SessionStateBuilder is…​FIXME

BaseSessionStateBuilder — Generic Builder of SessionState


BaseSessionStateBuilder is the contract of builder objects that coordinate construction of a new SessionState.

Table 1. BaseSessionStateBuilders

  • SessionStateBuilder

  • HiveSessionStateBuilder

BaseSessionStateBuilder is created when SparkSession is requested for a SessionState.


BaseSessionStateBuilder requires that implementations define the newBuilder method that SparkSession uses (indirectly) when requested for a SessionState (per the spark.sql.catalogImplementation internal configuration property).

Note
BaseSessionStateBuilder and spark.sql.catalogImplementation configuration property allow for Hive and non-Hive Spark deployments.
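The selection can be sketched in plain Scala (a simplified illustration, not Spark's actual dispatch code; the two builder class names follow the table above):

```scala
// Illustrative model: spark.sql.catalogImplementation maps to one of the
// two known session-state builder classes.
object BuilderSelection {
  val builders: Map[String, String] = Map(
    "in-memory" -> "org.apache.spark.sql.internal.SessionStateBuilder",
    "hive"      -> "org.apache.spark.sql.hive.HiveSessionStateBuilder")

  def builderClassFor(catalogImplementation: String): String =
    builders.getOrElse(
      catalogImplementation,
      sys.error(s"Unexpected catalog implementation: $catalogImplementation"))
}
```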

BaseSessionStateBuilder holds properties that (together with newBuilder) are used to create a SessionState.

Table 2. BaseSessionStateBuilder’s Properties

  • analyzer: Logical analyzer

  • catalog: Used to create Analyzer, Optimizer and a SessionState itself. Note: HiveSessionStateBuilder manages its own Hive-aware HiveSessionCatalog.

  • conf: SQLConf

  • customOperatorOptimizationRules: Custom operator optimization rules to add to the base Operator Optimization batch. When requested for the custom rules, customOperatorOptimizationRules simply requests the SparkSessionExtensions to buildOptimizerRules.

  • experimentalMethods: ExperimentalMethods

  • extensions: SparkSessionExtensions

  • functionRegistry: FunctionRegistry

  • listenerManager: ExecutionListenerManager

  • optimizer: SparkOptimizer (downcast to the base Optimizer) that is created with the SessionCatalog and the ExperimentalMethods. Note that the SparkOptimizer adds the customOperatorOptimizationRules to the operator optimization rules. optimizer is used when BaseSessionStateBuilder is requested to create a SessionState (for the optimizerBuilder function to create an Optimizer when requested for the Optimizer).

  • planner: SparkPlanner

  • resourceLoader: SessionResourceLoader

  • sqlParser: ParserInterface

  • streamingQueryManager: Spark Structured Streaming’s StreamingQueryManager

  • udfRegistration: UDFRegistration

Note

BaseSessionStateBuilder defines a type alias NewBuilder for a function to create a BaseSessionStateBuilder.

Note
BaseSessionStateBuilder is an experimental and unstable API.

Creating Function to Build SessionState — createClone Method

createClone gives a function that takes a SparkSession and a SessionState and executes newBuilder followed by build.

Note
createClone is used exclusively when BaseSessionStateBuilder is requested for a SessionState.
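The NewBuilder alias and createClone can be modeled with plain Scala (all types below are simplified stand-ins, not Spark's actual classes):

```scala
// Illustrative model: createClone returns a function of (session, state)
// that runs newBuilder (seeded with the parent state) followed by build.
case class SparkSessionLike(name: String)
case class SessionStateLike(config: Map[String, String])

class SessionStateBuilderLike(
    session: SparkSessionLike,
    parentState: Option[SessionStateLike]) {

  // NewBuilder: creates a builder from a session and an optional parent state
  type NewBuilder = (SparkSessionLike, Option[SessionStateLike]) => SessionStateBuilderLike

  protected def newBuilder: NewBuilder =
    (s, parent) => new SessionStateBuilderLike(s, parent)

  // build creates a SessionState (here: just copying the parent's config)
  def build(): SessionStateLike =
    SessionStateLike(parentState.map(_.config).getOrElse(Map.empty))

  // createClone: newBuilder followed by build
  def createClone: (SparkSessionLike, SessionStateLike) => SessionStateLike =
    (s, state) => newBuilder(s, Some(state)).build()
}
```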

Creating SessionState Instance — build Method

build creates a SessionState with the following:

Note

build is used when:

Creating BaseSessionStateBuilder Instance

BaseSessionStateBuilder takes the following when created:

Getting Function to Create QueryExecution For LogicalPlan — createQueryExecution Method

createQueryExecution simply returns a function that takes a LogicalPlan and creates a QueryExecution with the SparkSession and the logical plan.

Note
createQueryExecution is used exclusively when BaseSessionStateBuilder is requested to create a SessionState instance.
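The shape of createQueryExecution is a curried function; a minimal plain-Scala sketch (illustrative types only):

```scala
// Illustrative model: createQueryExecution returns a function that takes a
// logical plan and pairs it with the session into a query execution.
case class LogicalPlanLike(sql: String)
case class QueryExecutionLike(session: String, plan: LogicalPlanLike)

def createQueryExecution(session: String): LogicalPlanLike => QueryExecutionLike =
  plan => QueryExecutionLike(session, plan)
```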

SessionState — State Separation Layer Between SparkSessions

SessionState is the state separation layer between Spark SQL sessions, including SQL configuration, tables, functions, UDFs, SQL parser, and everything else that depends on a SQLConf.

SessionState is available as the sessionState property of a SparkSession.

SessionState is created when SparkSession is requested to instantiateSessionState (when requested for the SessionState per spark.sql.catalogImplementation configuration property).

Figure 1. Creating SessionState
Note

When requested for the SessionState, SparkSession uses spark.sql.catalogImplementation configuration property to load and create a BaseSessionStateBuilder that is then requested to create a SessionState instance.

There are two BaseSessionStateBuilders available:

  • SessionStateBuilder for the in-memory catalog

  • HiveSessionStateBuilder for the hive catalog

The hive catalog is set when the SparkSession was created with Hive support enabled (using Builder.enableHiveSupport).
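For example, a Hive-enabled session is created with Spark's public SparkSession.Builder API (a configuration fragment; it requires a Spark deployment to run):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport sets spark.sql.catalogImplementation to "hive"
// under the covers, so SessionState is built by HiveSessionStateBuilder.
val spark = SparkSession.builder()
  .appName("hive-enabled-session")
  .enableHiveSupport()
  .getOrCreate()
```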

Table 1. SessionState’s (Lazily-Initialized) Attributes

  • analyzer (Analyzer): Spark Analyzer. Initialized lazily (i.e. only when requested the first time) using the analyzerBuilder factory function. Used when…FIXME

  • catalog (SessionCatalog): Metastore of tables and databases. Used when…FIXME

  • conf (SQLConf): FIXME. Used when…FIXME

  • experimentalMethods (ExperimentalMethods): FIXME. Used when…FIXME

  • functionRegistry (FunctionRegistry): FIXME. Used when…FIXME

  • functionResourceLoader (FunctionResourceLoader): FIXME. Used when…FIXME

  • listenerManager (ExecutionListenerManager): FIXME. Used when…FIXME

  • optimizer (Optimizer): Logical query plan optimizer. Used exclusively when QueryExecution creates an optimized logical plan.

  • resourceLoader (SessionResourceLoader): FIXME. Used when…FIXME

  • sqlParser (ParserInterface): FIXME. Used when…FIXME

  • streamingQueryManager (StreamingQueryManager): Used to manage streaming queries in Spark Structured Streaming

  • udfRegistration (UDFRegistration): Interface to register user-defined functions. Used when…FIXME
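The lazy initialization of these attributes follows the standard Scala lazy val pattern; a minimal self-contained illustration (hypothetical names, not Spark's classes):

```scala
// Illustrative model: the analyzerBuilder factory function is invoked only
// the first time the attribute is requested, and its result is cached.
class LazySessionStateLike(analyzerBuilder: () => String) {
  var builderCalls = 0 // for illustration only

  lazy val analyzer: String = {
    builderCalls += 1
    analyzerBuilder()
  }
}
```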

Note
SessionState is a private[sql] class and, given the package org.apache.spark.sql.internal, SessionState should be considered internal.

Creating SessionState Instance

SessionState takes the following when created:

apply Factory Methods

Caution
FIXME

  1. Passes sparkSession to the other apply with a new SQLConf

Note
apply is used when SparkSession is requested for SessionState.

clone Method

Caution
FIXME
Note
clone is used when…​

createAnalyzer Internal Method

createAnalyzer creates a logical query plan Analyzer with rules specific to a non-Hive SessionState.

Table 2. Analyzer’s Evaluation Rules for non-Hive SessionState (in the order of execution)

  • extendedResolutionRules:

    • FindDataSourceTable: Replaces InsertIntoTable (with CatalogRelation) and CatalogRelation logical plans with LogicalRelation

    • ResolveSQLOnFile

  • postHocResolutionRules:

    • PreprocessTableCreation

    • PreprocessTableInsertion

    • DataSourceAnalysis

  • extendedCheckRules:

    • PreWriteCheck

    • HiveOnlyCheck

Note
createAnalyzer is used when SessionState is created or cloned.

“Executing” Logical Plan (Creating QueryExecution For LogicalPlan) — executePlan Method

executePlan executes the createQueryExecution function on the input logical plan (which simply creates a QueryExecution with the current SparkSession and the input logical plan).

refreshTable Method

refreshTable is…​

addJar Method

addJar is…​

analyze Method

analyze is…​

Creating New Hadoop Configuration — newHadoopConf Method

newHadoopConf returns a Hadoop Configuration (with the SparkContext.hadoopConfiguration and all the configuration properties of the SQLConf).

Note
newHadoopConf is used by ScriptTransformation, ParquetRelation, StateStoreRDD, and SessionState itself, among a few other places.

Creating New Hadoop Configuration With Extra Options — newHadoopConfWithOptions Method

newHadoopConfWithOptions creates a new Hadoop Configuration with the input options set (except path and paths options that are skipped).
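The option filtering can be sketched with a plain Scala Map standing in for a Hadoop Configuration (an illustration, not the actual implementation):

```scala
// Illustrative model: apply the input options onto a base configuration,
// skipping the "path" and "paths" options.
def newHadoopConfWithOptions(
    base: Map[String, String],
    options: Map[String, String]): Map[String, String] =
  base ++ options.filter { case (k, _) => k != "path" && k != "paths" }
```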

Note

newHadoopConfWithOptions is used when:

HiveMetastoreCatalog — Legacy SessionCatalog for Converting Hive Metastore Relations to Data Source Relations


HiveMetastoreCatalog is a legacy session-scoped catalog of relational entities that HiveSessionCatalog uses exclusively for converting Hive metastore relations to data source relations (when RelationConversions logical evaluation rule is executed).

HiveMetastoreCatalog is created exclusively when HiveSessionStateBuilder is requested for SessionCatalog (and creates a HiveSessionCatalog).

Figure 1. HiveMetastoreCatalog, HiveSessionCatalog and HiveSessionStateBuilder


HiveMetastoreCatalog takes a SparkSession when created.

Converting HiveTableRelation to LogicalRelation — convertToLogicalRelation Method

convertToLogicalRelation…​FIXME

Note
convertToLogicalRelation is used exclusively when RelationConversions logical evaluation rule is requested to convert a HiveTableRelation to a LogicalRelation for parquet, native and hive ORC storage formats.

inferIfNeeded Internal Method

inferIfNeeded…​FIXME

Note
inferIfNeeded is used exclusively when HiveMetastoreCatalog is requested to convertToLogicalRelation.

HiveSessionCatalog — Hive-Specific Catalog of Relational Entities


HiveSessionCatalog is a session-scoped catalog of relational entities that is used when SparkSession was created with Hive support enabled.

Figure 1. HiveSessionCatalog and HiveSessionStateBuilder

HiveSessionCatalog is available as catalog property of SessionState when SparkSession was created with Hive support enabled (that in the end sets spark.sql.catalogImplementation internal configuration property to hive).

HiveSessionCatalog is created exclusively when HiveSessionStateBuilder is requested for the SessionCatalog.

HiveSessionCatalog uses the legacy HiveMetastoreCatalog (which is another session-scoped catalog of relational entities) exclusively to allow RelationConversions logical evaluation rule to convert Hive metastore relations to data source relations when executed.

Creating HiveSessionCatalog Instance

HiveSessionCatalog takes the following when created:

lookupFunction0 Internal Method

lookupFunction0…​FIXME

Note
lookupFunction0 is used when…​FIXME

BucketSpec — Bucketing Specification of Table


BucketSpec is the bucketing specification of a table, i.e. the metadata of the bucketing of a table.

BucketSpec includes the following:

  • Number of buckets

  • Bucket column names – the names of the columns used for buckets (at least one)

  • Sort column names – the names of the columns used to sort data in buckets

The number of buckets has to be between 0 and 100000 exclusive (or an AnalysisException is thrown).
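The bucketing metadata and its validation can be sketched as a simplified case class (a stand-in for the real one; Spark throws an AnalysisException, modeled here as the IllegalArgumentException that require raises):

```scala
// Simplified model of BucketSpec with the documented bounds check.
case class BucketSpecLike(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  require(numBuckets > 0 && numBuckets < 100000,
    s"Number of buckets should be greater than 0 but less than 100000. Got $numBuckets")
  require(bucketColumnNames.nonEmpty, "At least one bucket column is required")
}
```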

BucketSpec is created when:

  1. DataFrameWriter is requested to saveAsTable (and does getBucketSpec)

  2. HiveExternalCatalog is requested to getBucketSpecFromTableProperties and tableMetaToTableProps

  3. HiveClientImpl is requested to retrieve a table metadata

  4. SparkSqlAstBuilder is requested to visitBucketSpec (for CREATE TABLE SQL statement with CLUSTERED BY and INTO n BUCKETS with optional SORTED BY clauses)

BucketSpec uses the following text representation (i.e. toString):

Converting Bucketing Specification to LinkedHashMap — toLinkedHashMap Method

toLinkedHashMap converts the bucketing specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

toLinkedHashMap quotes the column names.
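A sketch of the conversion (field names and backtick quoting are assumptions for illustration, not necessarily Spark's exact output):

```scala
import scala.collection.mutable

// Illustrative model: build an insertion-ordered map of the bucketing
// fields, quoting column names with backticks.
def bucketSpecToLinkedHashMap(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]): mutable.LinkedHashMap[String, String] = {
  def quoted(names: Seq[String]): String =
    names.map(n => s"`$n`").mkString("[", ", ", "]")
  mutable.LinkedHashMap(
    "Num Buckets" -> numBuckets.toString,
    "Bucket Columns" -> quoted(bucketColumnNames),
    "Sort Columns" -> quoted(sortColumnNames))
}
```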

Note

toLinkedHashMap is used when:

CatalogTablePartition — Partition Specification of Table


CatalogTablePartition is the partition specification of a table, i.e. the metadata of the partitions of a table.

CatalogTablePartition is created when:

CatalogTablePartition can hold the table statistics that…​FIXME

The readable text representation of a CatalogTablePartition (aka simpleString) is…​FIXME

Note
simpleString is used exclusively when ShowTablesCommand is executed (with a partition specification).

CatalogTablePartition uses the following text representation (i.e. toString)…​FIXME

Creating CatalogTablePartition Instance

CatalogTablePartition takes the following when created:

Converting Partition Specification to LinkedHashMap — toLinkedHashMap Method

toLinkedHashMap converts the partition specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

Note

toLinkedHashMap is used when:

location Method

location simply returns the location URI of the CatalogStorageFormat or throws an AnalysisException:

Note
location is used when…​FIXME

CatalogStorageFormat — Storage Specification of Table or Partition


CatalogStorageFormat is the storage specification of a partition or a table, i.e. the metadata that includes the following:

  • Location URI

  • Input format

  • Output format

  • SerDe

  • compressed flag

  • Properties (as Map[String, String])

CatalogStorageFormat is created when:

CatalogStorageFormat uses the following text representation (i.e. toString)…​FIXME

Converting Storage Specification to LinkedHashMap — toLinkedHashMap Method

toLinkedHashMap…​FIXME

Note

toLinkedHashMap is used when:

CatalogTable — Table Specification (Native Table Metadata)


CatalogTable is the table specification, i.e. the metadata of a table that is stored in a session-scoped catalog of relational entities (i.e. SessionCatalog).

CatalogTable is created when:

The readable text representation of a CatalogTable (aka simpleString) is…​FIXME

Note
simpleString is used exclusively when ShowTablesCommand logical command is executed (with a partition specification).

CatalogTable uses the following text representation (i.e. toString)…​FIXME

CatalogTable is created with the optional bucketing specification that is used for the following:

Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)

You manage table metadata using the catalog interface (aka metastore). Among the management tasks is getting the statistics of a table (used for cost-based query optimization).

Note
The CatalogStatistics are optional when CatalogTable is created.
Caution
FIXME When are stats specified? What if there are not?

Unless CatalogStatistics are available in a table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size specified by spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto broadcast the table).

Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.

Unless CatalogStatistics are available in a table metadata (in a catalog) for HiveTableRelation (with the hive provider), the DetermineTableStats logical resolution rule can compute the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).

You can use AnalyzeColumnCommand, AnalyzePartitionCommand, AnalyzeTableCommand commands to record statistics in a catalog.

The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when spark.sql.statistics.size.autoUpdate.enabled property is turned on.

You can use DESCRIBE SQL command to show the histogram of a column if stored in a catalog.

dataSchema Method

dataSchema…​FIXME

Note
dataSchema is used when…​FIXME

partitionSchema Method

partitionSchema…​FIXME

Note
partitionSchema is used when…​FIXME

Converting Table Specification to LinkedHashMap — toLinkedHashMap Method

toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

Note

toLinkedHashMap is used when:

Creating CatalogTable Instance

CatalogTable takes the following when created:

  • TableIdentifier

  • CatalogTableType (i.e. EXTERNAL, MANAGED or VIEW)

  • CatalogStorageFormat

  • Schema

  • Name of the table provider (optional)

  • Partition column names

  • Optional Bucketing specification (default: None)

  • Owner

  • Create time

  • Last access time

  • Create version

  • Properties

  • Optional table statistics

  • Optional view text

  • Optional comment

  • Unsupported features

  • tracksPartitionsInCatalog flag

  • schemaPreservesCase flag

  • Ignored properties

database Method

database simply returns the database (of the TableIdentifier) or throws an AnalysisException:

Note
database is used when…​FIXME

SessionCatalog — Session-Scoped Catalog of Relational Entities


SessionCatalog is the catalog (registry) of relational entities, i.e. databases, tables, views, partitions, and functions (in a SparkSession).

Figure 1. SessionCatalog and Spark SQL Services

SessionCatalog uses the ExternalCatalog for the metadata of permanent entities (i.e. tables).

Note
SessionCatalog is a layer over ExternalCatalog in a SparkSession which allows for different metastores (i.e. in-memory or hive) to be used.

SessionCatalog is available through SessionState (of a SparkSession).

SessionCatalog is created when BaseSessionStateBuilder is requested for the SessionCatalog (when SessionState is requested for it).

Among the notable usages of SessionCatalog are creating an Analyzer and a SparkOptimizer.

Table 1. SessionCatalog’s Internal Properties (e.g. Registries, Counters and Flags)

  • currentDb: FIXME. Used when…FIXME

  • tableRelationCache: A cache of fully-qualified table names to table relation plans (i.e. LogicalPlan). Used when SessionCatalog refreshes a table

  • tempViews: Registry of temporary views (i.e. non-global temporary tables)

requireTableExists Internal Method

requireTableExists…​FIXME

Note
requireTableExists is used when…​FIXME

databaseExists Method

databaseExists…​FIXME

Note
databaseExists is used when…​FIXME

listTables Method

  1. Uses "*" as the pattern

listTables…​FIXME

Note

listTables is used when:

  • ShowTablesCommand logical command is requested to run

  • SessionCatalog is requested to reset (for testing)

  • CatalogImpl is requested to listTables (for testing)

Checking Whether Table Is Temporary View — isTemporaryTable Method

isTemporaryTable…​FIXME

Note
isTemporaryTable is used when…​FIXME

alterPartitions Method

alterPartitions…​FIXME

Note
alterPartitions is used when…​FIXME

listPartitions Method

listPartitions…​FIXME

Note
listPartitions is used when…​FIXME

alterTable Method

alterTable…​FIXME

Note
alterTable is used when AlterTableSetPropertiesCommand, AlterTableUnsetPropertiesCommand, AlterTableChangeColumnCommand, AlterTableSerDePropertiesCommand, AlterTableRecoverPartitionsCommand, AlterTableSetLocationCommand, AlterViewAsCommand (for permanent views) logical commands are executed.

Altering Table Statistics in Metastore (and Invalidating Internal Cache) — alterTableStats Method

alterTableStats requests ExternalCatalog to alter the statistics of the table (per identifier) followed by invalidating the table relation cache.

alterTableStats reports a NoSuchDatabaseException if the database does not exist.

alterTableStats reports a NoSuchTableException if the table does not exist.

Note

alterTableStats is used when the following logical commands are executed:

tableExists Method

tableExists…​FIXME

Note
tableExists is used when…​FIXME

functionExists Method

functionExists…​FIXME

Note

functionExists is used in:

listFunctions Method

listFunctions…​FIXME

Note
listFunctions is used when…​FIXME

Invalidating Table Relation Cache (aka Refreshing Table) — refreshTable Method

refreshTable…​FIXME

Note
refreshTable is used when…​FIXME

loadFunctionResources Method

loadFunctionResources…​FIXME

Note
loadFunctionResources is used when…​FIXME

Altering (Updating) Temporary View (Logical Plan) — alterTempViewDefinition Method

alterTempViewDefinition alters a temporary view by updating an in-memory temporary table (when no database is specified and the table has already been registered) or a global temporary table (when the specified database is the database of global temporary views).

Note
“Temporary table” and “temporary view” are synonyms.

alterTempViewDefinition returns true when an update could be executed and finished successfully.

Note
alterTempViewDefinition is used exclusively when AlterViewAsCommand logical command is executed.

Creating (Registering) Or Replacing Local Temporary View — createTempView Method

createTempView…​FIXME

Note
createTempView is used when…​FIXME

Creating (Registering) Or Replacing Global Temporary View — createGlobalTempView Method

createGlobalTempView simply requests the GlobalTempViewManager to create a global temporary view.

Note

createGlobalTempView is used when:

  • CreateViewCommand logical command is executed (for a global temporary view, i.e. when the view type is GlobalTempView)

  • CreateTempViewUsing logical command is executed (for a global temporary view, i.e. when the global flag is on)

createTable Method

createTable…​FIXME

Note
createTable is used when…​FIXME

Creating SessionCatalog Instance

SessionCatalog takes the following when created:

SessionCatalog initializes the internal registries and counters.

Finding Function by Name (Using FunctionRegistry) — lookupFunction Method

lookupFunction finds a function by name.

For a function with no database defined that exists in FunctionRegistry, lookupFunction requests FunctionRegistry to find the function (by its unqualified name, i.e. with no database).

If the function name has a database defined or the function does not exist in FunctionRegistry, lookupFunction uses the fully-qualified function name to check if the function exists in FunctionRegistry (by its fully-qualified name, i.e. with a database).

For other cases, lookupFunction requests ExternalCatalog to find the function and loads its resources. It then creates a corresponding temporary function and looks up the function again.
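The resolution order above can be sketched with plain Scala maps standing in for FunctionRegistry and ExternalCatalog (all names and types are illustrative):

```scala
import scala.collection.mutable

// Illustrative model of lookupFunction's resolution order:
// 1. unqualified name in the function registry (when no database is given)
// 2. fully-qualified name in the function registry
// 3. external catalog: load the function, register it, then return it
def lookupFunction(
    registry: mutable.Map[String, String],
    externalCatalog: Map[String, String],
    db: Option[String],
    name: String): Option[String] = {
  val qualified = s"${db.getOrElse("default")}.$name"
  if (db.isEmpty && registry.contains(name)) registry.get(name)
  else if (registry.contains(qualified)) registry.get(qualified)
  else externalCatalog.get(qualified).map { fn =>
    registry(qualified) = fn // register as a temporary function
    fn
  }
}
```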

Note

lookupFunction is used when:

Finding Relation (Table or View) in Catalogs — lookupRelation Method

lookupRelation finds the given table in the catalogs (i.e. GlobalTempViewManager, ExternalCatalog or the registry of temporary views) and gives a SubqueryAlias per table type.

Internally, lookupRelation looks up the name table using:

  1. GlobalTempViewManager when the database name of the table matches the name of GlobalTempViewManager

    1. Gives SubqueryAlias or reports a NoSuchTableException

  2. ExternalCatalog when the database name of the table is specified explicitly or the registry of temporary views does not contain the table

    1. Gives SubqueryAlias with View when the table is a view (aka temporary table)

    2. Gives SubqueryAlias with UnresolvedCatalogRelation otherwise

  3. The registry of temporary views

    1. Gives SubqueryAlias with the logical plan per the table as registered in the registry of temporary views
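The three-step resolution above can be sketched in plain Scala (illustrative types; strings stand in for logical plans):

```scala
// Illustrative model of lookupRelation's resolution order.
case class SubqueryAliasLike(alias: String, plan: String)

def lookupRelation(
    globalTempDb: String,
    globalTempViews: Map[String, String],
    tempViews: Map[String, String],
    externalCatalog: Map[String, String],
    db: Option[String],
    table: String): SubqueryAliasLike = db match {
  case Some(d) if d == globalTempDb =>
    // 1. GlobalTempViewManager, or NoSuchTableException
    globalTempViews.get(table)
      .map(SubqueryAliasLike(table, _))
      .getOrElse(sys.error(s"Table or view '$table' not found"))
  case Some(d) =>
    // 2. ExternalCatalog when a database is given explicitly
    SubqueryAliasLike(table, externalCatalog(s"$d.$table"))
  case None if !tempViews.contains(table) =>
    // 2. ExternalCatalog (default database) when not a temporary view
    SubqueryAliasLike(table, externalCatalog(s"default.$table"))
  case None =>
    // 3. Registry of temporary views
    SubqueryAliasLike(table, tempViews(table))
}
```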

Note
lookupRelation considers default to be the name of the database if the table name does not specify the database explicitly.
Note

lookupRelation is used when:

Retrieving Table Metadata from External Catalog (Metastore) — getTableMetadata Method

getTableMetadata simply requests external catalog (metastore) for the table metadata.

Before requesting the external metastore, getTableMetadata makes sure that the database and table (of the input TableIdentifier) both exist. If either does not exist, getTableMetadata reports a NoSuchDatabaseException or NoSuchTableException, respectively.

Retrieving Table Metadata — getTempViewOrPermanentTableMetadata Method

Internally, getTempViewOrPermanentTableMetadata branches off per database.

When a database name is not specified, getTempViewOrPermanentTableMetadata finds a local temporary view and creates a CatalogTable (with VIEW table type and an undefined storage) or retrieves the table metadata from an external catalog.

With the database name of the GlobalTempViewManager, getTempViewOrPermanentTableMetadata requests GlobalTempViewManager for the global view definition and creates a CatalogTable (with the name of GlobalTempViewManager in table identifier, VIEW table type and an undefined storage) or reports a NoSuchTableException.

With the database name not of GlobalTempViewManager, getTempViewOrPermanentTableMetadata simply retrieves the table metadata from an external catalog.

Note

getTempViewOrPermanentTableMetadata is used when:

Reporting NoSuchDatabaseException When Specified Database Does Not Exist — requireDbExists Internal Method

requireDbExists reports a NoSuchDatabaseException if the specified database does not exist. Otherwise, requireDbExists does nothing.

reset Method

reset…​FIXME

Note
reset is used exclusively in the Spark SQL internal tests.

Dropping Global Temporary View — dropGlobalTempView Method

dropGlobalTempView simply requests the GlobalTempViewManager to remove the given global temporary view.

Note
dropGlobalTempView is used when…​FIXME

Dropping Table — dropTable Method

dropTable…​FIXME

Note

dropTable is used when:

Getting Global Temporary View (Definition) — getGlobalTempView Method

getGlobalTempView…​FIXME

Note
getGlobalTempView is used when…​FIXME

registerFunction Method

registerFunction…​FIXME

Note

registerFunction is used when:

  • SessionCatalog is requested to lookupFunction

  • HiveSessionCatalog is requested to lookupFunction0

  • CreateFunctionCommand logical command is executed

lookupFunctionInfo Method

lookupFunctionInfo…​FIXME

Note
lookupFunctionInfo is used when…​FIXME

alterTableDataSchema Method

alterTableDataSchema…​FIXME

Note
alterTableDataSchema is used when…​FIXME
