SparkSession — The Entry Point to Spark SQL

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.

As a Spark developer, you create a SparkSession using the SparkSession.builder method (which gives you access to the Builder API that you use to configure the session).

Once created, SparkSession allows for creating a DataFrame (based on an RDD or a Scala Seq), creating a Dataset, accessing the Spark SQL services (e.g. ExperimentalMethods, ExecutionListenerManager, UDFRegistration), executing a SQL query, loading a table and, last but not least, accessing the DataFrameReader interface to load a dataset of the format of your choice (to some extent).

Note

The spark object in spark-shell (the instance of SparkSession that is auto-created) has Hive support enabled.

In order to disable the pre-configured Hive support in the spark object, set the spark.sql.catalogImplementation internal configuration property to in-memory (which uses the InMemoryCatalog external catalog instead).
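In an application you build yourself, the same property can be set on the builder before the first session is created. The following is a minimal sketch, assuming a local master and an illustrative application name. The property is static, so it only takes effect when the session is first created; in spark-shell it would be passed with --conf on the command line instead.

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: the local master and the app name are illustrative only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("in-memory-catalog-sketch")
  // Static property: has to be set before the first SparkSession exists
  .config("spark.sql.catalogImplementation", "in-memory")
  .getOrCreate()

// Read the effective value back from the runtime configuration
val catalogImpl = spark.conf.get("spark.sql.catalogImplementation")
println(catalogImpl)
```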

You can have as many SparkSessions as you want in a single Spark application. The common use case is to keep relational entities separate logically in catalogs per SparkSession.

In the end, you stop a SparkSession using SparkSession.stop method.

Table 1. SparkSession API (Object and Instance Methods)
Method Description

active

(New in 2.4.0)

builder

Object method to create a Builder to get the current SparkSession instance or create a new one.

catalog

Access to the current metadata catalog of relational entities, e.g. database(s), tables, functions, table columns, and temporary views.

clearActiveSession

Object method

clearDefaultSession

Object method

close

conf

Access to the current runtime configuration

createDataFrame

createDataset

emptyDataFrame

emptyDataset

experimental

Access to the current ExperimentalMethods

getActiveSession

Object method

getDefaultSession

Object method

implicits

listenerManager

Access to the current ExecutionListenerManager

newSession

Creates a new SparkSession

range

Creates a Dataset[java.lang.Long]

read

Access to the current DataFrameReader to load data from external data sources

sessionState

Access to the current SessionState

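Internally, sessionState clones the optional parent SessionState (if given when creating the SparkSession) or creates a new SessionState using a BaseSessionStateBuilder as defined by the spark.sql.catalogImplementation configuration property: in-memory (default) for org.apache.spark.sql.internal.SessionStateBuilder, or hive for org.apache.spark.sql.hive.HiveSessionStateBuilder.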

setActiveSession

Object method

setDefaultSession

Object method

sharedState

Access to the current SharedState

sparkContext

Access to the underlying SparkContext

sql

“Executes” a SQL query

sqlContext

Access to the underlying SQLContext

stop

Stops the associated SparkContext

table

Loads data from a table

time

Executes a code block and prints out (to standard output) the time taken to execute it

udf

Access to the current UDFRegistration

version

Returns the version of Apache Spark

Note
baseRelationToDataFrame is the mechanism that SparkSession uses to plug the BaseRelation object hierarchy into the LogicalPlan object hierarchy.

Creating SparkSession Using Builder Pattern — builder Object Method

builder creates a new Builder that you use to build a fully-configured SparkSession using a fluent API.

Tip
Read about Fluent interface design pattern in Wikipedia, the free encyclopedia.
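As a rough illustration of the fluent API, the sketch below chains a few Builder calls and then asks for a session a second time. The master URL, application name and the chosen configuration value are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: local master, illustrative app name and config value
val spark = SparkSession.builder()
  .appName("builder-sketch")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

// A second getOrCreate() hands back the existing session
val same = SparkSession.builder().getOrCreate()
println(same eq spark)
println(spark.conf.get("spark.sql.shuffle.partitions"))
```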

Accessing Version of Spark — version Method

version returns the version of Apache Spark in use.

Internally, version uses the spark.SPARK_VERSION value, which is the version property in the spark-version-info.properties file on the CLASSPATH.
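A short sketch of the relationship described above, assuming a local session created just for the example:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("version-sketch").getOrCreate()

// Both expressions are expected to report the same version string
val fromSession = spark.version
val fromPackage = org.apache.spark.SPARK_VERSION
println(s"$fromSession / $fromPackage")
```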

Creating Empty Dataset (Given Encoder) — emptyDataset Operator

emptyDataset creates an empty Dataset (assuming that future records are of type T).

emptyDataset creates a LocalRelation logical query plan.
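For instance, an empty Dataset of strings could be created as follows. This is a sketch that assumes a local session; the encoder for String comes from spark.implicits.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("emptyDataset-sketch").getOrCreate()
import spark.implicits._ // brings the implicit Encoder[String] into scope

val empty = spark.emptyDataset[String]
println(empty.count())
empty.printSchema() // the schema is derived from the encoder, not from data
```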

Creating Dataset from Local Collections or RDDs — createDataset methods

createDataset is an experimental API to create a Dataset from a local Scala collection, i.e. Seq[T], Java’s List[T], or a distributed RDD[T].

createDataset creates a LocalRelation (for the input data collection) or LogicalRDD (for the input RDD[T]) logical operators.

Tip

You may want to consider implicits object and toDS method instead.

Internally, createDataset first looks up the implicit expression encoder in scope to access the AttributeReferences (of the schema).

Note
Only unresolved expression encoders are currently supported.

The expression encoder is then used to map elements (of the input Seq[T]) into a collection of InternalRows. With the references and rows, createDataset returns a Dataset with a LocalRelation logical query plan.
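The variants above can be sketched as follows, assuming a local session and small illustrative inputs. The last line shows the toDS alternative mentioned in the tip.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("createDataset-sketch").getOrCreate()
import spark.implicits._ // implicit encoders for Int and String

// From a local Scala collection
val fromSeq = spark.createDataset(Seq(1, 2, 3))

// From a distributed RDD[T]
val fromRdd = spark.createDataset(spark.sparkContext.parallelize(Seq("a", "b")))

// The toDS shortcut from the implicits object
val viaToDS = Seq(1, 2, 3).toDS()

fromSeq.show()
fromRdd.show()
```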

Creating Dataset With Single Long Column — range Operator

The range family of methods creates a Dataset of Long numbers.

Note
The first three variants (that do not specify numPartitions explicitly) use SparkContext.defaultParallelism for the number of partitions numPartitions.

Internally, range creates a new Dataset[Long] with Range logical plan and Encoders.LONG encoder.
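As an illustration, the variant that takes all four arguments might be used like this. The sketch assumes a local session; the start, end, step and partition values are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("range-sketch").getOrCreate()

// start (inclusive), end (exclusive), step, numPartitions
val ids = spark.range(0, 10, 2, 3)

// Elements are java.lang.Long, so unbox them before comparing
val values = ids.collect().map(_.longValue).toSeq
val partitions = ids.rdd.getNumPartitions
println(values)
println(partitions)
```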

Creating Empty DataFrame —  emptyDataFrame method

emptyDataFrame creates an empty DataFrame (with no rows and no columns).

It calls createDataFrame with an empty RDD[Row] and an empty schema StructType(Nil).
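A quick way to see this, assuming a local session:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("emptyDataFrame-sketch").getOrCreate()

val df = spark.emptyDataFrame
println(df.count())        // number of rows
println(df.columns.length) // number of columns
```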

Creating DataFrames from Local Collections or RDDs — createDataFrame Method

createDataFrame creates a DataFrame using RDD[Row] and the input schema. It is assumed that the rows in rowRDD all match the schema.
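A minimal sketch of this variant, assuming a local session and a made-up two-column schema. The rows are written by hand so that they match the schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("createDataFrame-sketch").getOrCreate()

// Illustrative schema; the column names are not from the original text
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Every Row has to line up with the schema (same arity and value types)
val rowRDD = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))

val df = spark.createDataFrame(rowRDD, schema)
df.printSchema()
df.show()
```

Rows that do not match the schema are not rejected when the DataFrame is created; problems would only surface once the data is actually processed.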

Caution
FIXME

Executing SQL Queries (aka SQL Mode) — sql Method

sql executes the sqlText SQL statement and creates a DataFrame.

Note

sql is imported in spark-shell so you can execute SQL statements as if sql were a part of the environment.

Internally, sql requests the current ParserInterface to execute a SQL query that gives a LogicalPlan.

Note
sql uses SessionState to access the current ParserInterface.

sql then creates a DataFrame using the current SparkSession (itself) and the LogicalPlan.

Tip

spark-sql is the main SQL environment in Spark to work with pure SQL statements (where you do not have to use Scala to execute them).
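For example, a query over a temporary view could look like the sketch below. It assumes a local session; the view name nums is invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("sql-sketch").getOrCreate()

// Register something to query (the view name is illustrative)
spark.range(5).createOrReplaceTempView("nums")

// sql returns a DataFrame; nothing is computed until an action runs
val evens = spark.sql("SELECT id FROM nums WHERE id % 2 = 0 ORDER BY id")
val result = evens.collect().map(_.getLong(0)).toSeq
println(result)
```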

Accessing UDFRegistration — udf Attribute

udf attribute gives access to UDFRegistration that allows registering user-defined functions for SQL-based queries.

Internally, it is simply an alias for SessionState.udfRegistration.
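A sketch of registering a function and calling it from SQL, assuming a local session. The function name shout is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()

// Register a Scala function under a SQL-visible name (name is illustrative)
spark.udf.register("shout", (s: String) => s.toUpperCase + "!")

val shouted = spark.sql("SELECT shout('hi') AS v").head().getString(0)
println(shouted)
```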

Loading Data From Table — table Method

table creates a DataFrame (wrapper) from the input tableName table (but only if available in the session catalog).

The variant that takes the table name as a String parses tableName to a TableIdentifier and calls the other table variant.
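The behaviour can be sketched like this, assuming a local session. The view name nums and the missing table name are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("table-sketch").getOrCreate()

// Temporary views are part of the session catalog, so table can load them
spark.range(3).createOrReplaceTempView("nums")
val nums = spark.table("nums")
println(nums.count())

// A name that is not in the session catalog fails instead of returning a DataFrame
val missing = Try(spark.table("no_such_table"))
println(missing.isFailure)
```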

Accessing Metastore — catalog Attribute

catalog attribute is a (lazy) interface to the current metastore, i.e. data catalog (of relational entities like databases, tables, functions, table columns, and views).

Tip
All methods in Catalog return Datasets.

Internally, catalog creates a CatalogImpl (that uses the current SparkSession).
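As a small illustration of the Catalog interface, the sketch below lists the tables known to the session. It assumes a local session and an illustrative temporary view.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("catalog-sketch").getOrCreate()

spark.range(3).createOrReplaceTempView("nums") // illustrative view name

println(spark.catalog.currentDatabase)

// listTables returns a Dataset, so the usual operators apply
val tables = spark.catalog.listTables()
tables.show()

val tempNames = tables.collect().filter(_.isTemporary).map(_.name).toSeq
println(tempNames)
```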

Accessing DataFrameReader — read method

read method returns a DataFrameReader that is used to read data from external storage systems and load it into a DataFrame.
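A self-contained sketch of read, assuming a local session. It first writes a tiny CSV file to a temporary directory (the file name and contents are made up) and then loads it.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("read-sketch").getOrCreate()

// Create a small input file so the example does not depend on external data
val dir = Files.createTempDirectory("read-sketch")
val path = dir.resolve("people.csv")
Files.write(path, "name,age\nAnn,31\nBob,27\n".getBytes(StandardCharsets.UTF_8))

val people = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let Spark guess the column types
  .csv(path.toUri.toString)

people.printSchema()
people.show()
```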

Getting Runtime Configuration — conf Attribute

conf returns the current RuntimeConfig.

Internally, conf creates a RuntimeConfig (when requested the very first time and cached afterwards) with the SQLConf of the SessionState.
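A brief sketch of working with the runtime configuration, assuming a local session. The property value and the unknown key are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("conf-sketch").getOrCreate()

// Change a runtime SQL property and read it back
spark.conf.set("spark.sql.shuffle.partitions", "8")
val partitions = spark.conf.get("spark.sql.shuffle.partitions")
println(partitions)

// getOption avoids an exception for keys that are not set (hypothetical key)
val unknown = spark.conf.getOption("no.such.key")
println(unknown)
```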

readStream method

readStream returns a new DataStreamReader.

streams Attribute

streams attribute gives access to StreamingQueryManager (through SessionState).
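The readStream method and the streams attribute can be tried together as in the sketch below. It assumes a local session and uses the built-in rate source purely for illustration; no streaming query is started.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("streaming-sketch").getOrCreate()

// A streaming DataFrame over the built-in rate source
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1")
  .load()

println(rates.isStreaming)
rates.printSchema()

// StreamingQueryManager tracks the queries started in this session
println(spark.streams.active.length)
```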

experimentalMethods Attribute

experimentalMethods is an extension point with ExperimentalMethods that is a per-session collection of extra strategies and Rule[LogicalPlan]s.

Note
experimental is used in SparkPlanner and SparkOptimizer. Hive and Structured Streaming use it for their own extra strategies and optimization rules.

Creating SparkSession Instance — newSession method

newSession creates (starts) a new SparkSession (with the current SparkContext and SharedState).
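The sketch below illustrates what is shared and what is not, assuming a local session. The temporary view name is invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("newSession-sketch").getOrCreate()

val other = spark.newSession()

// Temporary views are registered per session
spark.range(3).createOrReplaceTempView("nums")
val inFirst = spark.catalog.tableExists("nums")
val inOther = other.catalog.tableExists("nums")
println(s"first: $inFirst, other: $inOther")

// The SparkContext is shared between the sessions
val sharedContext = other.sparkContext eq spark.sparkContext
println(sharedContext)
```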

Stopping SparkSession — stop Method

stop stops the SparkSession, i.e. stops the underlying SparkContext.

Create DataFrame from BaseRelation — baseRelationToDataFrame Method

Internally, baseRelationToDataFrame creates a DataFrame from the input BaseRelation wrapped inside LogicalRelation.

Note
LogicalRelation is a logical plan adapter for BaseRelation (so BaseRelation can be part of a logical plan).
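To make this concrete, here is a sketch with a hand-written relation. The TwoRows class is hypothetical and exists only for this example; it assumes a local session and returns two fixed rows.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumption: a throwaway local session
val spark = SparkSession.builder().master("local[*]").appName("baseRelation-sketch").getOrCreate()

// Hypothetical relation that scans two hard-coded rows
class TwoRows(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("id", IntegerType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2)))
}

// Turn the relation into a DataFrame that can be queried like any other
val df = spark.baseRelationToDataFrame(new TwoRows(spark.sqlContext))
val ids = df.collect().map(_.getInt(0)).sorted.toSeq
println(ids)
```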
Note

baseRelationToDataFrame is used when:

Creating SessionState Instance — instantiateSessionState Internal Method

instantiateSessionState finds the className that is then used to create and build a BaseSessionStateBuilder.

instantiateSessionState may report an IllegalArgumentException while instantiating the class of a SessionState.

Note
instantiateSessionState is used exclusively when SparkSession is requested for SessionState per spark.sql.catalogImplementation configuration property (and one is not available yet).

sessionStateClassName Internal Method

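sessionStateClassName gives the name of the class of the SessionState per spark.sql.catalogImplementation, i.e. org.apache.spark.sql.hive.HiveSessionStateBuilder for hive and org.apache.spark.sql.internal.SessionStateBuilder for in-memory.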

Note
sessionStateClassName is used exclusively when SparkSession is requested for the SessionState (and one is not available yet).

Creating DataFrame From RDD Of Internal Binary Rows and Schema — internalCreateDataFrame Internal Method

internalCreateDataFrame creates a DataFrame with a LogicalRDD.

Note

internalCreateDataFrame is used when:

Creating SparkSession Instance

SparkSession takes the following when created:

clearActiveSession Object Method

clearActiveSession…​FIXME

clearDefaultSession Object Method

clearDefaultSession…​FIXME

Accessing ExperimentalMethods — experimental Method

experimental…​FIXME

getActiveSession Object Method

getActiveSession…​FIXME

getDefaultSession Object Method

getDefaultSession…​FIXME

Accessing ExecutionListenerManager — listenerManager Method

listenerManager…​FIXME

Accessing SessionState — sessionState Lazy Attribute

sessionState…​FIXME

setActiveSession Object Method

setActiveSession…​FIXME

setDefaultSession Object Method

setDefaultSession…​FIXME

Accessing SharedState — sharedState Method

sharedState…​FIXME

Measuring Duration of Executing Code Block — time Method

time…​FIXME
