spark-sql-spark技术分享-第6页

SQLExecution Helper Object

2013-06-11admin阅读(3579)

SQLExecution Helper Object

SQLExecution defines spark.sql.execution.id Spark property that is used to track multiple Spark jobs that should all together constitute a single structured query execution (that could be easily reported as a single execution unit).



import org.apache.spark.sql.execution.SQLExecution
scala> println(SQLExecution.EXECUTION_ID_KEY)
spark.sql.execution.id

1

2

3

4

5

6

7

import org.apache.spark.sql.execution.SQLExecution

scala> println(SQLExecution.EXECUTION_ID_KEY)

spark.sql.execution.id

Structured query actions are executed using SQLExecution.withNewExecutionId static method that sets spark.sql.execution.id as Spark Core’s local property and “stitches” different Spark jobs as parts of one structured query action (that you can then see in web UI’s SQL tab).

Tip

Use SparkListener to listen to SparkListenerSQLExecutionStart events and know the execution ids of structured queries that have been executed in a Spark SQL application.



// "SQLAppStatusListener" idea is borrowed from
// Spark SQL's org.apache.spark.sql.execution.ui.SQLAppStatusListener
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.{SparkListenerDriverAccumUpdates, SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}
public class SQLAppStatusListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionStart => onExecutionStart(e)
    case e: SparkListenerSQLExecutionEnd => onExecutionEnd(e)
    case e: SparkListenerDriverAccumUpdates => onDriverAccumUpdates(e)
    case _ => // Ignore
  }
  def onExecutionStart(event: SparkListenerSQLExecutionStart): Unit = {
    // Find the QueryExecution for the Dataset action that triggered the event
    // This is the SQL-specific way
    import org.apache.spark.sql.execution.SQLExecution
    queryExecution = SQLExecution.getQueryExecution(event.executionId)
  }
  def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Find the QueryExecution for the Dataset action that triggered the event
    // This is a general Spark Core way using local properties
    import org.apache.spark.sql.execution.SQLExecution
    val executionIdStr = jobStart.properties.getProperty(SQLExecution.EXECUTION_ID_KEY)
    // Note that the Spark job may or may not be a part of a structured query
    if (executionIdStr != null) {
      queryExecution = SQLExecution.getQueryExecution(executionIdStr.toLong)
    }
  }
  def onExecutionEnd(event: SparkListenerSQLExecutionEnd): Unit = {}
  def onDriverAccumUpdates(event: SparkListenerDriverAccumUpdates): Unit = {}
}

val sqlListener = new SQLAppStatusListener()
spark.sparkContext.addSparkListener(sqlListener)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

// "SQLAppStatusListener" idea is borrowed from

// Spark SQL's org.apache.spark.sql.execution.ui.SQLAppStatusListener

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

import org.apache.spark.sql.execution.ui.{SparkListenerDriverAccumUpdates, SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}

public class SQLAppStatusListener extends SparkListener {

override def onOtherEvent(event: SparkListenerEvent): Unit = event match {

case e: SparkListenerSQLExecutionStart => onExecutionStart(e)

case e: SparkListenerSQLExecutionEnd => onExecutionEnd(e)

case e: SparkListenerDriverAccumUpdates => onDriverAccumUpdates(e)

case _ => // Ignore

}

def onExecutionStart(event: SparkListenerSQLExecutionStart): Unit = {

// Find the QueryExecution for the Dataset action that triggered the event

// This is the SQL-specific way

import org.apache.spark.sql.execution.SQLExecution

queryExecution = SQLExecution.getQueryExecution(event.executionId)

}

def onJobStart(jobStart: SparkListenerJobStart): Unit = {

// Find the QueryExecution for the Dataset action that triggered the event

// This is a general Spark Core way using local properties

import org.apache.spark.sql.execution.SQLExecution

val executionIdStr = jobStart.properties.getProperty(SQLExecution.EXECUTION_ID_KEY)

// Note that the Spark job may or may not be a part of a structured query

if (executionIdStr != null) {

queryExecution = SQLExecution.getQueryExecution(executionIdStr.toLong)

}

def onExecutionEnd(event: SparkListenerSQLExecutionEnd): Unit = {}

def onDriverAccumUpdates(event: SparkListenerDriverAccumUpdates): Unit = {}

}

val sqlListener = new SQLAppStatusListener()

spark.sparkContext.addSparkListener(sqlListener)

Note	Jobs without spark.sql.execution.id key are not considered to belong to SQL query executions.

SQLExecution keeps track of all execution ids and their QueryExecutions in executionIdToQueryExecution internal registry.

Tip	Use SQLExecution.getQueryExecution to find the QueryExecution for an execution id.

Executing Dataset Action (with Zero or More Spark Jobs) Under New Execution Id — `withNewExecutionId` Method



withNewExecutionId[T](
  sparkSession: SparkSession,
  queryExecution: QueryExecution)(body: => T): T

1

2

3

4

5

6

7

withNewExecutionId[T](

sparkSession: SparkSession,

queryExecution: QueryExecution)(body: => T): T

withNewExecutionId executes body query action with a new execution id (given as the input executionId or auto-generated) so that all Spark jobs that have been scheduled by the query action could be marked as parts of the same Dataset action execution.

withNewExecutionId allows for collecting all the Spark jobs (even executed on separate threads) together under a single SQL query execution for reporting purposes, e.g. to reporting them as one single structured query in web UI.

Note	If there is another execution id already set, it is replaced for the course of the current action.

In addition, the QueryExecution variant posts SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events (to LiveListenerBus event bus) before and after executing the body action, respectively. It is used to inform SQLListener when a SQL query execution starts and ends.

Note	Nested execution ids are not supported in the `QueryExecution` variant.

Note

withNewExecutionId is used when:

Dataset is requested to Dataset.withNewExecutionId
Dataset is requested to withAction
DataFrameWriter is requested to run a command
Spark Structured Streaming’s StreamExecution commits a batch to a streaming sink
Spark Thrift Server’s SparkSQLDriver runs a command

Finding QueryExecution for Execution ID — `getQueryExecution` Method



getQueryExecution(executionId: Long): QueryExecution

1

2

3

4

5

getQueryExecution(executionId: Long): QueryExecution

getQueryExecution gives the QueryExecution for the executionId or null if not found.

Executing Action (with Zero or More Spark Jobs) Tracked Under Given Execution Id — `withExecutionId` Method



withExecutionId[T](
  sc: SparkContext,
  executionId: String)(body: => T): T

1

2

3

4

5

6

7

withExecutionId[T](

sc: SparkContext,

executionId: String)(body: => T): T

withExecutionId executes the body action as part of executing multiple Spark jobs under executionId execution identifier.



def body = println("Hello World")
scala> SQLExecution.withExecutionId(sc = spark.sparkContext, executionId = "Custom Name")(body)
Hello World

1

2

3

4

5

6

7

def body = println("Hello World")

scala> SQLExecution.withExecutionId(sc = spark.sparkContext, executionId = "Custom Name")(body)

Hello World

Note	`withExecutionId` is used when: `BroadcastExchangeExec` is requested to prepare for execution (and initializes relationFuture for the first time) `SubqueryExec` is requested to prepare for execution (and initializes relationFuture for the first time)

SparkSQLEnv

2013-06-10admin阅读(1071)

SparkSQLEnv

Caution

FIXME

Thrift JDBC/ODBC Server — Spark Thrift Server (STS)

2013-06-09admin阅读(2075)

Thrift JDBC/ODBC Server — Spark Thrift Server (STS)

Thrift JDBC/ODBC Server (aka Spark Thrift Server or STS) is Spark SQL’s port of Apache Hive’s HiveServer2 that allows JDBC/ODBC clients to execute SQL queries over JDBC and ODBC protocols on Apache Spark.

With Spark Thrift Server, business users can work with their shiny Business Intelligence (BI) tools, e.g. Tableau or Microsoft Excel, and connect to Apache Spark using the ODBC interface. That brings the in-memory distributed capabilities of Spark SQL’s query engine (with all the Catalyst query optimizations you surely like very much) to environments that were initially “disconnected”.

Beside, SQL queries in Spark Thrift Server share the same SparkContext that helps further improve performance of SQL queries using the same data sources.

Spark Thrift Server is a Spark standalone application that you start using start-thriftserver.sh and stop using stop-thriftserver.sh shell scripts.

Spark Thrift Server has its own tab in web UI — JDBC/ODBC Server available at /sqlserver URL.

Figure 1. Spark Thrift Server’s web UI

Spark Thrift Server can work in HTTP or binary transport modes.

Use beeline command-line tool or SQuirreL SQL Client or Spark SQL’s DataSource API to connect to Spark Thrift Server through the JDBC interface.

Spark Thrift Server extends spark-submit‘s command-line options with --hiveconf [prop=value].

Important

You have to enable hive-thriftserver build profile to include Spark Thrift Server in your build.



./build/mvn -Phadoop-2.7,yarn,mesos,hive,hive-thriftserver -DskipTests clean install

1

2

3

4

5

./build/mvn -Phadoop-2.7,yarn,mesos,hive,hive-thriftserver -DskipTests clean install

Refer to Building Apache Spark from Sources.

Tip

Enable INFO or DEBUG logging levels for org.apache.spark.sql.hive.thriftserver and org.apache.hive.service.server loggers to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.sql.hive.thriftserver=DEBUG
log4j.logger.org.apache.hive.service.server=INFO

1

2

3

4

5

6

log4j.logger.org.apache.spark.sql.hive.thriftserver=DEBUG

log4j.logger.org.apache.hive.service.server=INFO

Refer to Logging.

Starting Thrift JDBC/ODBC Server — `start-thriftserver.sh`

You can start Thrift JDBC/ODBC Server using ./sbin/start-thriftserver.sh shell script.

With INFO logging level enabled, when you execute the script you should see the following INFO messages in the logs:



INFO HiveThriftServer2: Started daemon with process name: 16633@japila.local
INFO HiveThriftServer2: Starting SparkContext
...
INFO HiveThriftServer2: HiveThriftServer2 started

1

2

3

4

5

6

7

8

INFO HiveThriftServer2: Started daemon with process name: 16633@japila.local

INFO HiveThriftServer2: Starting SparkContext

...

INFO HiveThriftServer2: HiveThriftServer2 started

Internally, start-thriftserver.sh script submits org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 standalone application for execution (using spark-submit).



$ ./bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

1

2

3

4

5

$ ./bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

Tip	Using the more explicit approach with `spark-submit` to start Spark Thrift Server could be easier to trace execution by seeing the logs printed out to the standard output and hence terminal directly.

Using Beeline JDBC Client to Connect to Spark Thrift Server

beeline is a command-line tool that allows you to access Spark Thrift Server using the JDBC interface on command line. It is included in the Spark distribution in bin directory.



$ ./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline>

1

2

3

4

5

6

7

$ ./bin/beeline

Beeline version 1.2.1.spark2 by Apache Hive

beeline>

You can connect to Spark Thrift Server using connect command as follows:



beeline> !connect jdbc:hive2://localhost:10000

1

2

3

4

5

beeline> !connect jdbc:hive2://localhost:10000

When connecting in non-secure mode, simply enter the username on your machine and a blank password.



beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: jacek
Enter password for jdbc:hive2://localhost:10000: [press ENTER]
Connected to: Spark SQL (version 2.3.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>

1

2

3

4

5

6

7

8

9

10

11

12

beeline> !connect jdbc:hive2://localhost:10000

Connecting to jdbc:hive2://localhost:10000

Enter username for jdbc:hive2://localhost:10000: jacek

Enter password for jdbc:hive2://localhost:10000: [press ENTER]

Connected to: Spark SQL (version 2.3.0)

Driver: Hive JDBC (version 1.2.1.spark2)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://localhost:10000>

Once connected, you can send SQL queries (as if Spark SQL were a JDBC-compliant database).



0: jdbc:hive2://localhost:10000> show databases;
+---------------+--+
| databaseName  |
+---------------+--+
| default       |
+---------------+--+
1 row selected (0.074 seconds)

1

2

3

4

5

6

7

8

9

10

11

0: jdbc:hive2://localhost:10000> show databases;

+---------------+--+

| databaseName |

+---------------+--+

| default |

+---------------+--+

1 row selected (0.074 seconds)

Connecting to Spark Thrift Server using SQuirreL SQL Client 3.7.1

Spark Thrift Server allows for remote access to Spark SQL using JDBC protocol.

Note	This section was tested with SQuirreL SQL Client 3.7.1 (`squirrelsql-3.7.1-standard.zip`) on Mac OS X.

SQuirreL SQL Client is a Java SQL client for JDBC-compliant databases.

Run the client using java -jar squirrel-sql.jar.

Figure 2. SQuirreL SQL Client

You first have to configure a JDBC driver for Spark Thrift Server. Spark Thrift Server uses org.spark-project.hive:hive-jdbc:1.2.1.spark2 dependency that is the JDBC driver (that also downloads transitive dependencies).

Tip	The Hive JDBC Driver, i.e. `hive-jdbc-1.2.1.spark2.jar` and other jar files are in `jars` directory of the Apache Spark distribution (or `assembly/target/scala-2.11/jars` for local builds).

Table 1. SQuirreL SQL Client’s Connection Parameters
Parameter	Description
Name	Spark Thrift Server
Example URL	`jdbc:hive2://localhost:10000`
Extra Class Path	All the jar files of your Spark distribution
Class Name	`org.apache.hive.jdbc.HiveDriver`

spark thriftserver squirrel adddriver.png

Figure 3. Adding Hive JDBC Driver in SQuirreL SQL Client

With the Hive JDBC Driver defined, you can connect to Spark SQL Thrift Server.

spark thriftserver squirrel addalias.png

Figure 4. Adding Hive JDBC Driver in SQuirreL SQL Client

Since you did not specify the database to use, Spark SQL’s default is used.

spark thriftserver squirrel metadata.png

Figure 5. SQuirreL SQL Client Connected to Spark Thrift Server (Metadata Tab)

Below is show tables SQL query in SQuirrel SQL Client executed in Spark SQL through Spark Thrift Server.

spark thriftserver squirrel show tables.png

Figure 6. show tables SQL Query in SQuirrel SQL Client using Spark Thrift Server

Using Spark SQL’s DataSource API to Connect to Spark Thrift Server

What might seem a quite artificial setup at first is accessing Spark Thrift Server using Spark SQL’s DataSource API, i.e. DataFrameReader‘s jdbc method.

Tip

When executed in local mode, Spark Thrift Server and spark-shell will try to access the same Hive Warehouse’s directory that will inevitably lead to an error.

Use spark.sql.warehouse.dir to point to another directory for spark-shell.



./bin/spark-shell --conf spark.sql.warehouse.dir=/tmp/spark-warehouse

1

2

3

4

5

./bin/spark-shell --conf spark.sql.warehouse.dir=/tmp/spark-warehouse

You should also not share the same home directory between them since metastore_db becomes an issue.



// Inside spark-shell
// Paste in :paste mode
val df = spark
  .read
  .option("url", "jdbc:hive2://localhost:10000") (1)
  .option("dbtable", "people") (2)
  .format("jdbc")
  .load

1

2

3

4

5

6

7

8

9

10

11

12

// Inside spark-shell

// Paste in :paste mode

val df = spark

.read

.option("url", "jdbc:hive2://localhost:10000") (1)

.option("dbtable", "people") (2)

.format("jdbc")

.load

Connect to Spark Thrift Server at localhost on port 10000
Use people table. It assumes that people table is available.

`ThriftServerTab` — web UI’s Tab for Spark Thrift Server

ThriftServerTab is…FIXME

Caution

FIXME Elaborate

Stopping Thrift JDBC/ODBC Server — `stop-thriftserver.sh`

You can stop a running instance of Thrift JDBC/ODBC Server using ./sbin/stop-thriftserver.sh shell script.

With DEBUG logging level enabled, you should see the following messages in the logs:



ERROR HiveThriftServer2: RECEIVED SIGNAL TERM
DEBUG SparkSQLEnv: Shutting down Spark SQL Environment
INFO HiveServer2: Shutting down HiveServer2
INFO BlockManager: BlockManager stopped
INFO SparkContext: Successfully stopped SparkContext

1

2

3

4

5

6

7

8

9

ERROR HiveThriftServer2: RECEIVED SIGNAL TERM

DEBUG SparkSQLEnv: Shutting down Spark SQL Environment

INFO HiveServer2: Shutting down HiveServer2

INFO BlockManager: BlockManager stopped

INFO SparkContext: Successfully stopped SparkContext

Tip	You can also send `SIGTERM` signal to the process of Thrift JDBC/ODBC Server, i.e. `kill [PID]` that triggers the same sequence of shutdown steps as `stop-thriftserver.sh`.

Transport Mode

Spark Thrift Server can be configured to listen in two modes (aka transport modes):

Binary mode — clients should send thrift requests in binary
HTTP mode — clients send thrift requests over HTTP.

You can control the transport modes using
HIVE_SERVER2_TRANSPORT_MODE=http or hive.server2.transport.mode (default: binary). It can be binary (default) or http.

`main` method

Thrift JDBC/ODBC Server is a Spark standalone application that you…

Caution

FIXME

HiveThriftServer2Listener

Caution

FIXME

SparkSqlParser — Default SQL Parser

2013-06-08admin阅读(2134)

SparkSqlParser — Default SQL Parser

SparkSqlParser is the default SQL parser of the SQL statements supported in Spark SQL.

SparkSqlParser supports variable substitution.

SparkSqlParser uses SparkSqlAstBuilder (as AstBuilder).

Note	Spark SQL supports SQL statements as described in SqlBase.g4 ANTLR grammar.

SparkSqlParser is available as sqlParser of a SessionState.



val spark: SparkSession = ...
spark.sessionState.sqlParser

1

2

3

4

5

6

val spark: SparkSession = ...

spark.sessionState.sqlParser

SparkSqlParser is used to translate an expression to the corresponding Column in the following:

expr function
Dataset.selectExpr operator
Dataset.filter operator
Dataset.where operator



scala> expr("token = 'hello'")
16/07/07 18:32:53 INFO SparkSqlParser: Parsing command: token = 'hello'
res0: org.apache.spark.sql.Column = (token = hello)

1

2

3

4

5

6

7

scala> expr("token = 'hello'")

16/07/07 18:32:53 INFO SparkSqlParser: Parsing command: token = 'hello'

res0: org.apache.spark.sql.Column = (token = hello)

SparkSqlParser is used to parse table strings into their corresponding table identifiers in the following:

table methods in DataFrameReader and SparkSession
insertInto and saveAsTable methods of DataFrameWriter
createExternalTable and refreshTable methods of Catalog (and SessionState)

SparkSqlParser is used to translate a SQL text to its corresponding logical operator in SparkSession.sql method.

Tip

Enable INFO logging level for org.apache.spark.sql.execution.SparkSqlParser logger to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.sql.execution.SparkSqlParser=INFO

1

2

3

4

5

log4j.logger.org.apache.spark.sql.execution.SparkSqlParser=INFO

Refer to Logging.

Variable Substitution

Caution

FIXME See SparkSqlParser and substitutor.

SparkSqlAstBuilder

2013-06-07admin阅读(3129)

SparkSqlAstBuilder

SparkSqlAstBuilder is an AstBuilder that converts valid Spark SQL statements into Catalyst expressions, logical plans or table identifiers (using visit callback methods).

Note	Spark SQL uses ANTLR parser generator for parsing structured text.

SparkSqlAstBuilder is created exclusively when SparkSqlParser is created (which is when SparkSession is requested for the lazily-created SessionState).

Figure 1. Creating SparkSqlAstBuilder



scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface

import org.apache.spark.sql.execution.SparkSqlParser
val sqlParser = spark.sessionState.sqlParser.asInstanceOf[SparkSqlParser]

scala> :type sqlParser.astBuilder
org.apache.spark.sql.execution.SparkSqlAstBuilder

1

2

3

4

5

6

7

8

9

10

11

12

scala> :type spark.sessionState.sqlParser

org.apache.spark.sql.catalyst.parser.ParserInterface

import org.apache.spark.sql.execution.SparkSqlParser

val sqlParser = spark.sessionState.sqlParser.asInstanceOf[SparkSqlParser]

scala> :type sqlParser.astBuilder

org.apache.spark.sql.execution.SparkSqlAstBuilder

SparkSqlAstBuilder takes a SQLConf when created.

Note

SparkSqlAstBuilder can also be temporarily created for expr standard function (to create column expressions).



val c = expr("from_json(value, schema)")
scala> :type c
org.apache.spark.sql.Column

scala> :type c.expr
org.apache.spark.sql.catalyst.expressions.Expression

scala> println(c.expr.numberedTreeString)
00 'from_json('value, 'schema)
01 :- 'value
02 +- 'schema

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

val c = expr("from_json(value, schema)")

scala> :type c

org.apache.spark.sql.Column

scala> :type c.expr

org.apache.spark.sql.catalyst.expressions.Expression

scala> println(c.expr.numberedTreeString)

00 'from_json('value, 'schema)

01 :- 'value

02 +- 'schema

Callback Method ANTLR rule / labeled alternative Spark SQL Entity

visitAnalyze

#analyze

AnalyzeColumnCommand logical command for ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification)



val sqlText = "ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1"
val plan = spark.sql(sqlText).queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
val cmd = plan.asInstanceOf[AnalyzeColumnCommand]
scala> println(cmd)
AnalyzeColumnCommand `t1`, [id, p1]

1

2

3

4

5

6

7

8

9

10

val sqlText = "ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1"

val plan = spark.sql(sqlText).queryExecution.logical

import org.apache.spark.sql.execution.command.AnalyzeColumnCommand

val cmd = plan.asInstanceOf[AnalyzeColumnCommand]

scala> println(cmd)

AnalyzeColumnCommand `t1`, [id, p1]

AnalyzePartitionCommand logical command for ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause)



val analyzeTable = "ANALYZE TABLE t1 PARTITION (p1, p2) COMPUTE STATISTICS"
val plan = spark.sql(analyzeTable).queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzePartitionCommand
val cmd = plan.asInstanceOf[AnalyzePartitionCommand]
scala> println(cmd)
AnalyzePartitionCommand `t1`, Map(p1 -> None, p2 -> None), false

1

2

3

4

5

6

7

8

9

10

val analyzeTable = "ANALYZE TABLE t1 PARTITION (p1, p2) COMPUTE STATISTICS"

val plan = spark.sql(analyzeTable).queryExecution.logical

import org.apache.spark.sql.execution.command.AnalyzePartitionCommand

val cmd = plan.asInstanceOf[AnalyzePartitionCommand]

scala> println(cmd)

AnalyzePartitionCommand `t1`, Map(p1 -> None, p2 -> None), false

AnalyzeTableCommand logical command for ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause



val sqlText = "ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN"
val plan = spark.sql(sqlText).queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzeTableCommand
val cmd = plan.asInstanceOf[AnalyzeTableCommand]
scala> println(cmd)
AnalyzeTableCommand `t1`, false

1

2

3

4

5

6

7

8

9

10

val sqlText = "ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN"

val plan = spark.sql(sqlText).queryExecution.logical

import org.apache.spark.sql.execution.command.AnalyzeTableCommand

val cmd = plan.asInstanceOf[AnalyzeTableCommand]

scala> println(cmd)

AnalyzeTableCommand `t1`, false

Note	`visitAnalyze` supports `NOSCAN` identifier only (and reports a `ParseException` if not used). `NOSCAN` is used for `AnalyzePartitionCommand` and `AnalyzeTableCommand` logical commands only.

visitBucketSpec

#bucketSpec

visitCacheTable

#cacheTable

CacheTableCommand logical command for CACHE LAZY? TABLE

(AS? [query])?

visitCreateHiveTable

#createHiveTable

CreateTable

visitCreateTable

#createTable

CreateTable logical operator for CREATE TABLE … AS …
CreateTempViewUsing logical operator for CREATE TEMPORARY VIEW … USING …

visitCreateView

#createView

CreateViewCommand for CREATE VIEW AS SQL statement



CREATE [OR REPLACE] [[GLOBAL] TEMPORARY]
VIEW [IF NOT EXISTS] tableIdentifier
[identifierCommentList] [COMMENT STRING]
[PARTITIONED ON identifierList]
[TBLPROPERTIES tablePropertyList] AS query

1

2

3

4

5

6

7

8

9

CREATE [OR REPLACE] [[GLOBAL] TEMPORARY]

VIEW [IF NOT EXISTS] tableIdentifier

[identifierCommentList] [COMMENT STRING]

[PARTITIONED ON identifierList]

[TBLPROPERTIES tablePropertyList] AS query

visitCreateTempViewUsing

#createTempViewUsing

CreateTempViewUsing for CREATE TEMPORARY VIEW … USING

visitDescribeTable

#describeTable

DescribeColumnCommand logical command for DESCRIBE TABLE with a single column only (i.e. no PARTITION specification).



val sqlCmd = "DESC EXTENDED t1 p1"
val plan = spark.sql(sqlCmd).queryExecution.logical
import org.apache.spark.sql.execution.command.DescribeColumnCommand
val cmd = plan.asInstanceOf[DescribeColumnCommand]
scala> println(cmd)
DescribeColumnCommand `t1`, [p1], true

1

2

3

4

5

6

7

8

9

10

val sqlCmd = "DESC EXTENDED t1 p1"

val plan = spark.sql(sqlCmd).queryExecution.logical

import org.apache.spark.sql.execution.command.DescribeColumnCommand

val cmd = plan.asInstanceOf[DescribeColumnCommand]

scala> println(cmd)

DescribeColumnCommand `t1`, [p1], true

DescribeTableCommand logical command for all other variants of DESCRIBE TABLE (i.e. no column)



val sqlCmd = "DESC t1"
val plan = spark.sql(sqlCmd).queryExecution.logical
import org.apache.spark.sql.execution.command.DescribeTableCommand
val cmd = plan.asInstanceOf[DescribeTableCommand]
scala> println(cmd)
DescribeTableCommand `t1`, false

1

2

3

4

5

6

7

8

9

10

val sqlCmd = "DESC t1"

val plan = spark.sql(sqlCmd).queryExecution.logical

import org.apache.spark.sql.execution.command.DescribeTableCommand

val cmd = plan.asInstanceOf[DescribeTableCommand]

scala> println(cmd)

DescribeTableCommand `t1`, false

visitInsertOverwriteHiveDir

#insertOverwriteHiveDir

visitShowCreateTable

#showCreateTable

ShowCreateTableCommand logical command for SHOW CREATE TABLE SQL statement



SHOW CREATE TABLE tableIdentifier

1

2

3

4

5

SHOW CREATE TABLE tableIdentifier

Table 2. SparkSqlAstBuilder’s Parsing Handlers
Parsing Handler	LogicalPlan Added
`withRepartitionByExpression`

ParserInterface — SQL Parser Contract

2013-06-06admin阅读(4848)

ParserInterface — SQL Parser Contract

ParserInterface is the contract of SQL parsers that can parse Expressions, LogicalPlans, TableIdentifiers, and StructTypes given the textual representation of SQL statements.



package org.apache.spark.sql.catalyst.parser

trait ParserInterface {
  def parseExpression(sqlText: String): Expression
  def parsePlan(sqlText: String): LogicalPlan
  def parseTableIdentifier(sqlText: String): TableIdentifier
  def parseTableSchema(sqlText: String): StructType
}

1

2

3

4

5

6

7

8

9

10

11

12

package org.apache.spark.sql.catalyst.parser

trait ParserInterface {

def parseExpression(sqlText: String): Expression

def parsePlan(sqlText: String): LogicalPlan

def parseTableIdentifier(sqlText: String): TableIdentifier

def parseTableSchema(sqlText: String): StructType

}

Note	AbstractSqlParser is the one and only known extension of the `ParserInterface` Contract in Spark SQL.

ParserInterface is available as sqlParser in SessionState.



scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface

1

2

3

4

5

6

7

8

9

scala> :type spark

org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.sqlParser

org.apache.spark.sql.catalyst.parser.ParserInterface

Table 1. ParserInterface Contract
Method	Description
`parseExpression`	Parses a SQL text to an Expression Used in the following: Dataset transformations: Dataset.selectExpr, Dataset.filter and Dataset.where expr standard function
`parsePlan`	Parses a SQL text to a LogicalPlan Used when: `SparkSession` is requested to execute a SQL statement `SessionCatalog` is requested to find a view in catalogs
`parseTableIdentifier`	Parses a SQL text to a `TableIdentifier` Used when: `DataFrameWriter` is requested to insertInto and saveAsTable Dataset basic actions are used: Dataset.createTempView, Dataset.createOrReplaceTempView, Dataset.createGlobalTempView and Dataset.createOrReplaceGlobalTempView (using Dataset.createTempViewCommand) `SparkSession` is requested to load data from a table `CatalogImpl` is requested to listColumns, getTable, tableExists, createTable, recoverPartitions, refreshTable `SessionState` is requested to refreshTable
`parseTableSchema`	Parses a SQL text to a StructType Used when: `StructType` is requested to create a StructType for a DDL-formatted text `JdbcUtils` is requested to replace data types in table schema and parseUserSpecifiedCreateTableColumnTypes `JsonExprUtils` is requested to `validateSchemaLiteral`

CatalystSqlParser — DataTypes and StructTypes Parser

2013-06-05admin阅读(2682)

CatalystSqlParser — DataTypes and StructTypes Parser

CatalystSqlParser is a AbstractSqlParser with AstBuilder as the required astBuilder.

CatalystSqlParser is used to translate DataTypes from their canonical string representation (e.g. when adding fields to a schema or casting column to a different data type) or StructTypes.



import org.apache.spark.sql.types.StructType
scala> val struct = new StructType().add("a", "int")
struct: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

scala> val asInt = expr("token = 'hello'").cast("int")
asInt: org.apache.spark.sql.Column = CAST((token = hello) AS INT)

1

2

3

4

5

6

7

8

9

10

import org.apache.spark.sql.types.StructType

scala> val struct = new StructType().add("a", "int")

struct: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

scala> val asInt = expr("token = 'hello'").cast("int")

asInt: org.apache.spark.sql.Column = CAST((token = hello) AS INT)

When parsing, you should see INFO messages in the logs:



INFO CatalystSqlParser: Parsing command: int

1

2

3

4

5

INFO CatalystSqlParser: Parsing command: int

It is also used in HiveClientImpl (when converting columns from Hive to Spark) and in OrcFileOperator (when inferring the schema for ORC files).

Tip

Enable INFO logging level for org.apache.spark.sql.catalyst.parser.CatalystSqlParser logger to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.sql.catalyst.parser.CatalystSqlParser=INFO

1

2

3

4

5

log4j.logger.org.apache.spark.sql.catalyst.parser.CatalystSqlParser=INFO

Refer to Logging.

AstBuilder — ANTLR-based SQL Parser

2013-06-04admin阅读(2786)

AstBuilder — ANTLR-based SQL Parser

AstBuilder converts SQL statements into Spark SQL’s relational entities (i.e. data types, Catalyst expressions, logical plans or TableIdentifiers) using visit callback methods.

AstBuilder is the AST builder of AbstractSqlParser (i.e. the base SQL parsing infrastructure in Spark SQL).

Tip

Spark SQL supports SQL statements as described in SqlBase.g4. Using the file can tell you (almost) exactly what Spark SQL supports at any given time.

“Almost” being that although the grammar accepts a SQL statement it can be reported as not allowed by AstBuilder, e.g.



scala> sql("EXPLAIN FORMATTED SELECT * FROM myTable").show
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: EXPLAIN FORMATTED(line 1, pos 0)

== SQL ==
EXPLAIN FORMATTED SELECT * FROM myTable
^^^

  at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitExplain$1.apply(SparkSqlParser.scala:275)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitExplain$1.apply(SparkSqlParser.scala:273)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitExplain(SparkSqlParser.scala:273)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitExplain(SparkSqlParser.scala:53)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExplainContext.accept(SqlBaseParser.java:480)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
  at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)
  at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:65)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:62)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:61)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:90)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:61)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
  ... 48 elided

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

scala> sql("EXPLAIN FORMATTED SELECT * FROM myTable").show

org.apache.spark.sql.catalyst.parser.ParseException:

Operation not allowed: EXPLAIN FORMATTED(line 1, pos 0)

== SQL ==

EXPLAIN FORMATTED SELECT * FROM myTable

^^^

at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)

at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitExplain$1.apply(SparkSqlParser.scala:275)

at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitExplain$1.apply(SparkSqlParser.scala:273)

at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)

at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitExplain(SparkSqlParser.scala:273)

at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitExplain(SparkSqlParser.scala:53)

at org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExplainContext.accept(SqlBaseParser.java:480)

at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)

at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)

at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)

at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:65)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:62)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:61)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:90)

at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:61)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)

... 48 elided

AstBuilder is a ANTLR AbstractParseTreeVisitor (as SqlBaseBaseVisitor) that is generated from SqlBase.g4 ANTLR grammar for Spark SQL.

Note	`SqlBaseBaseVisitor` is a ANTLR-specific base class that is auto-generated at build time from a ANTLR grammar in `SqlBase.g4`. `SqlBaseBaseVisitor` is an ANTLR AbstractParseTreeVisitor.

Callback Method ANTLR rule / labeled alternative Spark SQL Entity

visitAliasedQuery

visitColumnReference

visitDereference

visitExists

#exists labeled alternative

Exists expression

visitExplain

explain rule

ExplainCommand

Note

Can be a OneRowRelation for an EXPLAIN for an unexplainable DescribeTableCommand logical command as created from DESCRIBE TABLE SQL statement.



val q = sql("EXPLAIN DESCRIBE TABLE t")
scala> println(q.queryExecution.logical.numberedTreeString)
scala> println(q.queryExecution.logical.numberedTreeString)
00 ExplainCommand OneRowRelation$, false, false, false

1

2

3

4

5

6

7

8

val q = sql("EXPLAIN DESCRIBE TABLE t")

scala> println(q.queryExecution.logical.numberedTreeString)

00 ExplainCommand OneRowRelation$, false, false, false

visitFirst

#first labeled alternative

First aggregate function expression



FIRST '(' expression (IGNORE NULLS)? ')'

1

2

3

4

5

FIRST '(' expression (IGNORE NULLS)? ')'

visitFromClause

fromClause

LogicalPlan

Supports multiple comma-separated relations (that all together build a condition-less INNER JOIN) with optional LATERAL VIEW.

A relation can be one of the following or a combination thereof:

Table identifier
Inline table using VALUES exprs AS tableIdent
Table-valued function (currently only range is supported)

visitFunctionCall

functionCall labeled alternative

UnresolvedFunction for a bare function (with no window specification)
UnresolvedWindowExpression for a function evaluated in a windowed context with a WindowSpecReference
WindowExpression for a function over a window

Tip	See the function examples below.

visitInlineTable

inlineTable rule

UnresolvedInlineTable unary logical operator (as the child of SubqueryAlias for tableAlias)



VALUES expression (',' expression)* tableAlias

1

2

3

4

5

VALUES expression (',' expression)* tableAlias

expression can be as follows:

CreateNamedStruct expression for multiple-column tables
Any Catalyst expression for one-column tables

tableAlias can be specified explicitly or defaults to colN for every column (starting from 1 for N).

visitInsertIntoTable

#insertIntoTable labeled alternative

InsertIntoTable (indirectly)

A 3-element tuple with a TableIdentifier, optional partition keys and the exists flag disabled



INSERT INTO TABLE? tableIdentifier partitionSpec?

1

2

3

4

5

INSERT INTO TABLE? tableIdentifier partitionSpec?

Note	`insertIntoTable` is part of `insertInto` that is in turn used only as a helper labeled alternative in singleInsertQuery and multiInsertQueryBody rules.

visitInsertOverwriteTable

#insertOverwriteTable labeled alternative

InsertIntoTable (indirectly)

A 3-element tuple with a TableIdentifier, optional partition keys and the exists flag



INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)?

1

2

3

4

5

INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)?

In a way, visitInsertOverwriteTable is simply a more general version of the visitInsertIntoTable with the exists flag on or off per IF NOT EXISTS used or not. The main difference is that dynamic partitions are used with no IF NOT EXISTS.

Note	`insertOverwriteTable` is part of `insertInto` that is in turn used only as a helper labeled alternative in singleInsertQuery and multiInsertQueryBody rules.

visitMultiInsertQuery

multiInsertQueryBody

A logical operator with a InsertIntoTable (and UnresolvedRelation leaf operator)



FROM relation (',' relation)* lateralView*
INSERT OVERWRITE TABLE ...

FROM relation (',' relation)* lateralView*
INSERT INTO TABLE? ...

1

2

3

4

5

6

7

8

9

FROM relation (',' relation)* lateralView*

INSERT OVERWRITE TABLE ...

FROM relation (',' relation)* lateralView*

INSERT INTO TABLE? ...

visitNamedExpression

namedExpression

Alias (for a single alias)
MultiAlias (for a parenthesis enclosed alias list
a bare Expression

visitNamedQuery

SubqueryAlias

visitQuerySpecification

querySpecification

OneRowRelation or LogicalPlan

Note

visitQuerySpecification creates a OneRowRelation for a SELECT without a FROM clause.



val q = sql("select 1")
scala> println(q.queryExecution.logical.numberedTreeString)
00 'Project [unresolvedalias(1, None)]
01 +- OneRowRelation$

1

2

3

4

5

6

7

8

val q = sql("select 1")

scala> println(q.queryExecution.logical.numberedTreeString)

00 'Project [unresolvedalias(1, None)]

01 +- OneRowRelation$

visitPredicated

predicated

Expression

visitRelation

relation

LogicalPlan for a FROM clause.

visitRowConstructor

visitSingleDataType

singleDataType

DataType

visitSingleExpression

singleExpression

Expression

Takes the named expression and relays to visitNamedExpression

visitSingleInsertQuery

#singleInsertQuery labeled alternative

A logical operator with a InsertIntoTable



INSERT INTO TABLE? tableIdentifier partitionSpec? #insertIntoTable

INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)? #insertOverwriteTable

1

2

3

4

5

6

7

INSERT INTO TABLE? tableIdentifier partitionSpec? #insertIntoTable

INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)? #insertOverwriteTable

visitSortItem

sortItem

SortOrder unevaluable unary expression



sortItem
    : expression ordering=(ASC

1

2

3

4

5

6

sortItem

: expression ordering=(ASC

DESC)? (NULLS nullOrder=(LAST

FIRST))?
;

ORDER BY order+=sortItem (‘,’ order+=sortItem)*
SORT BY sort+=sortItem (‘,’ sort+=sortItem)*

(ORDER

SORT) BY sortItem (‘,’ sortItem)*)?
`

visitSingleStatement

singleStatement

LogicalPlan from a single statement

Note	A single statement can be quite involved.

visitSingleTableIdentifier

singleTableIdentifier

TableIdentifier

visitStar

#star labeled alternative

UnresolvedStar

visitStruct

visitSubqueryExpression

#subqueryExpression labeled alternative

ScalarSubquery

visitWindowDef

windowDef labeled alternative

WindowSpecDefinition



'(' CLUSTER BY partition+=expression (',' partition+=expression)*) windowFrame? ')'

'(' ((PARTITION | DISTRIBUTE) BY partition+=expression (',' partition+=expression)*)?
  ((ORDER | SORT) BY sortItem (',' sortItem)*)?)
  windowFrame? ')'

1

2

3

4

5

6

7

8

9

'(' CLUSTER BY partition+=expression (',' partition+=expression)*) windowFrame? ')'

'(' ((PARTITION | DISTRIBUTE) BY partition+=expression (',' partition+=expression)*)?

((ORDER | SORT) BY sortItem (',' sortItem)*)?)

windowFrame? ')'

Parsing Handler LogicalPlan Added

withAggregation

GroupingSets for GROUP BY … GROUPING SETS (…)
Aggregate for GROUP BY … (WITH CUBE | WITH ROLLUP)?

withGenerate

Generate with a UnresolvedGenerator and join flag turned on for LATERAL VIEW (in SELECT or FROM clauses).

withHints

Hint for /*+ hint */ in SELECT queries.

Tip	Note `+` (plus) between `/` and `/`

hint is of the format name or name (param1, param2, …).



/*+ BROADCAST (table) */

1

2

3

4

5

/*+ BROADCAST (table) */

withInsertInto

InsertIntoTable for visitSingleInsertQuery or visitMultiInsertQuery
InsertIntoDir for…FIXME

withJoinRelations

Join for a FROM clause and relation alone.

The following join types are supported:

INNER (default)
CROSS
LEFT (with optional OUTER)
LEFT SEMI
RIGHT (with optional OUTER)
FULL (with optional OUTER)
ANTI (optionally prefixed with LEFT)

The following join criteria are supported:

ON booleanExpression
USING '(' identifier (',' identifier)* ')'

Joins can be NATURAL (with no join criteria).

withQueryResultClauses

withQuerySpecification

Adds a query specification to a logical operator.

For transform SELECT (with TRANSFORM, MAP or REDUCE qualifiers), withQuerySpecification does…FIXME

For regular SELECT (no TRANSFORM, MAP or REDUCE qualifiers), withQuerySpecification adds (in that order):

Generate unary logical operators (if used in the parsed SQL text)
Filter unary logical plan (if used in the parsed SQL text)
GroupingSets or Aggregate unary logical operators (if used in the parsed SQL text)
Project and/or Filter unary logical operators
WithWindowDefinition unary logical operator (if used in the parsed SQL text)
UnresolvedHint unary logical operator (if used in the parsed SQL text)

withPredicate

NOT? IN '(' query ')' gives an In predicate expression with a ListQuery subquery expression
NOT? IN '(' expression (',' expression)* ')' gives an In predicate expression

withWindows

WithWindowDefinition for window aggregates (given WINDOW definitions).

Used for withQueryResultClauses and withQuerySpecification with windows definition.



WINDOW identifier AS windowSpec
  (',' identifier AS windowSpec)*

1

2

3

4

5

6

WINDOW identifier AS windowSpec

(',' identifier AS windowSpec)*

Tip	Consult `windows`, `namedWindow`, `windowSpec`, `windowFrame`, and `frameBound` (with `windowRef` and `windowDef`) ANTLR parsing rules for Spark SQL in SqlBase.g4.

Note	`AstBuilder` belongs to `org.apache.spark.sql.catalyst.parser` package.

Function Examples

The examples are handled by visitFunctionCall.



import spark.sessionState.sqlParser

scala> sqlParser.parseExpression("foo()")
res0: org.apache.spark.sql.catalyst.expressions.Expression = 'foo()

scala> sqlParser.parseExpression("foo() OVER windowSpecRef")
res1: org.apache.spark.sql.catalyst.expressions.Expression = unresolvedwindowexpression('foo(), WindowSpecReference(windowSpecRef))

scala> sqlParser.parseExpression("foo() OVER (CLUSTER BY field)")
res2: org.apache.spark.sql.catalyst.expressions.Expression = 'foo() windowspecdefinition('field, UnspecifiedFrame)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

import spark.sessionState.sqlParser

scala> sqlParser.parseExpression("foo()")

res0: org.apache.spark.sql.catalyst.expressions.Expression = 'foo()

scala> sqlParser.parseExpression("foo() OVER windowSpecRef")

res1: org.apache.spark.sql.catalyst.expressions.Expression = unresolvedwindowexpression('foo(), WindowSpecReference(windowSpecRef))

scala> sqlParser.parseExpression("foo() OVER (CLUSTER BY field)")

res2: org.apache.spark.sql.catalyst.expressions.Expression = 'foo() windowspecdefinition('field, UnspecifiedFrame)

`aliasPlan` Internal Method



aliasPlan(alias: ParserRuleContext, plan: LogicalPlan): LogicalPlan

1

2

3

4

5

aliasPlan(alias: ParserRuleContext, plan: LogicalPlan): LogicalPlan

aliasPlan…FIXME

Note	`aliasPlan` is used when…FIXME

`mayApplyAliasPlan` Internal Method



mayApplyAliasPlan(tableAlias: TableAliasContext, plan: LogicalPlan): LogicalPlan

1

2

3

4

5

mayApplyAliasPlan(tableAlias: TableAliasContext, plan: LogicalPlan): LogicalPlan

mayApplyAliasPlan…FIXME

Note	`mayApplyAliasPlan` is used when…FIXME

AbstractSqlParser — Base SQL Parsing Infrastructure

2013-06-03admin阅读(2572)

AbstractSqlParser — Base SQL Parsing Infrastructure

AbstractSqlParser is the base of ParserInterfaces that use an AstBuilder to parse SQL statements and convert them to Spark SQL entities, i.e. DataType, StructType, Expression, LogicalPlan and TableIdentifier.

AbstractSqlParser is the foundation of the SQL parsing infrastructure.



package org.apache.spark.sql.catalyst.parser

abstract class AbstractSqlParser extends ParserInterface {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def astBuilder: AstBuilder
}

1

2

3

4

5

6

7

8

9

10

11

package org.apache.spark.sql.catalyst.parser

abstract class AbstractSqlParser extends ParserInterface {

// only required properties (vals and methods) that have no implementation

// the others follow

def astBuilder: AstBuilder

}

Table 1. AbstractSqlParser Contract
Method	Description
`astBuilder`	AstBuilder for parsing SQL statements. Used in all the `parse` methods, i.e. parseDataType, parseExpression, parsePlan, parseTableIdentifier, and parseTableSchema.

Name Description

SparkSqlParser

The default SQL parser in SessionState available as sqlParser property.



val spark: SparkSession = ...
spark.sessionState.sqlParser

1

2

3

4

5

6

val spark: SparkSession = ...

spark.sessionState.sqlParser

CatalystSqlParser

Creates a DataType or a StructType (schema) from their canonical string representation.

Setting Up SqlBaseLexer and SqlBaseParser for Parsing — `parse` Method



parse[T](command: String)(toResult: SqlBaseParser => T): T

1

2

3

4

5

parse[T](command: String)(toResult: SqlBaseParser => T): T

parse sets up a proper ANTLR parsing infrastructure with SqlBaseLexer and SqlBaseParser (which are the ANTLR-specific classes of Spark SQL that are auto-generated at build time from the SqlBase.g4 grammar).

Tip	Review the definition of ANTLR grammar for Spark SQL in sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4.

Internally, parse first prints out the following INFO message to the logs:



INFO SparkSqlParser: Parsing command: [command]

1

2

3

4

5

INFO SparkSqlParser: Parsing command: [command]

Tip	Enable `INFO` logging level for the custom `AbstractSqlParser`, i.e. SparkSqlParser or CatalystSqlParser, to see the above INFO message.

parse then creates and sets up a SqlBaseLexer and SqlBaseParser that in turn passes the latter on to the input toResult function where the parsing finally happens.

Note	`parse` uses `SLL` prediction mode for parsing first before falling back to `LL` mode.

In case of parsing errors, parse reports a ParseException.

Note	`parse` is used in all the `parse` methods, i.e. parseDataType, parseExpression, parsePlan, parseTableIdentifier, and parseTableSchema.

`parseDataType` Method



parseDataType(sqlText: String): DataType

1

2

3

4

5

parseDataType(sqlText: String): DataType

Note	`parseDataType` is part of ParserInterface Contract to…FIXME.

parseDataType…FIXME

`parseExpression` Method



parseExpression(sqlText: String): Expression

1

2

3

4

5

parseExpression(sqlText: String): Expression

Note	`parseExpression` is part of ParserInterface Contract to…FIXME.

parseExpression…FIXME

`parseFunctionIdentifier` Method



parseFunctionIdentifier(sqlText: String): FunctionIdentifier

1

2

3

4

5

parseFunctionIdentifier(sqlText: String): FunctionIdentifier

Note	`parseFunctionIdentifier` is part of ParserInterface Contract to…FIXME.

parseFunctionIdentifier…FIXME

`parseTableIdentifier` Method



parseTableIdentifier(sqlText: String): TableIdentifier

1

2

3

4

5

parseTableIdentifier(sqlText: String): TableIdentifier

Note	`parseTableIdentifier` is part of ParserInterface Contract to…FIXME.

parseTableIdentifier…FIXME

`parseTableSchema` Method



parseTableSchema(sqlText: String): StructType

1

2

3

4

5

parseTableSchema(sqlText: String): StructType

Note	`parseTableSchema` is part of ParserInterface Contract to…FIXME.

parseTableSchema…FIXME

`parsePlan` Method



parsePlan(sqlText: String): LogicalPlan

1

2

3

4

5

parsePlan(sqlText: String): LogicalPlan

Note	`parsePlan` is part of ParserInterface Contract to…FIXME.

parsePlan creates a LogicalPlan for a given SQL textual statement.

Internally, parsePlan builds a SqlBaseParser and requests AstBuilder to parse a single SQL statement.

If a SQL statement could not be parsed, parsePlan throws a ParseException:



Unsupported SQL statement

1

2

3

4

5

Unsupported SQL statement

SQL Parsing Framework

2013-06-02admin阅读(1082)

SQL Parsing Framework

SQL Parser Framework in Spark SQL uses ANTLR to translate a SQL text to a data type, Expression, TableIdentifier or LogicalPlan.

The contract of the SQL Parser Framework is described by ParserInterface contract. The contract is then abstracted in AbstractSqlParser class so subclasses have to provide custom AstBuilder only.

There are two concrete implementations of AbstractSqlParser:

SparkSqlParser that is the default parser of the SQL expressions into Spark’s types.
CatalystSqlParser that is used to parse data types from their canonical string representation.

spark-sql 第6页

SQLExecution Helper Object

Executing Dataset Action (with Zero or More Spark Jobs) Under New Execution Id — withNewExecutionId Method

Finding QueryExecution for Execution ID — getQueryExecution Method

Executing Action (with Zero or More Spark Jobs) Tracked Under Given Execution Id — withExecutionId Method

SparkSQLEnv

Thrift JDBC/ODBC Server — Spark Thrift Server (STS)

Starting Thrift JDBC/ODBC Server — start-thriftserver.sh

Using Beeline JDBC Client to Connect to Spark Thrift Server

Connecting to Spark Thrift Server using SQuirreL SQL Client 3.7.1

Using Spark SQL’s DataSource API to Connect to Spark Thrift Server

ThriftServerTab — web UI’s Tab for Spark Thrift Server

Stopping Thrift JDBC/ODBC Server — stop-thriftserver.sh

Transport Mode

main method

HiveThriftServer2Listener

SparkSqlParser — Default SQL Parser

Variable Substitution

SparkSqlAstBuilder

ParserInterface — SQL Parser Contract

CatalystSqlParser — DataTypes and StructTypes Parser

AstBuilder — ANTLR-based SQL Parser

Function Examples

aliasPlan Internal Method

mayApplyAliasPlan Internal Method

AbstractSqlParser — Base SQL Parsing Infrastructure

Setting Up SqlBaseLexer and SqlBaseParser for Parsing — parse Method

parseDataType Method

parseExpression Method

parseFunctionIdentifier Method

parseTableIdentifier Method

parseTableSchema Method

parsePlan Method

SQL Parsing Framework

欢迎关注：spark技术分享

关注公众号：spark技术分享

QQ咨询

回顶部

Executing Dataset Action (with Zero or More Spark Jobs) Under New Execution Id — `withNewExecutionId` Method

Finding QueryExecution for Execution ID — `getQueryExecution` Method

Executing Action (with Zero or More Spark Jobs) Tracked Under Given Execution Id — `withExecutionId` Method

Starting Thrift JDBC/ODBC Server — `start-thriftserver.sh`

`ThriftServerTab` — web UI’s Tab for Spark Thrift Server

Stopping Thrift JDBC/ODBC Server — `stop-thriftserver.sh`

`main` method

`aliasPlan` Internal Method

`mayApplyAliasPlan` Internal Method

Setting Up SqlBaseLexer and SqlBaseParser for Parsing — `parse` Method

`parseDataType` Method

`parseExpression` Method

`parseFunctionIdentifier` Method

`parseTableIdentifier` Method

`parseTableSchema` Method

`parsePlan` Method