
Thrift JDBC/ODBC Server — Spark Thrift Server (STS)

Thrift JDBC/ODBC Server (aka Spark Thrift Server or STS) is Spark SQL’s port of Apache Hive’s HiveServer2 that allows JDBC/ODBC clients to execute SQL queries over JDBC and ODBC protocols on Apache Spark.

With Spark Thrift Server, business users can work with their shiny Business Intelligence (BI) tools, e.g. Tableau or Microsoft Excel, and connect to Apache Spark using the ODBC interface. That brings the in-memory distributed capabilities of Spark SQL’s query engine (with all the Catalyst query optimizations you surely like very much) to environments that were initially “disconnected”.

Besides, SQL queries in Spark Thrift Server share the same SparkContext, which helps further improve the performance of SQL queries over the same data sources.

Spark Thrift Server is a Spark standalone application that you start using start-thriftserver.sh and stop using stop-thriftserver.sh shell scripts.

Spark Thrift Server has its own tab in web UI — JDBC/ODBC Server — available at the /sqlserver URL.

Figure 1. Spark Thrift Server’s web UI

Spark Thrift Server can work in HTTP or binary transport modes.

Use the beeline command-line tool, SQuirreL SQL Client, or Spark SQL’s DataSource API to connect to Spark Thrift Server through the JDBC interface.

Spark Thrift Server extends spark-submit‘s command-line options with --hiveconf [prop=value].
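For example, a sketch (hive.server2.thrift.port is a standard HiveServer2 property that sets the port the server listens on):

```shell
# Start Spark Thrift Server on a non-default port (sketch)
./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001
```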

Important

You have to enable the hive-thriftserver build profile to include Spark Thrift Server in your build.

Tip

Enable INFO or DEBUG logging levels for org.apache.spark.sql.hive.thriftserver and org.apache.hive.service.server loggers to see what happens inside.

Add the following line to conf/log4j.properties:
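For example, the following sketch enables DEBUG for the two loggers mentioned above (assuming the log4j 1.x properties format that Spark 2.x ships with):

```properties
log4j.logger.org.apache.spark.sql.hive.thriftserver=DEBUG
log4j.logger.org.apache.hive.service.server=DEBUG
```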

Refer to Logging.

Starting Thrift JDBC/ODBC Server — start-thriftserver.sh

You can start Thrift JDBC/ODBC Server using ./sbin/start-thriftserver.sh shell script.

With INFO logging level enabled, you should see INFO messages in the logs when you execute the script.

Internally, start-thriftserver.sh script submits org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 standalone application for execution (using spark-submit).

Tip
Starting Spark Thrift Server with the more explicit spark-submit approach makes it easier to trace execution, since the logs are printed out to standard output and hence directly to the terminal.
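A simplified sketch of the equivalent spark-submit invocation (the actual script passes more options; spark-internal is the placeholder resource spark-submit accepts for Spark’s internal classes):

```shell
./bin/spark-submit \
  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
  --name "Thrift JDBC/ODBC Server" \
  spark-internal
```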

Using Beeline JDBC Client to Connect to Spark Thrift Server

beeline is a command-line tool that allows you to access Spark Thrift Server using the JDBC interface on the command line. It is included in the Spark distribution in the bin directory.

You can connect to Spark Thrift Server using beeline’s !connect command as follows:
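A sketch, assuming Spark Thrift Server listens on the default localhost:10000:

```
$ ./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
```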

When connecting in non-secure mode, simply enter the username on your machine and a blank password.

Once connected, you can send SQL queries (as if Spark SQL were a JDBC-compliant database).
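For example, a session could look like this (a sketch; it assumes the connection succeeded):

```
0: jdbc:hive2://localhost:10000> show databases;
```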

Connecting to Spark Thrift Server using SQuirreL SQL Client 3.7.1

Spark Thrift Server allows for remote access to Spark SQL using JDBC protocol.

Note
This section was tested with SQuirreL SQL Client 3.7.1 (squirrelsql-3.7.1-standard.zip) on Mac OS X.

SQuirreL SQL Client is a Java SQL client for JDBC-compliant databases.

Run the client using java -jar squirrel-sql.jar.

Figure 2. SQuirreL SQL Client

You first have to configure a JDBC driver for Spark Thrift Server. Spark Thrift Server uses the org.spark-project.hive:hive-jdbc:1.2.1.spark2 dependency as the JDBC driver (which also pulls in transitive dependencies).

Tip
The Hive JDBC Driver, i.e. hive-jdbc-1.2.1.spark2.jar, and other jar files are in the jars directory of the Apache Spark distribution (or assembly/target/scala-2.11/jars for local builds).
Table 1. SQuirreL SQL Client’s Connection Parameters

  Name: Spark Thrift Server
  Example URL: jdbc:hive2://localhost:10000
  Extra Class Path: All the jar files of your Spark distribution
  Class Name: org.apache.hive.jdbc.HiveDriver

Figure 3. Adding Hive JDBC Driver in SQuirreL SQL Client

With the Hive JDBC Driver defined, you can connect to Spark Thrift Server.

Figure 4. Adding an Alias for Spark Thrift Server in SQuirreL SQL Client

Since you did not specify the database to use, Spark SQL’s default database is used.

Figure 5. SQuirreL SQL Client Connected to Spark Thrift Server (Metadata Tab)

Below is a show tables SQL query in SQuirreL SQL Client executed in Spark SQL through Spark Thrift Server.

Figure 6. show tables SQL Query in SQuirreL SQL Client using Spark Thrift Server

Using Spark SQL’s DataSource API to Connect to Spark Thrift Server

What might at first seem a quite artificial setup is accessing Spark Thrift Server using Spark SQL’s DataSource API, i.e. DataFrameReader’s jdbc method.

Tip

When executed in local mode, Spark Thrift Server and spark-shell will try to access the same Hive Warehouse directory, which will inevitably lead to an error.

Use spark.sql.warehouse.dir to point to another directory for spark-shell.

You should also not share the same home directory between them since metastore_db becomes an issue.
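A minimal sketch in spark-shell (it assumes Spark Thrift Server listens at localhost:10000 and that a people table is available):

```scala
// Inside spark-shell
// (1) the url option points at Spark Thrift Server at localhost on port 10000
// (2) dbtable names the people table to load
val people = spark.read
  .option("url", "jdbc:hive2://localhost:10000")
  .option("dbtable", "people")
  .format("jdbc")
  .load
```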

  1. Connect to Spark Thrift Server at localhost on port 10000

  2. Use the people table. It is assumed that the people table is available.

ThriftServerTab — web UI’s Tab for Spark Thrift Server

ThriftServerTab is…​FIXME

Caution
FIXME Elaborate

Stopping Thrift JDBC/ODBC Server — stop-thriftserver.sh

You can stop a running instance of Thrift JDBC/ODBC Server using ./sbin/stop-thriftserver.sh shell script.

With DEBUG logging level enabled, you should see DEBUG messages in the logs.

Tip
You can also send the SIGTERM signal to the process of Thrift JDBC/ODBC Server, i.e. kill [PID], which triggers the same sequence of shutdown steps as stop-thriftserver.sh.

Transport Mode

Spark Thrift Server can be configured to listen in two modes (aka transport modes):

  1. Binary mode — clients send Thrift requests in binary

  2. HTTP mode — clients send Thrift requests over HTTP

You can control the transport mode using the HIVE_SERVER2_TRANSPORT_MODE environment variable (e.g. HIVE_SERVER2_TRANSPORT_MODE=http) or the hive.server2.transport.mode property. Either can be binary (default) or http.
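For example (a sketch; both forms switch the server to HTTP mode):

```shell
# Via the environment variable
HIVE_SERVER2_TRANSPORT_MODE=http ./sbin/start-thriftserver.sh

# Or via the Hive property
./sbin/start-thriftserver.sh --hiveconf hive.server2.transport.mode=http
```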

main method

Thrift JDBC/ODBC Server is a Spark standalone application that you…​

Caution
FIXME

HiveThriftServer2Listener

Caution
FIXME