Demo: Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI)

The demo shows the steps to develop a custom streaming sink and use it to monitor whether and what SQL queries are executed at runtime (using web UI’s SQL tab).

Note

The main motivation was to answer the question Why does a single structured query run multiple SQL queries per batch?, which turned out to be fairly surprising.

You’re very welcome to upvote the question and answers at your earliest convenience. Thanks!

The steps are as follows:

1. Creating Custom Sink — DemoSink
2. Creating StreamSinkProvider — DemoSinkProvider
3. Optional Sink Registration using META-INF/services
4. build.sbt Definition
5. Packaging DemoSink
6. Using DemoSink in Streaming Query
7. Monitoring SQL Queries using web UI's SQL Tab

Findings (aka surprises):

  1. Custom sinks require that you define a checkpoint location using the checkpointLocation option (or the spark.sql.streaming.checkpointLocation Spark property); the streaming query example below sets it. Remove the checkpoint directory (or use a different one every time you start the streaming query) to get consistent results.

Creating Custom Sink — DemoSink

A streaming sink follows the Sink contract and a sample implementation could look as follows.
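
The original code listing did not survive extraction, so below is a minimal sketch, assuming the (internal) Sink contract of Spark 2.x; the package name and the println output are illustrative. Note that addBatch "touches" each batch twice (collect and show), which matters for the query count discussed later.

```scala
package demo

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.streaming.OutputMode

case class DemoSink(
    sqlContext: SQLContext,
    options: Map[String, String],
    partitionColumns: Seq[String],
    outputMode: OutputMode) extends Sink {

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    println(s"addBatch($batchId)")
    val rows = data.collect()    // first SQL query per batch
    println(s"Collected ${rows.length} rows")
    data.show(truncate = false)  // second SQL query per batch
  }
}
```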

Save the file under src/main/scala in your project.

Creating StreamSinkProvider — DemoSinkProvider
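
The listing was also lost in extraction; here is a sketch assuming Spark 2.x's StreamSinkProvider and DataSourceRegister contracts (the short name "demo" is what the streaming query below refers to):

```scala
package demo

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class DemoSinkProvider extends StreamSinkProvider with DataSourceRegister {

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    DemoSink(sqlContext, parameters, partitionColumns, outputMode)

  // The name under which the sink is registered (see META-INF/services below)
  override def shortName(): String = "demo"
}
```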

Save the file under src/main/scala in your project.

Optional Sink Registration using META-INF/services

This step is optional, but it greatly improves the experience of using the custom sink: you can refer to it by its short name rather than by the fully-qualified class name of the sink provider.

Create org.apache.spark.sql.sources.DataSourceRegister in META-INF/services directory with the following content.
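
The content is simply the fully-qualified class name of the sink provider (using the illustrative package name from the sketches above):

```
demo.DemoSinkProvider
```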

Save the file under src/main/resources in your project.

build.sbt Definition

If you use my beloved build tool sbt to manage the project, use the following build.sbt.
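
The original build.sbt was lost in extraction; a minimal sketch consistent with the jar name in the packaging step below (project name spark-structured-streaming-demo-sink, version 0.1, Scala 2.11), where the Spark version is an assumption:

```scala
name := "spark-structured-streaming-demo-sink"

version := "0.1"

scalaVersion := "2.11.12"

// The Spark version is an assumption; match the Spark you run spark-shell with
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % Provided
```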

Packaging DemoSink

The step depends on what build tool you use to manage the project. Use whatever command you use to create a jar file with the above classes compiled and bundled together.
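
With sbt, for example, it is just:

```
sbt package
```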

The jar with the sink is /Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/scala-2.11/spark-structured-streaming-demo-sink_2.11-0.1.jar.

Using DemoSink in Streaming Query

The following code reads data from the rate source and simply outputs the result to our custom DemoSink.
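
The listing did not survive extraction; here is a sketch of such a query, assuming the sink sketched above (the checkpoint path and trigger interval are illustrative). Start spark-shell with the packaged jar on the classpath first.

```scala
// spark-shell --jars target/scala-2.11/spark-structured-streaming-demo-sink_2.11-0.1.jar
import org.apache.spark.sql.streaming.Trigger

val sq = spark
  .readStream
  .format("rate")   // built-in rate source
  .load
  .writeStream
  .format("demo")   // the short name registered via META-INF/services
  .option("checkpointLocation", "/tmp/demo-checkpoint") // required for custom sinks
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start
```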

Monitoring SQL Queries using web UI’s SQL Tab

You should find that every trigger (aka batch) results in 3 SQL queries. Why?

Figure 1. web UI's SQL Tab and Completed Queries (3 Queries per Batch)

The answer lies in the sources and sink a streaming query uses (and so varies from query to query).

In our case, DemoSink collects the rows from the input DataFrame and then shows them. That accounts for 2 of the SQL queries, as you can see by executing the equivalent batch queries below.
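
A sketch of such equivalent batch queries (assuming the DemoSink sketch above, where addBatch runs collect and then show):

```scala
val data = spark.range(1)
data.collect  // shows up as one query in the SQL tab
data.show     // shows up as another query in the SQL tab
```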

The remaining query (which is the first of the three) is executed when the data is loaded from the source.

That can be observed easily when you change DemoSink to not “touch” the input data (in addBatch) in any way.
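
A sketch of such a no-op variant (imports as in the DemoSink sketch above):

```scala
case class DemoSink(
    sqlContext: SQLContext,
    options: Map[String, String],
    partitionColumns: Seq[String],
    outputMode: OutputMode) extends Sink {

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Deliberately leave data untouched: no action, so no Spark job
    println(s"addBatch($batchId)")
  }
}
```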

Re-run the streaming query (using the new DemoSink) and use web UI’s SQL tab to see the queries. You should have just one query per batch (and no Spark jobs given nothing is really done in the sink’s addBatch).

Figure 2. web UI's SQL Tab and Completed Queries (1 Query per Batch)