Internals of Streaming Datasets

Note
This page keeps notes on how to guide readers through the codebase and may disappear if it gets merged with other pages or becomes an intro page.

DataStreamReader and Streaming Data Source

It all starts with SparkSession.readStream method which lets you define a streaming source in a stream processing pipeline (aka streaming processing graph or dataflow graph).

SparkSession.readStream method creates a DataStreamReader.

The fluent API of DataStreamReader lets you describe the input data source (e.g. DataStreamReader.format and DataStreamReader.options) using method chaining, with the goal of making the source code read almost like ordinary written prose, essentially creating a domain-specific language within the interface (see the Fluent interface article in Wikipedia).

There are a couple of built-in data source formats: csv, json, orc, parquet and text. Their names are also the names of the corresponding DataStreamReader methods, which act as shortcuts for DataStreamReader.format (where you specify the format by name) followed by DataStreamReader.load.
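For example, the following sketch (assuming a SparkSession available as spark, e.g. in spark-shell, and a placeholder input directory) shows the text shortcut next to its format/load equivalent:

// The format-named shortcut...
val viaShortcut = spark.readStream.text("/tmp/in")

// ...is equivalent to naming the format explicitly and then calling load.
val viaFormat = spark.readStream.format("text").load("/tmp/in")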

You may also want to use DataStreamReader.schema method to specify the schema of the streaming data source.

In the end, you use DataStreamReader.load method that simply creates a streaming Dataset (the good ol’ Dataset that you may have already used in Spark SQL).

The Dataset has the isStreaming property enabled, which is basically the only way to distinguish streaming Datasets from regular, batch Datasets.
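A minimal sketch of the whole chain (assuming a SparkSession available as spark and a hypothetical directory of JSON files under /tmp/events):

import org.apache.spark.sql.types.{LongType, StringType, StructType}

// File-based sources (like json) require an explicit schema for streaming.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val events = spark
  .readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", "1")  // just an example option
  .load("/tmp/events")                // hypothetical input directory

assert(events.isStreaming)            // true only for streaming Datasets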

In other words, Spark Structured Streaming is designed to extend the features of Spark SQL and let your structured queries be streaming queries.

Data Source Resolution, Streaming Dataset and Logical Query Plan

Being curious about the internals of streaming Datasets is where you start seeing numbers, not humans (sorry, couldn’t resist drawing the comparison between The Matrix and the internals of Spark Structured Streaming).

Whenever you create a Dataset (be it batch in Spark SQL or streaming in Spark Structured Streaming), you create a logical query plan using the high-level Dataset DSL.

A logical query plan is made up of logical operators.

Spark Structured Streaming gives you two logical operators to represent streaming sources, i.e. StreamingRelationV2 and StreamingRelation.

When the DataStreamReader.load method is executed, load first looks up the requested data source (the one you specified using DataStreamReader.format) and creates an instance of it (instantiation). That is the data source resolution step (described in…FIXME).

DataStreamReader.load is where you can find the intersection of the former Micro-Batch Stream Processing V1 API with the new Continuous Stream Processing V2 API.

For MicroBatchReadSupport or ContinuousReadSupport data sources, DataStreamReader.load creates a logical query plan with a StreamingRelationV2 leaf logical operator. That is the new V2 code path.

For all other types of streaming data sources, DataStreamReader.load creates a logical query plan with a StreamingRelation leaf logical operator. That is the former V1 code path.
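You can see which code path a given source takes by looking at the logical plan yourself. A sketch (the built-in rate source is used purely for illustration; the exact leaf operator you get depends on the data source and the Spark version):

val rates = spark
  .readStream
  .format("rate")  // a built-in source, used here only for illustration
  .load()

// The plan ends with a streaming leaf operator: StreamingRelationV2 for
// MicroBatchReadSupport/ContinuousReadSupport sources (the V2 code path)
// or StreamingRelation otherwise (the V1 code path).
println(rates.queryExecution.logical.numberedTreeString)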

Dataset API — High-Level DSL to Build Logical Query Plan

With a streaming Dataset created, you can now use all the methods of the Dataset API, including the streaming-specific operators (e.g. Dataset.withWatermark).

Please note that a streaming Dataset is a regular Dataset (with some streaming-related limitations).

The point is to understand that the Dataset API is a domain-specific language (DSL) to build a more sophisticated stream processing pipeline that you could also build using the low-level logical operators directly.

Use Dataset.explain to learn the underlying logical and physical query plans.

Or go pro and talk to QueryExecution directly.
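For example, a sketch that reuses the hypothetical events Dataset from the earlier sketch:

import org.apache.spark.sql.functions.count

// The same high-level operators as in batch Spark SQL
val counts = events
  .groupBy("name")
  .agg(count("*") as "events")

// Parsed, analyzed, optimized and physical plans of the (not yet started) query
counts.explain(extended = true)

// Or talk to QueryExecution directly
println(counts.queryExecution.analyzed.numberedTreeString)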

Please note that you may also have used most of the stream processing operators in batch structured queries in Spark SQL. Again, the distinction between Spark SQL and Spark Structured Streaming is very thin from a developer’s point of view.

DataStreamWriter and Streaming Data Sink

Once you’re satisfied with building a stream processing pipeline (using the APIs of DataStreamReader, Dataset, RelationalGroupedDataset and KeyValueGroupedDataset), you should define how and when the result of the streaming query is persisted in (sent out to) an external data system using a streaming sink.

Tip
Find out more on the APIs of Dataset, RelationalGroupedDataset and KeyValueGroupedDataset in The Internals of Spark SQL book.

You should use Dataset.writeStream method that simply creates a DataStreamWriter.

The fluent API of DataStreamWriter lets you describe the output data sink (e.g. DataStreamWriter.format and DataStreamWriter.options) using method chaining, with the goal of making the source code read almost like ordinary written prose, essentially creating a domain-specific language within the interface (see the Fluent interface article in Wikipedia).

As with the DataStreamReader data source formats, there are a couple of built-in data sink formats. Unlike the source formats, their names do not have corresponding DataStreamWriter methods. The reason is that you use DataStreamWriter.start to create and immediately start a StreamingQuery.

There are, however, two special output formats that do have corresponding DataStreamWriter methods, i.e. DataStreamWriter.foreach and DataStreamWriter.foreachBatch, which allow for persisting query results to external data systems that do not have streaming sinks available. They give you a trade-off between developing a full-blown streaming sink and simply using these methods (which lay the groundwork for what a custom sink would have to do anyway).
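A sketch of DataStreamWriter.foreachBatch (reusing the hypothetical counts Dataset from the earlier sketch; the parquet sink and the output path are placeholders):

import org.apache.spark.sql.DataFrame

// foreachBatch hands you every micro-batch as a regular (batch) DataFrame
// together with its batch id, so any batch data source can act as a sink.
def saveBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.write
    .mode("overwrite")
    .parquet(s"/tmp/counts/batch-$batchId")  // placeholder output path
}

val query = counts
  .writeStream
  .outputMode("complete")      // counts is a streaming aggregation
  .foreachBatch(saveBatch _)
  .start()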

DataStreamWriter API defines two new concepts that are not available in the “base” Spark SQL: the output mode (which you specify using DataStreamWriter.outputMode) and the trigger (which you specify using DataStreamWriter.trigger).

You may also want to give a streaming query a name using DataStreamWriter.queryName method.

In the end, you use DataStreamWriter.start method to create and immediately start a StreamingQuery.
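Putting the writer side together (again a sketch with the hypothetical counts Dataset; the console sink and the 10-second trigger are arbitrary choices for illustration):

import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val sq = counts
  .writeStream
  .format("console")                             // a built-in sink, handy for demos
  .outputMode(OutputMode.Complete())             // the output mode
  .trigger(Trigger.ProcessingTime("10 seconds")) // the trigger
  .queryName("counts-to-console")
  .start()                                       // creates and starts a StreamingQuery

println(s"Started ${sq.name} with id ${sq.id}")  // sq is a StreamingQuery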

When DataStreamWriter is requested to start a streaming query, it handles a couple of data source formats specially (e.g. memory, foreach and foreachBatch) and resolves all others to a streaming sink.

With a streaming sink, DataStreamWriter requests the StreamingQueryManager to start a streaming query.

StreamingQuery

When a stream processing pipeline is started (using DataStreamWriter.start method), DataStreamWriter creates a StreamingQuery and requests the StreamingQueryManager to start a streaming query.

StreamingQueryManager

StreamingQueryManager is used to manage streaming queries.
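A sketch of the manager side (it is available as SparkSession.streams; sq is the StreamingQuery started in the earlier sketch):

// StreamingQueryManager keeps track of all active streaming queries of a SparkSession.
val manager = spark.streams

manager.active.foreach { q =>
  println(s"${q.name} (id=${q.id}) is active")
}

// Look a query up by its id...
val found = manager.get(sq.id)

// ...or block the current thread until any active query terminates.
manager.awaitAnyTermination()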
