关注 spark技术分享,
撸spark源码 玩spark最佳实践

MemoryStream

MemoryStream

MemoryStream is a streaming Source that produces values to memory.

MemoryStream uses the internal batches collection of datasets.

Caution

This source is not for production use due to design contraints, e.g. infinite in-memory collection of lines read and no fault recovery.

MemoryStream is designed primarily for unit tests, tutorials and debugging.

Tip

Enable DEBUG logging level for org.apache.spark.sql.execution.streaming.MemoryStream logger to see what happens inside.

Add the following line to conf/log4j.properties:

Refer to Logging.

Creating MemoryStream Instance

MemoryStream object defines apply method that you can use to create instances of MemoryStream streaming sources.

Adding Data to Source (addData methods)

addData methods add the input data to batches internal collection.

When executed, addData adds a DataFrame (created using toDS implicit method) and increments the internal currentOffset offset.

You should see the following DEBUG message in the logs:

Generating Next Streaming Batch — getBatch Method

Note
getBatch is a part of Streaming Source contract.

When executed, getBatch uses the internal batches collection to return requested offsets.

You should see the following DEBUG message in the logs:

StreamingExecutionRelation Logical Plan

MemoryStream uses StreamingExecutionRelation logical plan to build Datasets or DataFrames when requested.

Schema (schema method)

MemoryStream works with the data of the schema as described by the Encoder (of the Dataset).

赞(0) 打赏
未经允许不得转载:spark技术分享 » MemoryStream
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏