
Streaming Source

Streaming Data Source

A Streaming Data Source is a “continuous” stream of data and is described using the Source contract.

A Source can generate a streaming DataFrame (aka a batch) for a given range of start and end offsets.

For fault tolerance, a Source must be able to replay data given a start offset.

A Source should be able to replay an arbitrary sequence of past data in a stream using a range of offsets. Streaming sources like Apache Kafka and Amazon Kinesis (with their per-record offsets) fit this model nicely. This assumption is what allows Structured Streaming to achieve end-to-end exactly-once guarantees, as the Kafka sketch below illustrates.
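To make offset-based replay concrete, here is a minimal sketch using the built-in kafka source; the broker address and topic name are placeholders, and startingOffsets controls where in the stream the source (re)starts reading:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-offset-replay")
  .getOrCreate()

// The kafka source tracks per-partition offsets, so a batch can be
// replayed deterministically from a recorded start offset.
// Requires the spark-sql-kafka-0-10 package on the classpath.
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .option("startingOffsets", "earliest")               // replay from the beginning
  .load()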

Table 1. Sources

Format                                                        Source
any FileFormat (csv, hive, json, libsvm, orc, parquet, text)  FileStreamSource
kafka                                                         KafkaSource
memory                                                        MemoryStream
rate                                                          RateStreamSource
socket                                                        TextSocketSource
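As a quick way to see one of the sources from Table 1 in action, the rate source generates rows with timestamp and value columns at a configurable rate; rowsPerSecond is a real option of the rate source, and the console sink is used here just for demonstration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rate-source-demo")
  .master("local[*]")
  .getOrCreate()

// rate source: emits (timestamp, value) rows at rowsPerSecond
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// print each micro-batch to the console
val query = rates.writeStream
  .format("console")
  .start()

query.awaitTermination()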

Source Contract

Table 2. Source Contract

getBatch
  Generates a DataFrame (with new rows) for a given batch, described using the optional start and end offsets.
  Used when StreamExecution runs a streaming batch and in populateStartOffsets.

getOffset
  Finds the latest offset available in the source.
  Note: Offset is…​FIXME
  Used exclusively when StreamExecution runs streaming batches (constructing the next streaming batch for every streaming data source in a streaming Dataset).

schema
  Schema of the data from this source.
  Used when:
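For reference, a condensed sketch of the Source contract as a Scala trait, modeled on org.apache.spark.sql.execution.streaming.Source in Spark 2.x; exact signatures may differ between Spark versions:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset
import org.apache.spark.sql.types.StructType

trait Source {
  // Schema of the data from this source
  def schema: StructType

  // Latest offset available in the source, or None if no data has arrived yet
  def getOffset: Option[Offset]

  // New rows between `start` (exclusive; None for the very first batch)
  // and `end` (inclusive), as a streaming DataFrame
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  // Informs the source that data up to `end` has been processed
  // and may be discarded (no-op by default)
  def commit(end: Offset): Unit = {}

  // Releases any resources held by the source
  def stop(): Unit
}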
