Streaming Data Source
Streaming Data Source is a “continuous” stream of data and is described using the Source contract.

A Source can generate a streaming DataFrame (aka a batch) given start and end offsets.

For fault tolerance, a Source must be able to replay data given a start offset. In other words, a Source should be able to replay an arbitrary sequence of past data in a stream using a range of offsets. Streaming sources like Apache Kafka and Amazon Kinesis (with their per-record offsets) fit this model nicely, and it is this assumption that lets Structured Streaming achieve end-to-end exactly-once guarantees.
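To illustrate the offset-based replay model, here is a minimal sketch of reading from the built-in Kafka source. The broker address, topic name (`events`) and checkpoint directory are assumptions made for this example, not values from the text; the checkpoint is what records which offsets have already been processed, so a restarted query can replay exactly the missing range.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offset-replay").getOrCreate()

// Assumed values: a Kafka broker at localhost:9092 and a topic named "events".
// startingOffsets only applies to a fresh query; on restart, the offsets
// recorded in the checkpoint take precedence, which is what makes replay
// (and exactly-once processing) possible.
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

val query = records
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/kafka-offset-replay")
  .start()

query.awaitTermination()
```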
Format | Source
---|---
Any `FileFormat` (e.g. `csv`, `json`, `orc`, `parquet`, `text`) | `FileStreamSource`
`kafka` | `KafkaSource`
`rate` | `RateStreamSource`
`socket` | `TextSocketSource`
(in-memory, created programmatically, mainly for testing) | `MemoryStream`
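For a quick experiment with one of these built-in sources, the `rate` format (backed by `RateStreamSource`) generates timestamped rows at a configurable speed. The sketch below is illustrative only; the application name and `rowsPerSecond` value are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-source-demo").getOrCreate()

// The "rate" format generates rows with a timestamp column and a
// monotonically increasing value column.
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

val query = rates.writeStream
  .format("console")
  .start()

query.awaitTermination()
```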
Source Contract
```scala
package org.apache.spark.sql.execution.streaming

trait Source {
  def commit(end: Offset): Unit = {}
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  def getOffset: Option[Offset]
  def schema: StructType
  def stop(): Unit
}
```
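To make the contract concrete, here is an illustrative-only sketch of a Source that emits a growing counter. The `CounterOffset` and `CounterSource` names are made up for this example; a production source would additionally have to return a DataFrame marked as streaming (the built-in sources rely on a Spark-internal API for that) and be registered through a `StreamSourceProvider`, both of which are out of scope for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Made-up offset type: a single monotonically increasing counter,
// serialized to JSON as required by the Offset contract.
case class CounterOffset(value: Long) extends Offset {
  override def json: String = value.toString
}

// Illustrative-only source that "produces" one new number per trigger.
class CounterSource(sqlContext: SQLContext) extends Source {

  @volatile private var latest = -1L

  override def schema: StructType =
    StructType(StructField("value", LongType) :: Nil)

  // Report the highest offset available so far (None would mean "no data yet").
  override def getOffset: Option[Offset] = {
    latest += 1
    Some(CounterOffset(latest))
  }

  // Replay every value after `start` (exclusive) up to `end` (inclusive).
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map(_.json.toLong + 1).getOrElse(0L)
    val to = end.json.toLong
    val rows = sqlContext.sparkContext.parallelize(from to to).map(Row(_))
    sqlContext.createDataFrame(rows, schema)
  }

  // Nothing to clean up once Spark has processed data up to `end`.
  override def commit(end: Offset): Unit = ()

  override def stop(): Unit = ()
}
```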
Method | Description
---|---
`commit` | Informs the source that Spark has completed processing all data for offsets less than or equal to `end`, and will only request newer data in the future
`getBatch` | Generates a `DataFrame` with the data between the given `start` (exclusive) and `end` (inclusive) offsets
`getOffset` | Finds the latest offset available in the source (`None` if no data has arrived yet)
`schema` | Schema of the data from this source
`stop` | Stops this source and frees up any resources it has allocated
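These methods are exercised by the micro-batch engine in a fixed order: discover the latest offset, fetch the batch, process it, then commit. The sketch below is a heavily simplified, hypothetical version of that loop (the real MicroBatchExecution also persists offsets to the checkpoint's offset log before processing and to the commit log afterwards); `runOneMicroBatch` is a name made up for this example.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{Offset, Source}

// Hypothetical, heavily simplified view of how the micro-batch engine
// drives a Source for a single trigger.
def runOneMicroBatch(
    source: Source,
    lastCommitted: Option[Offset])(process: DataFrame => Unit): Option[Offset] = {
  source.getOffset match {
    case Some(latest) if !lastCommitted.contains(latest) =>
      // New data: fetch everything between the last committed offset
      // (exclusive) and the latest available offset (inclusive).
      val batch = source.getBatch(lastCommitted, latest)
      process(batch)
      // Let the source discard data up to and including `latest`.
      source.commit(latest)
      Some(latest)
    case _ =>
      // No new data in this trigger; keep the previous offset.
      lastCommitted
  }
}
```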