
Streaming Source

Streaming Data Source

A Streaming Data Source is a “continuous” stream of data and is described using the Source contract.

A Source can generate a streaming DataFrame (aka a batch) for a given range of start and end offsets.

For fault tolerance, a Source must be able to replay data given a start offset.

A Source should be able to replay an arbitrary sequence of past data in a stream using a range of offsets. Streaming sources like Apache Kafka and Amazon Kinesis (with their per-record offsets) fit this model nicely. This assumption is what allows Structured Streaming to achieve end-to-end exactly-once guarantees, as the Kafka sketch below illustrates.
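To make offset-based replay concrete, here is a minimal sketch using the built-in kafka source; the broker address and topic name are placeholders, and startingOffsets controls where in the stream the source (re)starts reading:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-offset-replay")
  .getOrCreate()

// The kafka source tracks per-partition offsets, so a batch can be
// replayed deterministically from a recorded start offset.
// Requires the spark-sql-kafka-0-10 package on the classpath.
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .option("startingOffsets", "earliest")               // replay from the beginning
  .load()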

Table 1. Sources

Format                                                        Source
any FileFormat (csv, hive, json, libsvm, orc, parquet, text)  FileStreamSource
kafka                                                         KafkaSource
memory                                                        MemoryStream
rate                                                          RateStreamSource
socket                                                        TextSocketSource
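As a quick way to see one of the sources from Table 1 in action, the rate source generates rows with timestamp and value columns at a configurable rate; rowsPerSecond is a real option of the rate source, and the console sink is used here just for demonstration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rate-source-demo")
  .master("local[*]")
  .getOrCreate()

// rate source: emits (timestamp, value) rows at rowsPerSecond
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// print each micro-batch to the console
val query = rates.writeStream
  .format("console")
  .start()

query.awaitTermination()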

Source Contract

Table 2. Source Contract

getBatch
  Generates a DataFrame (with new rows) for a given batch, described using the optional start and end offsets.
  Used when StreamExecution runs a streaming batch and in populateStartOffsets.

getOffset
  Finds the latest offset available in the source.
  Note: Offset is…​FIXME
  Used exclusively when StreamExecution runs streaming batches (constructing the next streaming batch for every streaming data source in a streaming Dataset).

schema
  Schema of the data from this source.
  Used when:
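For reference, a condensed sketch of the Source contract as a Scala trait, modeled on org.apache.spark.sql.execution.streaming.Source in Spark 2.x; exact signatures may differ between Spark versions:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset
import org.apache.spark.sql.types.StructType

trait Source {
  // Schema of the data from this source
  def schema: StructType

  // Latest offset available in the source, or None if no data has arrived yet
  def getOffset: Option[Offset]

  // New rows between `start` (exclusive; None for the very first batch)
  // and `end` (inclusive), as a streaming DataFrame
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  // Informs the source that data up to `end` has been processed
  // and may be discarded (no-op by default)
  def commit(end: Offset): Unit = {}

  // Releases any resources held by the source
  def stop(): Unit
}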
