ForeachSink
ForeachSink is a typed streaming sink that passes rows (of type T) to a ForeachWriter (one record at a time per partition).
Note: ForeachSink is assigned a ForeachWriter when DataStreamWriter is started.
ForeachSink is used exclusively in the foreach operator.
```scala
val records = spark
  .readStream
  .format("text")
  .load("server-logs/*.out")
  .as[String]

import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: String) = println(value)
  override def close(errorOrNull: Throwable) = {}
}

records.writeStream
  .queryName("server-logs processor")
  .foreach(writer)
  .start
```
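To see the writer in action once the query is running, drop a matching file into the monitored directory; with open returning true and process printing every record, each line should show up on the console of the process running the tasks. Below is a hypothetical way to produce such a file from the same shell (the file name and contents are made up for illustration):

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper (not part of the example above): write a new *.out file
// into the monitored directory so the next micro-batch picks it up and the
// ForeachWriter prints its lines.
Files.createDirectories(Paths.get("server-logs"))
Files.write(
  Paths.get("server-logs/batch-1.out"),
  "192.168.0.1 GET /index.html\n192.168.0.2 GET /about.html\n".getBytes("UTF-8"))
```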
Internally, addBatch (the only method from the Sink Contract) takes records from the input DataFrame (as data), converts them to the expected type T (of this ForeachSink) and (now as a Dataset) processes each partition.
```scala
addBatch(batchId: Long, data: DataFrame): Unit
```
addBatch then opens the constructor's ForeachWriter (for the current partition and the input batch) and passes each record to the writer's process method (one at a time per partition).
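The per-partition flow can be pictured with the following simplified sketch. It is not the actual ForeachSink source; the helper name processBatchWith, the use of Dataset.foreachPartition and TaskContext.getPartitionId are assumptions made for illustration, but they mirror the open-process-close lifecycle described above, including handing a possible failure to close.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Dataset, ForeachWriter}

// Simplified, hypothetical sketch of the per-partition processing described in
// the text; not the actual ForeachSink implementation.
def processBatchWith[T](data: Dataset[T], writer: ForeachWriter[T], batchId: Long): Unit = {
  data.foreachPartition { (records: Iterator[T]) =>
    val partitionId: Long = TaskContext.getPartitionId()
    // open decides whether this partition (for this batch/version) should be processed
    if (writer.open(partitionId, batchId)) {
      var error: Throwable = null
      try {
        records.foreach(writer.process)   // one record at a time
      } catch {
        case t: Throwable =>
          error = t
          throw t
      } finally {
        writer.close(error)               // close always runs, with the failure if any
      }
    } else {
      writer.close(null)                  // partition skipped, still close the writer
    }
  }
}
```

Note how the sketch remembers the failure so that close receives the exception that interrupted processing, rather than being called with no context.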
Caution: FIXME Why does Spark track whether the writer failed or not? Why couldn't it simply call close in a finally block?
Caution: FIXME Can we have a constant for "foreach" for source in DataStreamWriter?