The Dataset API has a set of operators that are of particular use in Apache Spark's Structured Streaming and that together constitute the so-called Streaming Dataset API.
Table 1. Streaming Operators

| Operator | Description |
|----------|-------------|
| `dropDuplicates`<br>`dropDuplicates(): Dataset[T]`<br>`dropDuplicates(colNames: Seq[String]): Dataset[T]`<br>`dropDuplicates(col1: String, cols: String*): Dataset[T]` | Drops duplicate records (optionally considering only a subset of columns) |
| `explain`<br>`explain(): Unit`<br>`explain(extended: Boolean): Unit` | Prints the logical and physical plans to the console |
| `groupBy`<br>`groupBy(cols: Column*): RelationalGroupedDataset`<br>`groupBy(col1: String, cols: String*): RelationalGroupedDataset` | Aggregates rows by an untyped grouping function |
| `groupByKey`<br>`groupByKey(func: T => K): KeyValueGroupedDataset[K, T]` | Aggregates rows by a typed grouping function |
| `withWatermark`<br>`withWatermark(eventTime: String, delayThreshold: String): Dataset[T]` | Defines a streaming watermark (on the given eventTime column) for late events |
| `writeStream`<br>`writeStream: DataStreamWriter[T]` | Creates a DataStreamWriter for persisting the result of a streaming query to an external data system |
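As a quick illustration, here is a minimal sketch of how a few of these operators compose in a streaming query. It assumes the built-in `rate` source (whose rows carry `timestamp` and `value` columns); the window and watermark durations are arbitrary choices for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("streaming-operators-demo").getOrCreate()
import spark.implicits._

// The rate source emits rows with a `timestamp` (event time) and a `value` column
val rates = spark.readStream.format("rate").load

// Untyped grouping: count rows per 5-second tumbling window, with a
// 10-second watermark so late rows (and old aggregation state) can be dropped
val windowed = rates
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "5 seconds"))
  .count

// Typed grouping: group rows by the parity of `value`
val byParity = rates.groupByKey(_.getAs[Long]("value") % 2).count
```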
The following template demonstrates the shape of a complete streaming query; replace `[operator]` with the streaming operator of your choice:

```scala
val rates = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load

// stream processing
// replace [operator] with the operator of your choice
rates.[operator]

// output stream
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

val sq = rates
  .writeStream
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Complete)
  .queryName("rate-console")
  .start

// eventually...
sq.stop
```
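Note that the `[operator]` placeholder has to be replaced before the template compiles. One illustrative substitution (an assumption, not part of the original template) is a windowed count, which is an aggregation and therefore compatible with the `Complete` output mode used above:

```scala
import org.apache.spark.sql.functions.{col, window}

// Hypothetical substitution for the [operator] placeholder:
// a per-10-second count, a valid aggregation for OutputMode.Complete
val aggregated = rates
  .groupBy(window(col("timestamp"), "10 seconds"))
  .count

// ...then call writeStream on `aggregated` instead of `rates`
```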