FileStreamSink — Streaming Sink for Parquet Format
FileStreamSink is the streaming sink that writes out the results of a streaming query to parquet files.
```scala
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val out = in.
  writeStream.
  format("parquet").
  option("path", "parquet-output-dir").
  option("checkpointLocation", "checkpoint-dir").
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Append).
  start
```
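FileStreamSink records the files it has committed in the `_spark_metadata` directory under the output path, so a batch query that reads the directory back only sees fully committed files. A minimal sketch, using the output directory from the example above:

```scala
// Batch read of the sink's output directory; Spark consults the
// _spark_metadata log there and ignores partially written files.
val result = spark.read.parquet("parquet-output-dir")
result.printSchema()
```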
FileStreamSink is created exclusively when DataSource is requested to create a streaming sink.
FileStreamSink supports Append output mode only.
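Starting a query with any other output mode fails with an `AnalysisException`. A sketch of what that looks like, reusing `in` from the example above (the exact message may differ across Spark versions):

```scala
// Fails at query start since the file sink only supports Append, e.g.
// org.apache.spark.sql.AnalysisException:
//   Data source parquet does not support Complete output mode;
in.writeStream.
  format("parquet").
  option("path", "parquet-output-dir").
  option("checkpointLocation", "checkpoint-dir").
  outputMode(OutputMode.Complete).
  start
```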
FileStreamSink uses the spark.sql.streaming.fileSink.log.deletion configuration property (as isDeletingExpiredLog) to control whether obsolete metadata log files are deleted after compaction.
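For reference, a sketch of tuning the related file sink log properties (these names exist in Spark's SQLConf; the defaults shown are as of Spark 2.x):

```scala
// Whether obsolete (compacted-away) sink log files get deleted (default: true)
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", true)
// How many sink log files to collapse into a compact file (default: 10)
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", 10)
// How long to keep obsolete log files around before deleting them (default: 10m)
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
```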
The textual representation of FileStreamSink is FileSink[path]
FileStreamSink uses the following internal registries and counters:

| Name | Description |
|---|---|
| FIXME | Used when…FIXME |
| FIXME | Used when…FIXME |
| FIXME | Used when…FIXME |
| FIXME | Used when…FIXME |
addBatch Method
```scala
addBatch(batchId: Long, data: DataFrame): Unit
```
> **Note:** `addBatch` is part of the Sink contract to "add" a batch of data to the sink.
addBatch…FIXME
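A minimal sketch of the idempotence check `addBatch` performs, with a hypothetical `latestCommitted` value standing in for the sink's metadata log lookup (in Spark, a batch whose id is not greater than the latest committed batch is skipped, which is what makes re-executed micro-batches safe):

```scala
// A simplified sketch, not the actual Spark source: addBatch skips a batch
// that is already recorded in the sink's metadata log and writes it otherwise.
def addBatchSketch(batchId: Long, latestCommitted: Option[Long]): Unit = {
  if (latestCommitted.exists(batchId <= _)) {
    println(s"Skipping already committed batch $batchId")
  } else {
    // FileStreamSink would write the files here (via FileFormatWriter)
    // and then atomically add them to the metadata log.
    println(s"Writing and committing batch $batchId")
  }
}
```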
Creating FileStreamSink Instance
FileStreamSink takes the following when created:
- Path of the output directory (under which the `_spark_metadata` metadata directory is maintained)
FileStreamSink initializes the internal registries and counters.
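For reference, a sketch of the constructor signature as it appears in Spark 2.x sources (parameter names and order may vary across versions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.FileFormat

// Sketch of the Spark 2.x constructor; the real class extends Sink.
class FileStreamSink(
    sparkSession: SparkSession,
    path: String,
    fileFormat: FileFormat,
    partitionColumnNames: Seq[String],
    options: Map[String, String])
```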