# ConsoleSink for Showing DataFrames to Console

`ConsoleSink` is a streaming sink that shows the DataFrame (for a batch) to the console.

`ConsoleSink` is registered as `console` format (by `ConsoleSinkProvider`).
Name | Default Value | Description |
---|---|---|
`numRows` | `20` | Number of rows to display |
`truncate` | `true` | Truncate the data to display to 20 characters |
```scala
scala> spark.version
res0: String = 2.3.0-SNAPSHOT

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val query = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("console"). // <-- use ConsoleSink
  option("truncate", false).
  option("numRows", 10).
  trigger(Trigger.ProcessingTime(10.seconds)).
  queryName("rate-console").
  start

-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+
```
## Adding Batch (by Showing DataFrame to Console) — addBatch Method
```scala
addBatch(batchId: Long, data: DataFrame): Unit
```
> **Note**: `addBatch` is a part of the Sink Contract to "add" a batch of data to the sink.
Internally, `addBatch` records the input `batchId` in the `lastBatchId` internal property.

`addBatch` then collects the input `data` DataFrame and creates a brand new DataFrame that it shows (per the `numRowsToShow` and `isTruncated` properties).
```
-------------------------------------------
Batch: [batchId]
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+
```
> **Note**: You may see `Rerun batch:` instead if the input `batchId` is below `lastBatchId` (likely due to a batch failure).
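The behaviour above can be sketched in plain Scala. This is a simplified model, not the actual Spark implementation: a `Seq[String]` of pre-rendered rows stands in for the collected DataFrame, and the `ConsoleSinkSketch` name and its `addBatch` return type are assumptions made for illustration. It mirrors the `lastBatchId` bookkeeping, the `numRowsToShow`/`isTruncated` display properties, and the `Rerun batch:` header.

```scala
// Minimal sketch (not Spark's actual code) of ConsoleSink's addBatch logic.
// A Seq[String] of pre-rendered rows stands in for the collected DataFrame.
class ConsoleSinkSketch(numRowsToShow: Int, isTruncated: Boolean) {
  private var lastBatchId: Long = -1L

  def addBatch(batchId: Long, data: Seq[String]): String = {
    // A batch id at or below the last one seen means the batch is being
    // re-run (likely after a failure), hence the "Rerun batch:" header.
    val header =
      if (batchId <= lastBatchId) s"Rerun batch: $batchId"
      else s"Batch: $batchId"
    lastBatchId = math.max(lastBatchId, batchId)

    // "Collect" the rows: show at most numRowsToShow of them and, when
    // isTruncated is set, cut cells longer than 20 characters.
    val rows = data.take(numRowsToShow).map { cell =>
      if (isTruncated && cell.length > 20) cell.take(17) + "..." else cell
    }
    (Seq("-" * 43, header, "-" * 43) ++ rows).mkString("\n")
  }
}
```

Keeping `lastBatchId` as mutable state is what lets the sink distinguish a fresh batch from a re-delivered one across calls to `addBatch`.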