Deduplicate Unary Logical Operator
Deduplicate is a unary logical operator (i.e. a LogicalPlan) that is created to represent the dropDuplicates operator (which drops duplicate records for a given subset of columns).

Deduplicate has the streaming flag enabled for streaming Datasets.
```scala
val uniqueRates = spark.
  readStream.
  format("rate").
  load.
  dropDuplicates("value") // <-- creates Deduplicate logical operator

// Note the streaming flag
scala> println(uniqueRates.queryExecution.logical.numberedTreeString)
00 Deduplicate [value#33L], true // <-- streaming flag enabled
01 +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4785f176,rate,List(),None,List(),None,Map(),None), rate, [timestamp#32, value#33L]
```
Caution: FIXME Example with duplicates across batches to show that Deduplicate keeps state and that the withWatermark operator should also be used to limit how much state is stored (to avoid an OOM).
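Such an example could look along the following lines. This is a minimal sketch only, not the worked example the FIXME asks for: the synthetic id column, the renamed eventTime column and the 10-minute watermark threshold are illustrative assumptions. Including the event-time column among the deduplication columns is what lets the watermark eventually drop old per-key state.

```scala
// A sketch only: column names and the watermark threshold are illustrative assumptions.
import org.apache.spark.sql.functions.col

val events = spark.
  readStream.
  format("rate").
  load.
  withColumn("id", col("value") % 10).       // synthetic key so duplicates appear across batches
  withColumnRenamed("timestamp", "eventTime")

val dedupedWithWatermark = events.
  withWatermark("eventTime", "10 minutes").  // bounds how long per-key deduplication state is kept
  dropDuplicates("id", "eventTime")          // <-- creates Deduplicate with the streaming flag enabled
```

Without the watermark, Deduplicate has to keep every key it has ever seen, which is what the caution above warns about.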
Note: The following code is not supported in Structured Streaming and results in an AnalysisException.
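The code the note refers to is not preserved on this page. One pattern that is rejected this way is deduplicating after a streaming aggregation; the sketch below (using the rate source and a 5-second window, both illustrative choices) shows it. The exception surfaces when the query is started, because that is when Spark checks the plan for unsupported streaming operations.

```scala
// A sketch of one unsupported pattern (the note's original code is not preserved here):
// dropDuplicates applied after a streaming aggregation.
import org.apache.spark.sql.functions.{col, window}

val rates = spark.readStream.format("rate").load

val countsDeduped = rates.
  groupBy(window(col("timestamp"), "5 seconds")).
  count.
  dropDuplicates("count")   // <-- Deduplicate after a streaming aggregation

// Starting the query triggers the unsupported-operation check and fails with an AnalysisException.
countsDeduped.
  writeStream.
  format("console").
  outputMode("complete").
  start
```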
The output schema of Deduplicate is exactly the child's output schema.
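For instance, in the uniqueRates example above, dropDuplicates does not add or remove columns, so the deduplicated streaming Dataset keeps the rate source's schema (a sketch of the expected spark-shell output):

```scala
// Deduplicate only filters rows, so its output is the child's (here: the rate source's schema).
scala> uniqueRates.printSchema
root
 |-- timestamp: timestamp (nullable = true)
 |-- value: long (nullable = true)
```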