关注 spark技术分享,
撸spark源码 玩spark最佳实践

StreamingDeduplicateExec Unary Physical Operator for Streaming Deduplication

StreamingDeduplicateExec Unary Physical Operator for Streaming Deduplication

StreamingDeduplicateExec is a unary physical operator (i.e. UnaryExecNode) that writes state to StateStore with support for streaming watermark.

StreamingDeduplicateExec is created exclusively when StreamingDeduplicationStrategy plans Deduplicate unary logical operators.

StreamingDeduplicateExec StreamingDeduplicationStrategy.png
Figure 1. StreamingDeduplicateExec and StreamingDeduplicationStrategy

StreamingDeduplicateExec uses the performance metrics of StateStoreWriter.

StreamingDeduplicateExec webui query details.png
Figure 2. StreamingDeduplicateExec in web UI (Details for Query)

The output schema of StreamingDeduplicateExec is exactly the child‘s output schema.

The output partitioning of StreamingDeduplicateExec is exactly the child‘s output partitioning.

Tip

Enable INFO logging level for org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec to see what happens inside.

Add the following line to conf/log4j.properties:

Refer to Logging.

Executing Physical Operator — doExecute Method

Note
doExecute is a part of SparkPlan contract to produce the result of a physical operator as an RDD of internal binary rows (i.e. InternalRow).

Internally, doExecute initializes metrics.

doExecute executes child physical operator and creates a StateStoreRDD with storeUpdateFunction that:

  1. Generates an unsafe projection to access the key field (using keyExpressions and the output schema of child).

  2. Filters out rows from Iterator[InternalRow] that match watermarkPredicateForData (when defined and timeoutConf is EventTimeTimeout)

  3. For every row (as InternalRow)

    • Extracts the key from the row (using the unsafe projection above)

    • Gets the saved state in StateStore for the key

    • (when there was a state for the key in the row) Filters out (aka drops) the row

    • (when there was no state for the key in the row) Stores a new (and empty) state for the key and increments numUpdatedStateRows and numOutputRows metrics.

  4. In the end, storeUpdateFunction creates a CompletionIterator that executes a completion function (aka completionFunction) after it has successfully iterated through all the elements (i.e. when a client has consumed all the rows).

    The completion function does the following:

Creating StreamingDeduplicateExec Instance

StreamingDeduplicateExec takes the following when created:

赞(0) 打赏
未经允许不得转载:spark技术分享 » StreamingDeduplicateExec Unary Physical Operator for Streaming Deduplication
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏