关注 spark技术分享,
撸spark源码 玩spark最佳实践

withWatermark Operator — Event Time Watermark

withWatermark Operator — Event Time Watermark

withWatermark specifies the eventTime column for event time watermark and delayThreshold for event lateness.

eventTime specifies the column to use for watermark and can be either part of Dataset from the source or custom-generated using current_time or current_timestamp functions.

Note
Watermark tracks a point in time before which it is assumed no more late events are supposed to arrive (and if they have, the late events are considered really late and simply dropped).
Note

Spark Structured Streaming uses watermark for the following:

  • To know when a given time window aggregation (using groupBy operator with window function) can be finalized and thus emitted when using output modes that do not allow updates, like Append output mode.

  • To minimize the amount of state that we need to keep for ongoing aggregations, e.g. mapGroupsWithState (for implicit state management), flatMapGroupsWithState (for user-defined state management) and dropDuplicates operators.

The current watermark is computed by looking at the maximum eventTime seen across all of the partitions in a query minus a user-specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time.

Note
In some cases Spark may still process records that arrive more than delayThreshold late.
赞(0) 打赏
未经允许不得转载:spark技术分享 » withWatermark Operator — Event Time Watermark
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏