
StatefulAggregationStrategy Execution Planning Strategy for EventTimeWatermark and Aggregate Logical Operators

StatefulAggregationStrategy is an execution planning strategy (i.e. Strategy) that IncrementalExecution uses to plan EventTimeWatermark and Aggregate logical operators in streaming structured queries (Datasets).

Note
EventTimeWatermark logical operator is the result of the withWatermark operator.

Note
Aggregate logical operator represents groupBy and groupByKey aggregations (and SQL’s GROUP BY clause).

StatefulAggregationStrategy is available through the query planner of a session's SessionState (i.e. spark.sessionState.planner.StatefulAggregationStrategy).

Table 1. StatefulAggregationStrategy’s Logical to Physical Operator Conversions

Logical Operator      Physical Operator
EventTimeWatermark    EventTimeWatermarkExec
Aggregate             In the order of preference:
                        1. HashAggregateExec
                        2. ObjectHashAggregateExec
                        3. SortAggregateExec

Tip
Read Aggregation Execution Planning Strategy for Aggregate Physical Operators in Mastering Apache Spark 2 gitbook.

Selecting Aggregate Physical Operator Given Aggregate Expressions — AggUtils.planStreamingAggregation Internal Method

planStreamingAggregation takes the grouping attributes (from groupingExpressions).

Note
groupingExpressions corresponds to the grouping expressions (columns) passed to the groupBy operator.

planStreamingAggregation creates an aggregate physical operator (called partialAggregate) with:

  • requiredChildDistributionExpressions undefined (i.e. None)

  • initialInputBufferOffset as 0

  • functionsWithoutDistinct in Partial mode

  • child operator as the input child

Note

planStreamingAggregation creates one of the following aggregate physical operators (in the order of preference):

  1. HashAggregateExec

  2. ObjectHashAggregateExec

  3. SortAggregateExec

planStreamingAggregation uses AggUtils.createAggregate method to select an aggregate physical operator that you can read about in Selecting Aggregate Physical Operator Given Aggregate Expressions — AggUtils.createAggregate Internal Method in Mastering Apache Spark 2 gitbook.
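
The order of preference above can be read as a fallback chain. The sketch below is a plain-Scala model, not Spark code. The three boolean parameters are hypothetical stand-ins for whatever checks AggUtils.createAggregate performs (this page does not spell them out), so treat the conditions as assumptions and only the ordering as taken from the text.

```scala
// A toy model of "first preferred operator that is usable wins".
// The parameter names are hypothetical; they are not Spark APIs.
object AggregateOperatorChoice {
  def choose(
      hashUsable: Boolean,        // assumed: hash-based aggregation can be used
      objectHashUsable: Boolean   // assumed: object-hash aggregation can be used
  ): String =
    if (hashUsable) "HashAggregateExec"                  // 1st preference
    else if (objectHashUsable) "ObjectHashAggregateExec" // 2nd preference
    else "SortAggregateExec"                             // fallback, always available
}
```

For instance, AggregateOperatorChoice.choose(hashUsable = false, objectHashUsable = true) selects the second-preference operator, and the sort-based operator is only reached when neither of the other two can be used.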

planStreamingAggregation creates an aggregate physical operator (called partialMerged1) with:

  • requiredChildDistributionExpressions based on the input groupingExpressions

  • initialInputBufferOffset as the length of groupingExpressions

  • functionsWithoutDistinct in PartialMerge mode

  • child operator as partialAggregate aggregate physical operator created above

planStreamingAggregation creates StateStoreRestoreExec with the grouping attributes, undefined StatefulOperatorStateInfo, and partialMerged1 aggregate physical operator created above.

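planStreamingAggregation creates an aggregate physical operator (called partialMerged2) with the same properties as partialMerged1, except that the child operator is the StateStoreRestoreExec physical operator created above.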

Note
The only difference between partialMerged1 and partialMerged2 steps is the child physical operator.

planStreamingAggregation creates StateStoreSaveExec with:

  • the grouping attributes based on the input groupingExpressions

  • stateInfo, outputMode and eventTimeWatermark all undefined (i.e. None)

  • child operator as partialMerged2 aggregate physical operator created above

In the end, planStreamingAggregation creates the final aggregate physical operator (called finalAndCompleteAggregate) with:

  • requiredChildDistributionExpressions based on the input groupingExpressions

  • initialInputBufferOffset as the length of groupingExpressions

  • functionsWithoutDistinct in Final mode

  • child operator as StateStoreSaveExec physical operator created above
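
The nesting of the operators described in the steps above can be sketched with a small plain-Scala model. None of the types below are Spark classes: Plan, Source, Agg, StateStoreRestore and StateStoreSave are simplified, hypothetical stand-ins that carry only the parameters named in the text. The child of partialMerged2 is assumed to be the StateStoreRestoreExec step, inferred from the order of the steps.

```scala
// Simplified stand-ins for physical operators (NOT Spark classes).
sealed trait Plan { def child: Option[Plan] }

final case class Source(name: String) extends Plan {
  val child: Option[Plan] = None
}

// Models an aggregate physical operator with the parameters named in the text.
final case class Agg(
    label: String,
    mode: String, // "Partial", "PartialMerge" or "Final"
    requiredChildDistributionExpressions: Option[Seq[String]],
    initialInputBufferOffset: Int,
    input: Plan) extends Plan {
  def child: Option[Plan] = Some(input)
}

final case class StateStoreRestore(
    keys: Seq[String],
    stateInfo: Option[String],
    input: Plan) extends Plan {
  def child: Option[Plan] = Some(input)
}

final case class StateStoreSave(
    keys: Seq[String],
    stateInfo: Option[String],
    outputMode: Option[String],
    eventTimeWatermark: Option[Long],
    input: Plan) extends Plan {
  def child: Option[Plan] = Some(input)
}

object StreamingAggSketch {
  // Builds the chain of operators in the order the steps describe.
  def plan(grouping: Seq[String], child: Plan): Plan = {
    val partialAggregate = Agg("partialAggregate", "Partial", None, 0, child)
    val partialMerged1 =
      Agg("partialMerged1", "PartialMerge", Some(grouping), grouping.length, partialAggregate)
    val restored = StateStoreRestore(grouping, None, partialMerged1)
    // Assumption: same settings as partialMerged1, only the child differs.
    val partialMerged2 =
      Agg("partialMerged2", "PartialMerge", Some(grouping), grouping.length, restored)
    val saved = StateStoreSave(grouping, None, None, None, partialMerged2)
    Agg("finalAndCompleteAggregate", "Final", Some(grouping), grouping.length, saved)
  }

  // Walks from the root operator down to the leaf, collecting operator names.
  def chain(plan: Plan): List[String] = {
    val name = plan match {
      case a: Agg               => a.label
      case _: StateStoreRestore => "StateStoreRestoreExec"
      case _: StateStoreSave    => "StateStoreSaveExec"
      case s: Source            => s.name
    }
    name :: plan.child.map(chain).getOrElse(Nil)
  }
}
```

Calling StreamingAggSketch.chain(StreamingAggSketch.plan(Seq("window"), Source("child"))) lists the operators from the root (finalAndCompleteAggregate) down to the input child, which makes it easy to see that the two state store operators sit between the merge steps and the final aggregation.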

Note
planStreamingAggregation is used exclusively when StatefulAggregationStrategy plans a streaming aggregation.