关注 spark技术分享,
撸spark源码 玩spark最佳实践

FlatMapGroupsWithStateExec Unary Physical Operator

FlatMapGroupsWithStateExec Unary Physical Operator

FlatMapGroupsWithStateExec is a unary physical operator that represents a FlatMapGroupsWithState logical operator at execution time.

Note
A unary physical operator is a physical operator with a single child physical operator.

FlatMapGroupsWithStateExec is created when FlatMapGroupsWithStateStrategy execution planning strategy is requested to plan a streaming query with FlatMapGroupsWithState logical operators for execution.

FlatMapGroupsWithStateExec is an ObjectProducerExec physical operator with the output object attribute.

FlatMapGroupsWithStateExec is a physical operator that supports streaming watermark.

FlatMapGroupsWithStateExec uses the performance metrics of StateStoreWriter.

FlatMapGroupsWithStateExec webui query details.png
Figure 1. FlatMapGroupsWithStateExec in web UI (Details for Query)
Note

FlatMapGroupsWithStateStrategy converts FlatMapGroupsWithState unary logical operator to FlatMapGroupsWithStateExec physical operator with undefined StatefulOperatorStateInfo, batchTimestampMs, and eventTimeWatermark.

StatefulOperatorStateInfo, batchTimestampMs, and eventTimeWatermark are defined when IncrementalExecution query execution pipeline is requested to apply the physical plan preparation rules.

When executed, FlatMapGroupsWithStateExec requires that the optional values are properly defined given timeoutConf:

Caution
FIXME Where are the optional values defined?
Table 1. FlatMapGroupsWithStateExec’s Internal Registries and Counters
Name Description

isTimeoutEnabled

stateAttributes

stateDeserializer

stateManager

stateSerializer

timestampTimeoutAttribute

watermarkPresent

Flag that says whether the child physical operator has a watermark attribute (among the output attributes).

Used exclusively when InputProcessor is requested to callFunctionAndUpdateState

Tip

Enable INFO logging level for org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec to see what happens inside.

Add the following line to conf/log4j.properties:

Refer to Logging.

keyExpressions Method

Note
keyExpressions is part of the WatermarkSupport Contract to…​FIXME.

keyExpressions simply returns the grouping attributes.

Executing FlatMapGroupsWithStateExec — doExecute Method

Note
doExecute is part of the SparkPlan contract to produce the result of a physical operator as an RDD of internal binary rows (i.e. InternalRow).

Internally, doExecute initializes metrics.

doExecute then executes child physical operator and creates a StateStoreRDD with storeUpdateFunction that:

  1. Creates a StateStoreUpdater

  2. Filters out rows from Iterator[InternalRow] that match watermarkPredicateForData (when defined and timeoutConf is EventTimeTimeout)

  3. Generates an output Iterator[InternalRow] with elements from StateStoreUpdater‘s updateStateForKeysWithData and updateStateForTimedOutKeys

  4. In the end, storeUpdateFunction creates a CompletionIterator that executes a completion function (aka completionFunction) after it has successfully iterated through all the elements (i.e. when a client has consumed all the rows). The completion method requests StateStore to commit followed by updating numTotalStateRows metric with the number of keys in the state store.

Creating FlatMapGroupsWithStateExec Instance

FlatMapGroupsWithStateExec takes the following when created:

  • State function ((Any, Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any])

  • Key deserializer expression

  • Value deserializer expression

  • Grouping attributes (as used for grouping in KeyValueGroupedDataset for mapGroupsWithState or flatMapGroupsWithState operators)

  • Data attributes

  • Output object attribute (that is the reference to the single object field this operator outputs)

  • StatefulOperatorStateInfo

  • State encoder (ExpressionEncoder[Any])

  • State format version

  • OutputMode

  • GroupStateTimeout

  • batchTimestampMs

  • Event time watermark

  • Child physical operator

FlatMapGroupsWithStateExec initializes the internal registries and counters.

shouldRunAnotherBatch Method

Note
shouldRunAnotherBatch is part of the StateStoreWriter Contract to…​FIXME.

shouldRunAnotherBatch…​FIXME

赞(0) 打赏
未经允许不得转载:spark技术分享 » FlatMapGroupsWithStateExec Unary Physical Operator
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏