StateStoreRDD — RDD for Updating State (in StateStores Across Spark Cluster)

StateStoreRDD is an RDD for executing storeUpdateFunction with StateStore (and data from partitions of a new batch RDD).

StateStoreRDD is created when a stateful physical operator is executed (using StateStoreOps.mapPartitionsWithStateStore).

Figure 1. StateStoreRDD, Physical and Logical Plans, and operators

StateStoreRDD uses StateStoreCoordinator for preferred locations for job scheduling.

Figure 2. StateStoreRDD and StateStoreCoordinator

getPartitions returns exactly the partitions of the data RDD.

Table 1. StateStoreRDD’s Internal Registries and Counters

Name                 | Description
hadoopConfBroadcast  |
storeConf            | Configuration parameters (as StateStoreConf) using the current SQLConf (from SessionState)

Computing Partition (in TaskContext) — compute Method

Note
compute is a part of the RDD Contract to compute a given partition in a TaskContext.

compute computes dataRDD passing the result on to storeUpdateFunction (with a configured StateStore).

Internally (and similarly to getPreferredLocations), compute creates a StateStoreProviderId with StateStoreId (using checkpointLocation, operatorId and the index of the input partition) and queryRunId.

compute then requests StateStore for the store for the StateStoreProviderId.

In the end, compute computes dataRDD (using the input partition and ctxt) followed by executing storeUpdateFunction (with the store and the result).
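The flow of compute described above can be modelled in plain Scala without a Spark dependency. This is a minimal sketch, not Spark's code: ToyStateStore, storeFor and countWords are hypothetical stand-ins, the two id case classes are simplified versions of the classes named in the text, and partitionData stands for the result of computing dataRDD for the input partition.

```scala
import java.util.UUID
import scala.collection.mutable

object ComputeSketch {
  // Simplified stand-ins for the identifiers named in the text (not Spark's classes).
  final case class StateStoreId(checkpointLocation: String, operatorId: Long, partitionId: Int)
  final case class StateStoreProviderId(storeId: StateStoreId, queryRunId: UUID)

  // Hypothetical in-memory store; the real StateStore has a richer interface.
  final class ToyStateStore(val id: StateStoreProviderId) {
    private val data = mutable.Map.empty[String, Long]
    def get(key: String): Option[Long] = data.get(key)
    def put(key: String, value: Long): Unit = { data.update(key, value) }
  }

  // Hypothetical registry standing in for "requesting StateStore for the store".
  private val stores = mutable.Map.empty[StateStoreProviderId, ToyStateStore]
  def storeFor(id: StateStoreProviderId): ToyStateStore =
    stores.getOrElseUpdate(id, new ToyStateStore(id))

  // Models compute: build the provider id, look up the store, then run
  // storeUpdateFunction over the partition's data.
  def compute[T, U](
      checkpointLocation: String,
      operatorId: Long,
      queryRunId: UUID,
      partitionIndex: Int,
      partitionData: Iterator[T], // stands for dataRDD computed for this partition
      storeUpdateFunction: (ToyStateStore, Iterator[T]) => Iterator[U]): Iterator[U] = {
    val providerId = StateStoreProviderId(
      StateStoreId(checkpointLocation, operatorId, partitionIndex), queryRunId)
    val store = storeFor(providerId)
    storeUpdateFunction(store, partitionData)
  }

  // Example update function (an assumption for illustration): running word counts.
  val countWords: (ToyStateStore, Iterator[String]) => Iterator[(String, Long)] =
    (store, words) => words.map { word =>
      val updated = store.get(word).getOrElse(0L) + 1L
      store.put(word, updated)
      (word, updated)
    }
}
```

Calling ComputeSketch.compute twice with the same checkpoint location, operator id, run id and partition index reuses the same toy store, so counts carry over between batches. A different partition index yields a separate store.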

Getting Placement Preferences of Partition — getPreferredLocations Method

Note
getPreferredLocations is a part of the RDD Contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

getPreferredLocations creates a StateStoreProviderId with StateStoreId (using checkpointLocation, operatorId and the index of the input partition) and queryRunId.

Note
checkpointLocation and operatorId are shared across different partitions and so the only difference in StateStoreProviderIds is the partition index.

In the end, getPreferredLocations requests StateStoreCoordinatorRef for the location of the state store for StateStoreProviderId.

Note
StateStoreCoordinator coordinates instances of StateStores across Spark executors in the cluster, and tracks their locations for job scheduling.
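The lookup performed by getPreferredLocations can be illustrated with a toy coordinator. This is a hedged sketch in plain Scala: ToyCoordinator and its methods are hypothetical and only mimic the idea of tracking where a state store is loaded; they are not Spark's StateStoreCoordinatorRef. The coordinator is passed as an Option because the text lists it as optional.

```scala
import java.util.UUID
import scala.collection.mutable

object PreferredLocationsSketch {
  // Simplified stand-ins for the identifiers named in the text (not Spark's classes).
  final case class StateStoreId(checkpointLocation: String, operatorId: Long, partitionId: Int)
  final case class StateStoreProviderId(storeId: StateStoreId, queryRunId: UUID)

  // Hypothetical coordinator: remembers the executor that last reported a store as loaded.
  final class ToyCoordinator {
    private val locations = mutable.Map.empty[StateStoreProviderId, String]
    def reportActiveInstance(id: StateStoreProviderId, executor: String): Unit = {
      locations.update(id, executor)
    }
    def getLocation(id: StateStoreProviderId): Option[String] = locations.get(id)
  }

  // Models getPreferredLocations: same checkpointLocation, operatorId and queryRunId
  // for every partition; only the partition index differs.
  def preferredLocations(
      coordinator: Option[ToyCoordinator],
      checkpointLocation: String,
      operatorId: Long,
      queryRunId: UUID,
      partitionIndex: Int): Seq[String] = {
    val id = StateStoreProviderId(
      StateStoreId(checkpointLocation, operatorId, partitionIndex), queryRunId)
    coordinator.flatMap(_.getLocation(id)).toList
  }
}
```

A partition whose store has never been reported, or a missing coordinator, gives an empty list, which in this model means the scheduler has no placement preference for that task.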

Creating StateStoreRDD Instance

StateStoreRDD takes the following when created:

  • RDD with the new streaming batch data (to update the aggregates in a state store)

  • Store update function (i.e. (StateStore, Iterator[T]) ⇒ Iterator[U] with T being the type of the new batch data)

  • The path to the checkpoint location

  • queryRunId

  • Operator id

  • Store version

  • Schema of the keys

  • Schema of the values

  • Optional index

  • SessionState

  • Optional StateStoreCoordinatorRef

StateStoreRDD initializes the internal registries and counters.
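To make the shape of the store update function, (StateStore, Iterator[T]) ⇒ Iterator[U], more concrete, here is a small sketch in plain Scala. ToyStateStore and InMemoryStore are hypothetical stand-ins rather than Spark's StateStore, and the deduplicating function is only one assumed example of what such a function might do, with T and U both being String.

```scala
import scala.collection.mutable

object StoreUpdateFunctionSketch {
  // Minimal stand-in for a state store; NOT Spark's StateStore trait.
  trait ToyStateStore {
    def get(key: String): Option[Long]
    def put(key: String, value: Long): Unit
    def commit(): Long // returns the new store version
  }

  // Hypothetical in-memory implementation that starts at a given store version.
  final class InMemoryStore(private var version: Long) extends ToyStateStore {
    private val data = mutable.Map.empty[String, Long]
    def get(key: String): Option[Long] = data.get(key)
    def put(key: String, value: Long): Unit = { data.update(key, value) }
    def commit(): Long = { version += 1; version }
  }

  // Store update function: emits only keys not yet present in the store.
  val dedupe: (ToyStateStore, Iterator[String]) => Iterator[String] =
    (store, rows) => {
      val fresh = rows.filter { key =>
        val unseen = store.get(key).isEmpty
        if (unseen) store.put(key, 1L)
        unseen
      }.toList // force evaluation so all puts happen before the commit
      store.commit()
      fresh.iterator
    }
}
```

Because the store keeps its contents between calls, a second batch passed to dedupe with the same store only returns keys that the first batch did not contain.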
