ShuffledRDD

ShuffledRDD is an RDD of key-value pairs that represents the shuffle step in an RDD lineage. It uses custom ShuffledRDDPartition partitions.

A ShuffledRDD is created for RDD transformations that trigger a data shuffling:

  1. coalesce transformation (with shuffle flag enabled).

  2. PairRDDFunctions's combineByKeyWithClassTag and partitionBy (when the parent RDD's Partitioner and the specified Partitioner are different).

  3. OrderedRDDFunctions's sortByKey and repartitionAndSortWithinPartitions ordered operators.
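
The list above can be checked against a small local job. The following is a minimal sketch, assuming spark-core is on the classpath (or that it is pasted into spark-shell); the sample data and the isShuffled helper are illustrative and not part of Spark.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Reuse a running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("shuffled-rdd-sketch"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

// partitionBy with a Partitioner the parent RDD does not already have
val partitioned = pairs.partitionBy(new HashPartitioner(4))
// reduceByKey is implemented on top of combineByKeyWithClassTag
val reduced = pairs.reduceByKey(_ + _)
// sortByKey comes from OrderedRDDFunctions
val sorted = pairs.sortByKey()
// coalesce with shuffle enabled puts a ShuffledRDD deeper in the lineage
val coalesced = pairs.coalesce(1, shuffle = true)

// Illustrative helper (not a Spark API): is the RDD itself a ShuffledRDD?
def isShuffled(rdd: RDD[_]): Boolean = rdd.isInstanceOf[ShuffledRDD[_, _, _]]

println(Seq(partitioned, reduced, sorted).map(isShuffled))
println(coalesced.toDebugString)
```

Note that coalesce does not return the ShuffledRDD itself; it shows up in the output of toDebugString as a parent in the lineage.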

ShuffledRDD takes a parent RDD and a Partitioner when created.

getDependencies returns a single-element collection of RDD dependencies that holds one ShuffleDependency, whose Serializer is chosen according to the mapSideCombine internal flag.
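
As a rough illustration, a ShuffledRDD can also be created directly and its dependencies inspected. This sketch assumes spark-core on the classpath; the sample data and the choice of K = String, V = Int, C = Int are made up for the example.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD

// Reuse a running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("shuffled-rdd-sketch"))

val parent = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 2)
val part = new HashPartitioner(3)

// K = String, V = Int, C = Int (no aggregation, so C is the same as V)
val shuffled = new ShuffledRDD[String, Int, Int](parent, part)

// dependencies is the public accessor that is backed by getDependencies
val deps = shuffled.dependencies
val shuffleDep = deps.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

println(s"number of dependencies: ${deps.size}")
println(s"partitioner: ${shuffleDep.partitioner}")
println(s"serializer: ${shuffleDep.serializer.getClass.getSimpleName}")
```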

Map-Side Combine mapSideCombine Internal Flag

mapSideCombine internal flag is used to select the Serializer (for shuffling) when ShuffleDependency is created (which is the one and only Dependency of a ShuffledRDD).

Note
For selecting the Serializer, mapSideCombine is only consulted when the optional userSpecifiedSerializer is not specified explicitly (which is the default).

If mapSideCombine is enabled (i.e. true), getDependencies finds the Serializer for the types K and C. Otherwise, getDependencies finds the Serializer for the types K and V.
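
The selection rule can be summarised with a simplified model. This is not Spark code: SerializerChoice, UserSpecified, ForTypes and chooseSerializer are hypothetical names introduced only to show the order of the three cases described above.

```scala
// Simplified model, not Spark code: all names below are hypothetical.
sealed trait SerializerChoice
case class UserSpecified(name: String) extends SerializerChoice
case class ForTypes(keyType: String, valueType: String) extends SerializerChoice

def chooseSerializer(
    userSpecifiedSerializer: Option[String],
    mapSideCombine: Boolean,
    k: String,
    v: String,
    c: String): SerializerChoice =
  userSpecifiedSerializer match {
    // An explicitly specified serializer always wins.
    case Some(name) => UserSpecified(name)
    // With map-side combine, combined values of type C are shuffled.
    case None if mapSideCombine => ForTypes(k, c)
    // Without it, the original values of type V are shuffled.
    case None => ForTypes(k, v)
  }

println(chooseSerializer(None, mapSideCombine = true, "K", "V", "C"))
println(chooseSerializer(None, mapSideCombine = false, "K", "V", "C"))
```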

Note
The types K, C and V are specified when ShuffledRDD is created.
Note
mapSideCombine is disabled (i.e. false) when ShuffledRDD is created and can be set using setMapSideCombine method.

setMapSideCombine method is only used in the experimental PairRDDFunctions.combineByKeyWithClassTag transformation.
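
For illustration, the flag can also be toggled by hand on a directly created ShuffledRDD. This is a sketch assuming spark-core on the classpath; the sum aggregator and the flag helper are illustrative. An Aggregator is set first because map-side combine only makes sense together with one.

```scala
import org.apache.spark.{Aggregator, HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD

// Reuse a running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("shuffled-rdd-sketch"))

val parent = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

// Sum the values per key: V = Int and C = Int
val sum = new Aggregator[String, Int, Int](
  createCombiner = (v: Int) => v,
  mergeValue = (c: Int, v: Int) => c + v,
  mergeCombiners = (c1: Int, c2: Int) => c1 + c2)

// Default: mapSideCombine is disabled
val plain = new ShuffledRDD[String, Int, Int](parent, new HashPartitioner(2))

// Explicitly enabled, together with the aggregator that does the combining
val combined = new ShuffledRDD[String, Int, Int](parent, new HashPartitioner(2))
  .setAggregator(sum)
  .setMapSideCombine(true)

// Illustrative helper: read the flag back from the one and only dependency
def flag(rdd: ShuffledRDD[String, Int, Int]): Boolean =
  rdd.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]].mapSideCombine

println(s"plain: ${flag(plain)}, combined: ${flag(combined)}")
println(combined.collect().toMap)
```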

Computing Partition (in TaskContext) — compute Method

Note
compute is part of RDD Contract to compute a partition (in a TaskContext).

Internally, compute takes the first (and only) dependency of the ShuffledRDD, which is expected to be a ShuffleDependency. It then requests ShuffleManager for a ShuffleReader to read key-value pairs (as Iterator[(K, C)]) for the split.

Note
A Partition has the index property that determines the range of partitions to read: startPartition is the index and endPartition is the index plus one.
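
The effect is that computing partition i yields exactly the records that the Partitioner assigned to i. A minimal sketch of that relationship, assuming spark-core on the classpath and made-up sample data:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD

// Reuse a running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("shuffled-rdd-sketch"))

val parent = sc.parallelize(
  Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)), numSlices = 2)
val part = new HashPartitioner(3)
val shuffled = new ShuffledRDD[String, Int, Int](parent, part)

// Tag every record with the index of the partition that produced it.
val tagged = shuffled
  .mapPartitionsWithIndex((idx, records) => records.map { case (k, v) => (idx, k, v) })
  .collect()

tagged.foreach { case (idx, k, v) =>
  println(s"partition $idx read ($k, $v); partitioner says ${part.getPartition(k)}")
}
```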

Getting Placement Preferences of Partition — getPreferredLocations Method

Note
getPreferredLocations is part of RDD contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

Internally, getPreferredLocations requests MapOutputTrackerMaster for the preferred locations, i.e. BlockManagers with the most map outputs, for the input partition (of the one and only ShuffleDependency).

Note
getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster (which runs on the driver).
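
One plausible reading of "the most map outputs" is sketched below as a simplified model. It is not Spark code: preferredHosts, the host names, the byte counts and the fraction threshold are all hypothetical and only illustrate picking the hosts that hold a large enough share of the map output for one reduce partition.

```scala
// Simplified model, not Spark code: names and the threshold are hypothetical.
def preferredHosts(bytesByHost: Map[String, Long], fraction: Double): Seq[String] = {
  val total = bytesByHost.values.sum
  if (total == 0L) Seq.empty
  else
    bytesByHost.collect {
      // Keep a host only if it holds at least `fraction` of all the bytes.
      case (host, bytes) if bytes.toDouble / total >= fraction => host
    }.toSeq.sorted
}

// Map output sizes (in bytes) for a single reduce partition, per host
val sizes = Map("host-a" -> 800L, "host-b" -> 150L, "host-c" -> 50L)
println(preferredHosts(sizes, fraction = 0.2))
```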

ShuffledRDDPartition

ShuffledRDDPartition gets an index when it is created (that in turn is the index of partitions as calculated by the Partitioner of a ShuffledRDD).
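
The indexes can be listed through the public partitions accessor. A small sketch, assuming spark-core on the classpath and made-up sample data:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD

// Reuse a running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("shuffled-rdd-sketch"))

val parent = sc.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 2)
val shuffled = new ShuffledRDD[String, Int, Int](parent, new HashPartitioner(4))

// One partition per partition of the Partitioner, indexed from 0
val indexes = shuffled.partitions.map(_.index).toSeq
val classes = shuffled.partitions.map(_.getClass.getSimpleName).distinct.toSeq

println(s"indexes: $indexes, partition classes: $classes")
```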
