ShuffledRowRDD (Spark Technology Sharing)

ShuffledRowRDD

ShuffledRowRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]).

Note	`ShuffledRowRDD` looks like ShuffledRDD, and the difference is in the type of the values to process, i.e. InternalRow and `(K, C)` key-value pairs, respectively.

ShuffledRowRDD takes a ShuffleDependency (of integer keys and InternalRow values).

Note	The `dependency` property is mutable and is of type `ShuffleDependency[Int, InternalRow, InternalRow]`.

ShuffledRowRDD takes an optional specifiedPartitionStartIndices collection of integers: the indices of the pre-shuffle partitions at which each post-shuffle partition starts, so its size is the number of post-shuffle partitions. When not specified, the number of post-shuffle partitions is managed by the Partitioner of the input ShuffleDependency.
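The relationship between the start indices and the post-shuffle partitions can be illustrated with a small Spark-free sketch. It assumes that each post-shuffle partition covers the pre-shuffle partitions from its start index up to (but not including) the next start index, and that the last one runs to the number of pre-shuffle partitions. The `PostShufflePartitions` object and its `ranges` method are hypothetical names used only for illustration and are not part of Spark.

```scala
// A model of how start indices could translate to post-shuffle partitions.
// Not Spark code: PostShufflePartitions and ranges are hypothetical helpers.
object PostShufflePartitions {
  // (startPreShuffleIndex, endPreShuffleIndex), with an exclusive end index.
  type PreShuffleRange = (Int, Int)

  def ranges(
      numPreShufflePartitions: Int,
      specifiedPartitionStartIndices: Option[Array[Int]] = None): Vector[PreShuffleRange] = {
    // Assumed default: one post-shuffle partition per pre-shuffle partition.
    val starts = specifiedPartitionStartIndices
      .getOrElse(Array.range(0, numPreShufflePartitions))
    starts.indices.map { i =>
      // A range ends where the next one starts; the last ends at the total count.
      val end = if (i < starts.length - 1) starts(i + 1) else numPreShufflePartitions
      (starts(i), end)
    }.toVector
  }
}
```

With five pre-shuffle partitions and start indices `Array(0, 2, 3)`, the sketch yields three post-shuffle partitions, i.e. as many as there are start indices.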

Note	Post-shuffle partition is…FIXME

Table 1. ShuffledRowRDD and RDD Contract
Name	Description
`getDependencies`	A single-element collection with `ShuffleDependency[Int, InternalRow, InternalRow]`.
`partitioner`	CoalescedPartitioner (with the Partitioner of the `dependency`)
`getPreferredLocations`	
`compute`	
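The `partitioner` row above can be pictured with a Spark-free sketch of a coalescing partitioner. It assumes that a key is first assigned a pre-shuffle partition by the parent partitioner and is then mapped to the post-shuffle partition whose range of start indices contains it. `SimplePartitioner`, `coalesce` and `hashParent` are hypothetical stand-ins and not Spark's `Partitioner` API.

```scala
// A model of coalescing a parent partitioner by partition start indices.
// Not Spark code: all names below are hypothetical and for illustration only.
object CoalescingSketch {
  // Stand-in for a partitioner: a partition count plus a key-to-partition function.
  final case class SimplePartitioner(numPartitions: Int, getPartition: Any => Int)

  def coalesce(parent: SimplePartitioner, partitionStartIndices: Array[Int]): SimplePartitioner = {
    val n = parent.numPartitions
    // mapping(j) is the post-shuffle partition that pre-shuffle partition j belongs to.
    val mapping = new Array[Int](n)
    for (i <- partitionStartIndices.indices) {
      val start = partitionStartIndices(i)
      val end = if (i < partitionStartIndices.length - 1) partitionStartIndices(i + 1) else n
      for (j <- start until end) mapping(j) = i
    }
    SimplePartitioner(partitionStartIndices.length, key => mapping(parent.getPartition(key)))
  }

  // A hash-style parent used for demonstration (non-negative modulo of hashCode).
  def hashParent(n: Int): SimplePartitioner =
    SimplePartitioner(n, key => ((key.hashCode % n) + n) % n)
}
```

In this model the coalesced partitioner has as many partitions as there are start indices, while keys that the parent sends to neighbouring pre-shuffle partitions can end up in the same post-shuffle partition.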

`numPreShufflePartitions` Property

Caution

FIXME

Computing Partition (in TaskContext) — `compute` Method



compute(split: Partition, context: TaskContext): Iterator[InternalRow]


Note	`compute` is part of Spark Core’s `RDD` Contract to compute a partition (in a `TaskContext`).

Internally, compute makes sure that the input split is a ShuffledRowRDDPartition. It then requests ShuffleManager for a ShuffleReader to read InternalRows for the split.

Note	`compute` uses `SparkEnv` to access the current `ShuffleManager`.

Note	`compute` passes the `ShuffleHandle` (of the ShuffleDependency `dependency`) and the pre-shuffle start and end partition indices of the split to the `ShuffleManager` when requesting the `ShuffleReader`.
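The flow described above can be modelled without Spark. The sketch assumes that the reader returns key-value pairs for a range of pre-shuffle partitions and that only the values (the rows) are kept. `RowPartition` and `ReadRange` are hypothetical stand-ins for `ShuffledRowRDDPartition` and the `ShuffleReader`, and raising an exception for an unexpected split is an illustrative choice rather than Spark's behaviour.

```scala
// A model of the compute flow: check the split type, read a range, keep the values.
// Not Spark code: the types below are hypothetical stand-ins.
object ComputeSketch {
  trait Partition { def index: Int }

  // Stand-in for ShuffledRowRDDPartition; the end index is exclusive.
  final case class RowPartition(
      index: Int,
      startPreShuffleIndex: Int,
      endPreShuffleIndex: Int) extends Partition

  // Stand-in for a shuffle reader: reads (key, row) pairs for a pre-shuffle range.
  type ReadRange[Row] = (Int, Int) => Iterator[(Int, Row)]

  def compute[Row](split: Partition, read: ReadRange[Row]): Iterator[Row] = split match {
    case p: RowPartition =>
      // Keys only route records during the shuffle; the result holds rows alone.
      read(p.startPreShuffleIndex, p.endPreShuffleIndex).map(_._2)
    case other =>
      throw new IllegalArgumentException(s"Unexpected partition: $other")
  }
}
```

A usage example: with a `read` function backed by an in-memory map of pre-shuffle partition index to rows, `ComputeSketch.compute(ComputeSketch.RowPartition(0, 1, 3), read)` iterates over the rows of pre-shuffle partitions 1 and 2.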

Getting Placement Preferences of Partition — `getPreferredLocations` Method



getPreferredLocations(partition: Partition): Seq[String]


Note	`getPreferredLocations` is part of the `RDD` Contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

Internally, getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the input partition (for the single ShuffleDependency).

Note	`getPreferredLocations` uses `SparkEnv` to access the current `MapOutputTrackerMaster` (which runs on the driver).

`CoalescedPartitioner`

Caution

FIXME

`ShuffledRowRDDPartition`

Caution

FIXME
