ShuffledRowRDD
ShuffledRowRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]).
|
Note
|
ShuffledRowRDD looks like ShuffledRDD, and the difference is in the type of the values to process, i.e. InternalRow and (K, C) key-value pairs, respectively.
|
ShuffledRowRDD takes a ShuffleDependency (of integer keys and InternalRow values).
|
Note
|
The dependency property is mutable and is of type ShuffleDependency[Int, InternalRow, InternalRow].
|
ShuffledRowRDD takes an optional specifiedPartitionStartIndices collection of integers that is the number of post-shuffle partitions. When not specified, the number of post-shuffle partitions is managed by the Partitioner of the input ShuffleDependency.
|
Note
|
Post-shuffle partition is…FIXME |
| Name | Description |
|---|---|
|
|
A single-element collection with |
|
|
CoalescedPartitioner (with the Partitioner of the |
Computing Partition (in TaskContext) — compute Method
|
1 2 3 4 5 |
compute(split: Partition, context: TaskContext): Iterator[InternalRow] |
|
Note
|
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).
|
Internally, compute makes sure that the input split is a ShuffledRowRDDPartition. It then requests ShuffleManager for a ShuffleReader to read InternalRows for the split.
|
Note
|
compute uses SparkEnv to access the current ShuffleManager.
|
|
Note
|
compute uses ShuffleHandle (of ShuffleDependency dependency) and the pre-shuffle start and end partition offsets.
|
Getting Placement Preferences of Partition — getPreferredLocations Method
|
1 2 3 4 5 |
getPreferredLocations(partition: Partition): Seq[String] |
|
Note
|
getPreferredLocations is part of RDD contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.
|
Internally, getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the input partition (for the single ShuffleDependency).
|
Note
|
getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster (which runs on the driver).
|
spark技术分享