ShuffledRowRDD
ShuffledRowRDD
is an RDD
of internal binary rows (i.e. RDD[InternalRow]
).
Note
|
ShuffledRowRDD looks like ShuffledRDD, and the difference is in the type of the values to process, i.e. InternalRow and (K, C) key-value pairs, respectively.
|
ShuffledRowRDD
takes a ShuffleDependency (of integer keys and InternalRow values).
Note
|
The dependency property is mutable and is of type ShuffleDependency[Int, InternalRow, InternalRow] .
|
ShuffledRowRDD
takes an optional specifiedPartitionStartIndices
collection of integers that is the number of post-shuffle partitions. When not specified, the number of post-shuffle partitions is managed by the Partitioner of the input ShuffleDependency
.
Note
|
Post-shuffle partition is…FIXME |
Name | Description |
---|---|
|
A single-element collection with |
|
CoalescedPartitioner (with the Partitioner of the |
Computing Partition (in TaskContext) — compute
Method
1 2 3 4 5 |
compute(split: Partition, context: TaskContext): Iterator[InternalRow] |
Note
|
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext ).
|
Internally, compute
makes sure that the input split
is a ShuffledRowRDDPartition. It then requests ShuffleManager
for a ShuffleReader
to read InternalRow
s for the split
.
Note
|
compute uses SparkEnv to access the current ShuffleManager .
|
Note
|
compute uses ShuffleHandle (of ShuffleDependency dependency) and the pre-shuffle start and end partition offsets.
|
Getting Placement Preferences of Partition — getPreferredLocations
Method
1 2 3 4 5 |
getPreferredLocations(partition: Partition): Seq[String] |
Note
|
getPreferredLocations is part of RDD contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.
|
Internally, getPreferredLocations
requests MapOutputTrackerMaster
for the preferred locations of the input partition
(for the single ShuffleDependency).
Note
|
getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster (which runs on the driver).
|