RDD
RDD is a description of a distributed computation over a dataset of records of type T.

RDD is identified by a unique identifier (aka RDD ID) that is unique among all RDDs in a SparkContext.
```scala
id: Int
```
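For illustration, a hypothetical spark-shell session (variable names made up) shows that every new RDD gets the next available id in its SparkContext:

```scala
// Every RDD created in a SparkContext gets a unique id.
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)
println(nums.id)     // e.g. 0
println(doubled.id)  // e.g. 1 -- unique among all RDDs of this SparkContext
```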
RDD has a storage level that…FIXME
```scala
storageLevel: StorageLevel
```
The storage level of an RDD is StorageLevel.NONE by default, which is…FIXME
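As a quick check (a minimal sketch using the public RDD API; assumes a running spark-shell with sc available), the storage level stays StorageLevel.NONE until the RDD is persisted:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
assert(rdd.getStorageLevel == StorageLevel.NONE)  // default: not persisted

rdd.persist(StorageLevel.MEMORY_ONLY)             // request in-memory caching
assert(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)
```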
Getting Or Computing RDD Partition — getOrCompute Method
```scala
getOrCompute(partition: Partition, context: TaskContext): Iterator[T]
```
getOrCompute creates an RDDBlockId for the RDD id and the partition index.
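For reference, RDDBlockId is a simple case class in org.apache.spark.storage that encodes the RDD id and the partition index in the block name:

```scala
import org.apache.spark.storage.RDDBlockId

// The block for partition 1 of the RDD with id 0 is named "rdd_0_1".
val blockId = RDDBlockId(rddId = 0, splitIndex = 1)
assert(blockId.name == "rdd_0_1")
```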
getOrCompute requests the BlockManager to getOrElseUpdate for the block ID (with the storage level and the makeIterator function).
Note: getOrCompute uses SparkEnv to access the current BlockManager.
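Putting the pieces together, a simplified paraphrase of the flow (names follow the Spark sources, but metrics bookkeeping and other details are elided, and internals vary between Spark versions):

```scala
// Simplified paraphrase of RDD.getOrCompute (metrics and details elided).
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    // makeIterator: only invoked when the block is not already available.
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      // The block is available (found locally or remotely, or just stored).
      new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
    case Right(iter) =>
      // The block could not be handled; fall back to the just-computed iterator.
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
```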
getOrCompute records whether…FIXME (readCachedBlock)
getOrCompute branches off based on the response from the BlockManager and on whether the internal readCachedBlock flag is on or still off. In either case, getOrCompute creates an InterruptibleIterator.
Note: InterruptibleIterator simply delegates to a wrapped internal Iterator, but allows for task-killing functionality.
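A minimal sketch of that idea (an illustration, not the exact Spark source; the real class is org.apache.spark.InterruptibleIterator):

```scala
import org.apache.spark.TaskContext

// Delegates to a wrapped iterator, but checks for a kill request before
// producing more elements, so a killed task stops promptly.
class InterruptibleIterator[+T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    context.killTaskIfInterrupted()  // throws TaskKilledException if the task was killed
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}
```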
For a BlockResult available and the readCachedBlock flag on, getOrCompute …FIXME

For a BlockResult available and the readCachedBlock flag off, getOrCompute …FIXME
Note: The BlockResult could be found in a local block manager or fetched from a remote block manager. It may also have been stored (persisted) just now. In either case, the BlockResult is available (and BlockManager.getOrElseUpdate gives a Left value with the BlockResult).
Note: BlockManager.getOrElseUpdate gives a Right(iter) value to indicate an error with a block.
Note: getOrCompute is used on Spark executors.
Note: getOrCompute is a private[spark] method that is used exclusively when iterating over a partition of a cached RDD.
Computing Partition (in TaskContext) — compute Method
```scala
compute(split: Partition, context: TaskContext): Iterator[T]
```
The abstract compute method computes the input split partition in the TaskContext to produce a collection of values (of type T).
compute is implemented by every concrete RDD in Spark and is called every time the records are requested, unless the RDD is cached or checkpointed (in which case the records can be read from external storage, this time closer to the compute node).
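For illustration, a minimal custom RDD (hypothetical class and parameter names; a sketch, not production code) only needs to describe its partitions and implement compute:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical RDD: each partition computes a consecutive chunk of Ints.
class RangeChunksRDD(sc: SparkContext, numChunks: Int, chunkSize: Int)
    extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs (no dependencies)

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numChunks)(i => new Partition { override def index: Int = i })

  // Called once per partition, inside a task, to produce the records.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * chunkSize
    Iterator.range(start, start + chunkSize)
  }
}
```

With a live SparkContext, new RangeChunksRDD(sc, 4, 10).collect() should return the values 0 to 39.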
When an RDD is cached, for the specified storage levels (i.e. all but NONE), CacheManager is requested to get or compute partitions.
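A quick way to observe the difference in a hypothetical spark-shell session (the println runs inside the computation on the executors, so output may interleave):

```scala
val rdd = sc.parallelize(1 to 3).map { x => println(s"computing $x"); x }

rdd.cache()   // same as persist(StorageLevel.MEMORY_ONLY)
rdd.count()   // prints "computing …" -- compute runs and the partitions are stored
rdd.count()   // prints nothing -- records come from the cache, compute is skipped
```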
Note: compute runs on executors (inside tasks), not on the driver.