关注 spark技术分享,
撸spark源码 玩spark最佳实践

RDD

RDD

RDD is a description of a distributed computation over dataset of records of type T.

RDD is identified by a unique identifier (aka RDD ID) that is unique among all RDDs in a SparkContext.

RDD has a storage level that…​FIXME

The storage level of an RDD is StorageLevel.NONE by default which is…​FIXME

Getting Or Computing RDD Partition — getOrCompute Method

getOrCompute creates a RDDBlockId for the RDD id and the partition index.

getOrCompute requests the BlockManager to getOrElseUpdate for the block ID (with the storage level and the makeIterator function).

Note
getOrCompute uses SparkEnv to access the current BlockManager.

getOrCompute records whether…​FIXME (readCachedBlock)

getOrCompute branches off per the response from the BlockManager and whether the internal readCachedBlock flag is now on or still off. In either case, getOrCompute creates an InterruptibleIterator.

Note
InterruptibleIterator simply delegates to a wrapped internal Iterator, but allows for task killing functionality.

For a BlockResult available and readCachedBlock flag on, getOrCompute…​FIXME

For a BlockResult available and readCachedBlock flag off, getOrCompute…​FIXME

Note
The BlockResult could be found in a local block manager or fetched from a remote block manager. It may also have been stored (persisted) just now. In either case, the BlockResult is available (and BlockManager.getOrElseUpdate gives a Left value with the BlockResult).

For Right(iter) (regardless of the value of readCachedBlock flag since…​FIXME), getOrCompute…​FIXME

Note
BlockManager.getOrElseUpdate gives a Right(iter) value to indicate an error with a block.
Note
getOrCompute is used on Spark executors.
Note
getOrCompute is a private[spark] method that is exclusively used when iterating over partition when a RDD is cached.

Computing Partition (in TaskContext) — compute Method

The abstract compute method computes the input split partition in the TaskContext to produce a collection of values (of type T).

compute is implemented by any type of RDD in Spark and is called every time the records are requested unless RDD is cached or checkpointed (and the records can be read from an external storage, but this time closer to the compute node).

When an RDD is cached, for specified storage levels (i.e. all but NONE) CacheManager is requested to get or compute partitions.

Note
compute method runs on the driver.
赞(0) 打赏
未经允许不得转载:spark技术分享 » RDD
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏