RDD
RDD is a description of a distributed computation over a dataset of records of type T.

RDD is identified by a unique identifier (aka RDD ID) that is unique among all RDDs in a SparkContext.
```scala
id: Int
```
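For illustration, a hypothetical spark-shell session (variable names made up) shows that every new RDD gets the next available id in its SparkContext:

```scala
// Every RDD created in a SparkContext gets a unique id.
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)
println(nums.id)     // e.g. 0
println(doubled.id)  // e.g. 1 -- unique among all RDDs of this SparkContext
```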
RDD has a storage level that…FIXME
```scala
storageLevel: StorageLevel
```
The storage level of an RDD is StorageLevel.NONE by default, which is…FIXME
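As a quick check (a minimal sketch using the public RDD API; assumes a running spark-shell with sc available), the storage level stays StorageLevel.NONE until the RDD is persisted:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
assert(rdd.getStorageLevel == StorageLevel.NONE)  // default: not persisted

rdd.persist(StorageLevel.MEMORY_ONLY)             // request in-memory caching
assert(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)
```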
Getting Or Computing RDD Partition — getOrCompute Method
```scala
getOrCompute(partition: Partition, context: TaskContext): Iterator[T]
```
getOrCompute creates an RDDBlockId for the RDD id and the partition index.
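For reference, RDDBlockId is a simple case class in org.apache.spark.storage that encodes the RDD id and the partition index in the block name:

```scala
import org.apache.spark.storage.RDDBlockId

// The block for partition 1 of the RDD with id 0 is named "rdd_0_1".
val blockId = RDDBlockId(rddId = 0, splitIndex = 1)
assert(blockId.name == "rdd_0_1")
```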
getOrCompute requests the BlockManager to getOrElseUpdate for the block ID (with the storage level and the makeIterator function).
Note: getOrCompute uses SparkEnv to access the current BlockManager.
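Putting the pieces together, a simplified paraphrase of the flow (names follow the Spark sources, but metrics bookkeeping and other details are elided, and internals vary between Spark versions):

```scala
// Simplified paraphrase of RDD.getOrCompute (metrics and details elided).
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    // makeIterator: only invoked when the block is not already available.
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      // The block is available (found locally or remotely, or just stored).
      new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
    case Right(iter) =>
      // The block could not be handled; fall back to the just-computed iterator.
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
```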
getOrCompute records whether…FIXME (readCachedBlock)
getOrCompute branches off based on the response from the BlockManager and on whether the internal readCachedBlock flag is on or still off. In either case, getOrCompute creates an InterruptibleIterator.
Note: InterruptibleIterator simply delegates to a wrapped internal Iterator, but allows for task-killing functionality.
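A minimal sketch of that idea (an illustration, not the exact Spark source; the real class is org.apache.spark.InterruptibleIterator):

```scala
import org.apache.spark.TaskContext

// Delegates to a wrapped iterator, but checks for a kill request before
// producing more elements, so a killed task stops promptly.
class InterruptibleIterator[+T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    context.killTaskIfInterrupted()  // throws TaskKilledException if the task was killed
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}
```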
For a BlockResult available and the readCachedBlock flag on, getOrCompute …FIXME

For a BlockResult available and the readCachedBlock flag off, getOrCompute …FIXME
Note: The BlockResult could be found in a local block manager or fetched from a remote block manager. It may also have been stored (persisted) just now. In either case, the BlockResult is available (and BlockManager.getOrElseUpdate gives a Left value with the BlockResult).
Note: BlockManager.getOrElseUpdate gives a Right(iter) value to indicate an error with a block.
Note: getOrCompute is used on Spark executors.
Note: getOrCompute is a private[spark] method that is used exclusively when iterating over a partition of a cached RDD.
Computing Partition (in TaskContext) — compute Method
```scala
compute(split: Partition, context: TaskContext): Iterator[T]
```
The abstract compute method computes the input split partition in the TaskContext to produce a collection of values (of type T).
compute is implemented by every concrete RDD in Spark and is called every time the records are requested, unless the RDD is cached or checkpointed (in which case the records can be read from external storage, this time closer to the compute node).
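For illustration, a minimal custom RDD (hypothetical class and parameter names; a sketch, not production code) only needs to describe its partitions and implement compute:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical RDD: each partition computes a consecutive chunk of Ints.
class RangeChunksRDD(sc: SparkContext, numChunks: Int, chunkSize: Int)
    extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs (no dependencies)

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numChunks)(i => new Partition { override def index: Int = i })

  // Called once per partition, inside a task, to produce the records.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * chunkSize
    Iterator.range(start, start + chunkSize)
  }
}
```

With a live SparkContext, new RangeChunksRDD(sc, 4, 10).collect() should return the values 0 to 39.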
When an RDD is cached, for the specified storage levels (i.e. all but NONE), CacheManager is requested to get or compute partitions.
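A quick way to observe the difference in a hypothetical spark-shell session (the println runs inside the computation on the executors, so output may interleave):

```scala
val rdd = sc.parallelize(1 to 3).map { x => println(s"computing $x"); x }

rdd.cache()   // same as persist(StorageLevel.MEMORY_ONLY)
rdd.count()   // prints "computing …" -- compute runs and the partitions are stored
rdd.count()   // prints nothing -- records come from the cache, compute is skipped
```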
Note: compute runs on executors (inside tasks), not on the driver.