BlockStoreShuffleReader

BlockStoreShuffleReader is the one and only known ShuffleReader that reads the combined key-values for the reduce task (for a range of start and end reduce partitions) from a shuffle by requesting them from block managers.

BlockStoreShuffleReader is created exclusively when SortShuffleManager is requested for the ShuffleReader for a range of reduce partitions.

Reading Combined Key-Value Records For Reduce Task (using ShuffleBlockFetcherIterator) — `read` Method



read(): Iterator[Product2[K, C]]

Note	`read` is part of ShuffleReader Contract.

Internally, read first creates a ShuffleBlockFetcherIterator (passing in the values of spark.reducer.maxSizeInFlight, spark.reducer.maxReqsInFlight and spark.shuffle.detectCorrupt Spark properties).

Note	`read` uses `BlockManager` to access `ShuffleClient` to create `ShuffleBlockFetcherIterator`.

Note	`read` uses `MapOutputTracker` to find the BlockManagers with the shuffle blocks and sizes to create `ShuffleBlockFetcherIterator`.

read creates a new SerializerInstance (using Serializer from ShuffleDependency).

read creates a key/value iterator by calling deserializeStream on every shuffle block stream and reading the records as key-value pairs.
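The deserialization step can be pictured with plain Java serialization. This is only a sketch: the `KeyValueStreamSketch` object, its method names and the alternating key/value layout are assumptions made for illustration, not Spark's `SerializerInstance` API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for deserializing shuffle block streams into records.
object KeyValueStreamSketch {
  // Writes records as alternating key and value objects (assumed layout).
  def serialize(records: Seq[(String, Int)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    records.foreach { case (k, v) =>
      out.writeObject(k)
      out.writeObject(Int.box(v))
    }
    out.close()
    bytes.toByteArray
  }

  // Lazily reads (key, value) pairs until the stream is exhausted.
  def keyValueIterator(data: Array[Byte]): Iterator[(String, Int)] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    new Iterator[(String, Int)] {
      private var nextPair: Option[(String, Int)] = None
      private var finished = false

      override def hasNext: Boolean = {
        if (nextPair.isEmpty && !finished) {
          try {
            val k = in.readObject().asInstanceOf[String]
            val v = in.readObject().asInstanceOf[java.lang.Integer].intValue()
            nextPair = Some((k, v))
          } catch {
            case _: EOFException =>
              finished = true
              in.close()
          }
        }
        nextPair.isDefined
      }

      override def next(): (String, Int) = {
        if (!hasNext) throw new NoSuchElementException("end of stream")
        val pair = nextPair.get
        nextPair = None
        pair
      }
    }
  }

  // Concatenates the record iterators of several block streams lazily.
  def readBlocks(blocks: Seq[Array[Byte]]): Iterator[(String, Int)] =
    blocks.iterator.flatMap(block => keyValueIterator(block))
}
```

Because `flatMap` on an iterator is lazy, a block is only deserialized once the records of the previous block have been consumed.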

read updates the context task metrics for each record read.

Note	`read` uses `CompletionIterator` (to count the records read) and InterruptibleIterator (to support task cancellation).
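The two wrappers in the note above can be approximated in plain Scala. The sketch below is not Spark's `CompletionIterator` or `InterruptibleIterator`; `RecordCountingSketch`, `ReadMetrics`, `counting` and `interruptible` are hypothetical names, and the cancellation flag stands in for whatever the task context exposes.

```scala
// Hypothetical stand-ins for the iterator wrappers used around the record iterator.
object RecordCountingSketch {
  // Simplified replacement for the task's shuffle read metrics.
  final class ReadMetrics {
    var recordsRead: Long = 0L
    var completed: Boolean = false
  }

  // Counts every record handed out and runs `onComplete` once,
  // the first time the underlying iterator reports that it is exhausted.
  def counting[A](underlying: Iterator[A], metrics: ReadMetrics)(onComplete: => Unit): Iterator[A] =
    new Iterator[A] {
      private var done = false

      override def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !done) {
          done = true
          onComplete
        }
        more
      }

      override def next(): A = {
        val record = underlying.next()
        metrics.recordsRead += 1
        record
      }
    }

  // Checks a cancellation flag before each record so a cancelled task stops reading early.
  def interruptible[A](underlying: Iterator[A], isCancelled: () => Boolean): Iterator[A] =
    new Iterator[A] {
      override def hasNext: Boolean = {
        if (isCancelled()) throw new InterruptedException("task was cancelled")
        underlying.hasNext
      }

      override def next(): A = underlying.next()
    }
}
```

Stacking the wrappers keeps the pipeline lazy: nothing is counted or checked until a downstream consumer pulls records.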

If the ShuffleDependency has an Aggregator defined, read wraps the current iterator inside an iterator defined by Aggregator.combineCombinersByKey (when mapSideCombine is enabled, since the values were already combined on the map side) or Aggregator.combineValuesByKey otherwise.

Note	`read` reports an exception when `ShuffleDependency` has the `mapSideCombine` flag enabled but no `Aggregator` defined.

For keyOrdering defined in ShuffleDependency, read does the following:
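The difference between the two combine paths can be sketched with a simplified aggregator. `CombineSketch` and `SimpleAggregator` are hypothetical and keep everything in an in-memory `Map`; they only mirror the roles of the create, merge-value and merge-combiners functions.

```scala
// Hypothetical, in-memory illustration of the two reduce-side combine paths.
object CombineSketch {
  // V is the type of a raw value, C the type of a combined value.
  final case class SimpleAggregator[V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C)

  // No map-side combine: incoming records carry raw values that still have to be combined.
  def combineValuesByKey[K, V, C](
      records: Iterator[(K, V)],
      agg: SimpleAggregator[V, C]): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val combined = acc.get(k) match {
        case Some(c) => agg.mergeValue(c, v)
        case None => agg.createCombiner(v)
      }
      acc.updated(k, combined)
    }

  // Map-side combine enabled: incoming records are already combiners and only get merged.
  def combineCombinersByKey[K, V, C](
      records: Iterator[(K, C)],
      agg: SimpleAggregator[V, C]): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      val merged = acc.get(k).map(existing => agg.mergeCombiners(existing, c)).getOrElse(c)
      acc.updated(k, merged)
    }
}
```

With a summing aggregator both paths give the same totals; they differ in which function is applied to each incoming record.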

Creates an ExternalSorter
Inserts all the records into the ExternalSorter
Updates context TaskMetrics
Returns a CompletionIterator for the ExternalSorter
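As a rough picture of the sorting branch, the sketch below orders records by key only when an ordering is present. It sorts fully in memory, which is a deliberate simplification and not what an external sorter does (that one can spill to disk); `KeyOrderingSketch` and `sortByKey` are hypothetical names.

```scala
// Hypothetical in-memory stand-in for the optional sort-by-key step.
object KeyOrderingSketch {
  // With an ordering: materialize all records, sort them by key, return a fresh iterator.
  // Without an ordering: pass the records through untouched.
  def sortByKey[K, C](
      records: Iterator[(K, C)],
      keyOrdering: Option[Ordering[K]]): Iterator[(K, C)] =
    keyOrdering match {
      case Some(ordering) => records.toVector.sortBy(_._1)(ordering).iterator
      case None => records
    }
}
```

Materializing the records is what makes a spilling sorter necessary for large partitions; the sketch is only suitable for small inputs.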

Settings

Table 1. Spark Properties
Spark Property	Default Value	Description
`spark.reducer.maxSizeInFlight`	`48m`	Maximum total size of map outputs that each reduce task fetches simultaneously, given as a size string (the default `48m` is 48 MiB) and converted to bytes. Since each output requires a new buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory. Used when `BlockStoreShuffleReader` creates a `ShuffleBlockFetcherIterator` to read records.
`spark.reducer.maxReqsInFlight`	(unlimited)	The maximum number of remote requests to fetch blocks at any given point. As the number of hosts in the cluster increases, a node may receive a very large number of inbound connections, causing workers to fail under load. Limiting the number of fetch requests mitigates this scenario. Used when `BlockStoreShuffleReader` creates a `ShuffleBlockFetcherIterator` to read records.
`spark.shuffle.detectCorrupt`	`true`	Controls whether to detect any corruption in fetched blocks. Used when `BlockStoreShuffleReader` creates a `ShuffleBlockFetcherIterator` to read records.
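To show how the first two properties interact, here is a simplified admission check for queued fetch requests. It is a sketch under assumptions: `InFlightThrottleSketch`, `FetchRequest` and `admit` are hypothetical, and the rule that a single oversized request may still go out when nothing else is in flight is an assumption of this sketch rather than a statement about `ShuffleBlockFetcherIterator`.

```scala
// Hypothetical illustration of throttling fetches by bytes and by request count.
object InFlightThrottleSketch {
  // A queued request: the remote address and the total bytes it would fetch.
  final case class FetchRequest(address: String, sizeInBytes: Long)

  // Splits the queue into requests that may be sent now and requests that must wait.
  // Requests are admitted in order until one of the two limits would be exceeded.
  def admit(
      pending: List[FetchRequest],
      maxBytesInFlight: Long,
      maxReqsInFlight: Int): (List[FetchRequest], List[FetchRequest]) = {
    val admitted = List.newBuilder[FetchRequest]
    var remaining = pending
    var bytesInFlight = 0L
    var reqsInFlight = 0
    var blocked = false
    while (!blocked && remaining.nonEmpty) {
      val request = remaining.head
      // Assumption: an oversized request is still allowed when nothing is in flight.
      val fitsBytes = bytesInFlight == 0L || bytesInFlight + request.sizeInBytes <= maxBytesInFlight
      val fitsCount = reqsInFlight < maxReqsInFlight
      if (fitsBytes && fitsCount) {
        admitted += request
        bytesInFlight += request.sizeInBytes
        reqsInFlight += 1
        remaining = remaining.tail
      } else {
        blocked = true
      }
    }
    (admitted.result(), remaining)
  }
}
```

With a 48 MiB byte budget and no request limit, requests of 30 MiB and 10 MiB are admitted together while a following 20 MiB request has to wait until bytes are released.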

Creating BlockStoreShuffleReader Instance

BlockStoreShuffleReader takes the following when created:

BaseShuffleHandle
Reduce start partition index
Reduce end partition index
TaskContext
SerializerManager
BlockManager
MapOutputTracker

BlockStoreShuffleReader initializes the internal registries and counters.

Note	`BlockStoreShuffleReader` uses `SparkEnv` to access the SerializerManager, BlockManager and MapOutputTracker.
