ResultTask - Spark Tech Sharing

ResultTask

ResultTask is a Task that executes a function on the records in an RDD partition.

ResultTask is created exclusively when DAGScheduler submits missing tasks for a ResultStage.

ResultTask is created with a broadcast variable that holds the serialized RDD and the function to execute on it, together with the partition to compute.

Table 1. ResultTask’s Internal Registries and Counters
Name	Description
`preferredLocs`	Collection of TaskLocations. Holds the unique entries of `locs`; when `locs` is not defined, it is empty and the task has no location preferences. Initialized when `ResultTask` is created. Used exclusively when `ResultTask` is requested for preferred locations.
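
The rule in the table can be sketched in a few lines. This is a minimal illustration, not Spark's own code: `ToyLocation` is a hypothetical stand-in for `TaskLocation`, `null` is assumed to model "`locs` is not defined", and `distinct` is one way to keep only unique entries.

```scala
object PreferredLocsSketch {
  // Hypothetical stand-in for Spark's TaskLocation.
  final case class ToyLocation(host: String)

  // Unique entries of locs, or an empty collection when locs is not defined
  // (modelled here as null, which is an assumption of this sketch).
  def preferredLocs(locs: Seq[ToyLocation]): Seq[ToyLocation] =
    if (locs == null) Nil else locs.distinct
}
```

For example, `PreferredLocsSketch.preferredLocs(Seq(ToyLocation("a"), ToyLocation("b"), ToyLocation("a")))` keeps one entry per host, while passing `null` yields an empty collection.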

Creating ResultTask Instance

ResultTask takes the following when created:

stageId — the stage the task is executed for
stageAttemptId — the stage attempt id
Broadcast variable with the serialized task (as Array[Byte]). The broadcast contains a serialized pair of the RDD and the function to execute.
Partition to compute
Collection of TaskLocations, i.e. preferred locations (executors) to execute the task on
outputId
Local properties
The stage’s serialized TaskMetrics (as Array[Byte])
(optional) Job id
(optional) Application id
(optional) Application attempt id

ResultTask initializes the internal registries and counters.
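
To make the shape of that parameter list easier to see, here is a rough sketch of it as a plain Scala class. Everything prefixed with `Toy` is a hypothetical stand-in for the corresponding Spark type. The names `stageId`, `stageAttemptId`, `outputId`, `taskBinary`, `partition` and `locs` appear in this text; the remaining parameter names and all of the concrete types are assumptions made for illustration.

```scala
import java.util.Properties

object ResultTaskParamsSketch {
  // Hypothetical stand-ins for Spark's Broadcast, Partition and TaskLocation.
  final case class ToyBroadcast[T](value: T)
  final case class ToyPartition(index: Int)
  final case class ToyLocation(host: String)

  // One constructor parameter per item in the list above, in the same order.
  final class ToyResultTask(
      val stageId: Int,
      val stageAttemptId: Int,
      val taskBinary: ToyBroadcast[Array[Byte]], // serialized (RDD, function) pair
      val partition: ToyPartition,               // partition to compute
      val locs: Seq[ToyLocation],                // preferred locations
      val outputId: Int,
      val localProperties: Properties,           // assumed name
      val serializedTaskMetrics: Array[Byte],    // assumed name
      val jobId: Option[Int] = None,             // optional, assumed name
      val appId: Option[String] = None,          // optional, assumed name
      val appAttemptId: Option[String] = None)   // optional, assumed name
}
```

The three optional values are modelled as `Option` parameters with `None` defaults, so a caller can omit them.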

`preferredLocations` Method



preferredLocations: Seq[TaskLocation]

Note	`preferredLocations` is part of the Task contract.

preferredLocations simply returns the preferredLocs internal property.

Deserialize RDD and Function (From Broadcast) and Execute Function (on RDD Partition) — `runTask` Method



runTask(context: TaskContext): U

Note	`U` is the type of a result as defined when `ResultTask` is created.

runTask deserializes an RDD and a function from the broadcast and then executes the function (on the records from the RDD partition).

Note	`runTask` is part of the Task contract to run a task.

Internally, runTask starts by tracking the time required to deserialize an RDD and a function to execute.

runTask creates a new instance of the closure Serializer.

Note	`runTask` uses `SparkEnv` to access the current closure `Serializer`.

runTask requests the closure Serializer to deserialize an RDD and the function to execute (from the taskBinary broadcast).

Note	The taskBinary broadcast is defined when `ResultTask` is created.

runTask records the _executorDeserializeTime and _executorDeserializeCpuTime properties.

In the end, runTask executes the function (passing in the input context and the records from the partition of the RDD).

Note	The `partition` used to access the records in the deserialized RDD is defined when `ResultTask` is created.
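
The steps above can be sketched in plain Scala without a Spark dependency. This is a simplified model under stated assumptions, not Spark's implementation: `ToyRDD`, `PartitionFunc`, `SumFunc`, `TaskRun` and `serialize` are hypothetical names, Java serialization stands in for the closure Serializer, a byte array stands in for the taskBinary broadcast, and the `TaskContext` argument is left out.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.lang.management.ManagementFactory

object RunTaskSketch {
  // Hypothetical stand-in for an RDD: one Vector of records per partition.
  final case class ToyRDD(partitions: Vector[Vector[Int]]) {
    def iterator(partition: Int): Iterator[Int] = partitions(partition).iterator
  }

  // Hypothetical stand-in for the function to execute on a partition's records.
  trait PartitionFunc extends Serializable {
    def apply(records: Iterator[Int]): Int
  }
  final case class SumFunc() extends PartitionFunc {
    def apply(records: Iterator[Int]): Int = records.sum
  }

  // Result plus the two deserialization timings recorded by the sketch.
  final case class TaskRun(result: Int, deserializeTimeMs: Long, deserializeCpuTimeNs: Long)

  // Assumed driver-side step: serialize the (rdd, func) pair into task bytes.
  def serialize(rdd: ToyRDD, func: PartitionFunc): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject((rdd, func))
    out.close()
    bytes.toByteArray
  }

  def runTask(taskBinary: Array[Byte], partition: Int): TaskRun = {
    val threadMXBean = ManagementFactory.getThreadMXBean
    def cpuTime(): Long =
      if (threadMXBean.isCurrentThreadCpuTimeSupported) threadMXBean.getCurrentThreadCpuTime
      else 0L

    // 1. Start tracking wall-clock and CPU time of the deserialization.
    val startTime = System.currentTimeMillis()
    val startCpuTime = cpuTime()

    // 2. Deserialize the (rdd, func) pair from the task bytes.
    val in = new ObjectInputStream(new ByteArrayInputStream(taskBinary))
    val (rdd, func) =
      try in.readObject().asInstanceOf[(ToyRDD, PartitionFunc)]
      finally in.close()

    // 3. Record how long the deserialization took.
    val deserializeTime = System.currentTimeMillis() - startTime
    val deserializeCpuTime = cpuTime() - startCpuTime

    // 4. Execute the function on the records of the requested partition.
    TaskRun(func(rdd.iterator(partition)), deserializeTime, deserializeCpuTime)
  }
}
```

A usage example: serialize a two-partition `ToyRDD` together with `SumFunc()` using `RunTaskSketch.serialize`, then call `RunTaskSketch.runTask(bytes, 0)` to sum the first partition only. Each call deserializes its own copy of the pair, which mirrors the point that every task works from the broadcast bytes rather than from a shared in-memory object.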
