
Jobs

ActiveJob

A job (aka action job or active job) is a top-level work item (computation) submitted to DAGScheduler to compute the result of an action (or for Adaptive Query Planning / Adaptive Scheduling).

Figure 1. RDD actions submit jobs to DAGScheduler
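As a minimal sketch of this flow (assuming a local Spark installation; the object name JobDemo is made up for the example), every RDD action ends up in SparkContext.runJob, which submits a job to DAGScheduler:

    import org.apache.spark.{SparkConf, SparkContext}

    object JobDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("JobDemo"))
        val rdd = sc.parallelize(1 to 100, numSlices = 4)
        // Each action below goes through SparkContext.runJob, which
        // submits a separate job to DAGScheduler.
        val n = rdd.count()  // job 0
        val s = rdd.sum()    // job 1, over the same 4 partitions
        println(s"count=$n, sum=$s")
        sc.stop()
      }
    }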

Computing a job is equivalent to computing the partitions of the RDD the action was executed upon. The number of partitions to compute depends on the type of the job's final stage, i.e. ResultStage or ShuffleMapStage.

A job starts with a single target RDD, but can ultimately include other RDDs that are all part of the target RDD’s lineage graph.

Any parent stages of the final stage are instances of ShuffleMapStage.
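A short sketch of how a lineage spanning several RDDs maps to stages (assuming sc is an existing SparkContext): the shuffle introduced by reduceByKey makes DAGScheduler create a parent ShuffleMapStage, followed by the final ResultStage for the action.

    // words -> pairs -> counts: one job over `counts` covers all three RDDs.
    val words  = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs  = words.map(w => (w, 1))
    val counts = pairs.reduceByKey(_ + _)
    // toDebugString prints the lineage, including the ShuffleDependency
    // that splits the job into a ShuffleMapStage and a ResultStage.
    println(counts.toDebugString)
    counts.collect()  // the action that actually submits the job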

Figure 2. Computing a job is computing the partitions of an RDD
Note
Not all partitions always have to be computed for a ResultStage; actions such as first() and lookup() can finish after computing only a subset of the partitions.
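A sketch of the same idea using the public SparkContext.runJob overload that takes an explicit set of partitions (assuming sc as above):

    // first() normally starts by computing only partition 0 and fetches
    // more partitions only if that one turns out to be empty.
    val rdd = sc.parallelize(1 to 100, 4)
    val firstElem = rdd.first()
    // runJob can restrict a job to a subset of partitions explicitly;
    // here only partitions 0 and 1 are ever computed.
    val partialSums: Array[Int] =
      sc.runJob(rdd, (it: Iterator[Int]) => it.sum, Seq(0, 1))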

Internally, a job is represented by an instance of private[spark] class org.apache.spark.scheduler.ActiveJob.
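For reference, the constructor in the Spark 2.x source looks roughly like this (a paraphrase from memory of the source, not an exact copy):

    private[spark] class ActiveJob(
        val jobId: Int,          // unique job id
        val finalStage: Stage,   // ResultStage or ShuffleMapStage
        val callSite: CallSite,  // where in user code the job was submitted
        val listener: JobListener,
        val properties: Properties)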

Caution
  • Where are instances of ActiveJob used?

A job can be one of two logical types (distinguished only by the internal finalStage field of ActiveJob):

  • Map-stage job that computes the map output files for a ShuffleMapStage before any downstream stages are submitted (used for Adaptive Query Planning / Adaptive Scheduling, to look at map output statistics before submitting later stages).

  • Result job that computes a ResultStage to execute an action.

Jobs track how many partitions have already been computed (using a finished array of Boolean elements, one flag per partition).
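A hedged paraphrase of how ActiveJob derives the partition count from finalStage and tracks completion (field names follow the Spark source, but treat the snippet as a sketch rather than a verbatim copy):

    // The number of partitions the job must compute depends on the stage type:
    val numPartitions = finalStage match {
      case r: ResultStage     => r.partitions.length  // may be a subset, e.g. first()
      case m: ShuffleMapStage => m.rdd.partitions.length
    }
    // One flag per partition, flipped to true as tasks finish.
    val finished = Array.fill[Boolean](numPartitions)(false)
    var numFinished = 0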
