RDD Lineage — Logical Execution Plan

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and forms a logical execution plan.

Note
The execution DAG or physical execution plan is the DAG of stages.
Note
The following diagram uses cartesian and zip for learning purposes only. You may use other operators to build an RDD graph.
Figure 1. RDD lineage

The above RDD graph could be the result of the following series of transformations:
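
A minimal sketch of such a series, assuming a local SparkContext. The object name, the RDD names (r00, r01 and so on) and the input values are illustrative assumptions, not necessarily the exact RDDs drawn in the figure:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; every name here is an assumption made for illustration.
object LineageGraphSketch {
  // Builds a small RDD graph with cartesian and zip, then counts the final RDD.
  def run(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-graph-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      // Two root RDDs with no parents. zip needs the same number of
      // partitions and the same number of elements per partition.
      val r00 = sc.parallelize(0 to 9, 2)
      val r01 = sc.parallelize(0 to 90 by 10, 2)

      val r10 = r00.cartesian(r01)    // parents: r00 and r01
      val r11 = r00.map(n => (n, n))  // parent: r00
      val r12 = r00.zip(r01)          // parents: r00 and r01
      val r13 = r01.keyBy(_ / 20)     // parent: r01

      // union ties the four branches into a single RDD
      val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

      // Only the lineage exists so far; count() is the action that runs it.
      r20.count()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Each val only records a new node in the lineage graph. No data is processed until count() is called on the last RDD.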

An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.

You can inspect an RDD lineage graph using the RDD.toDebugString method.

Logical Execution Plan

A logical execution plan starts with the earliest RDDs (those that have no dependencies on other RDDs or that reference cached data) and ends with the RDD that produces the result of the action that has been called.

Note
A logical plan, i.e. a DAG, is materialized and executed when SparkContext is requested to run a Spark job.
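
The path from the final RDD back to the earliest ones can be followed through RDD.dependencies. The following is a minimal sketch, assuming a local SparkContext; the object name and the lineage helper are hypothetical and written for illustration, not part of Spark's API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; `lineage` is an illustrative helper, not a Spark method.
object LineageWalkSketch {
  // Depth-first list of an RDD followed by all of its ancestors.
  // In a graph with shared parents the same ancestor can appear more than once.
  def lineage(rdd: RDD[_]): Seq[RDD[_]] =
    rdd +: rdd.dependencies.flatMap(dep => lineage(dep.rdd))

  // Returns (number of RDDs in the chain, whether the last one has no parents).
  def run(): (Int, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-walk-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val numbers = sc.parallelize(1 to 10)             // earliest RDD, no dependencies
      val result  = numbers.map(_ * 2).filter(_ > 5)    // two transformations on top

      val chain = lineage(result)
      chain.foreach(r => println(s"${r.id}: $r"))       // final RDD first, root last

      (chain.size, chain.last.dependencies.isEmpty)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The chain is printed from the RDD the action would be called on back to the root, which is the reverse of the order in which the plan is described above.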

Getting RDD Lineage Graph — toDebugString Method

You can inspect an RDD lineage graph using the toDebugString method.

toDebugString uses indentations to indicate a shuffle boundary.

The numbers in round brackets show the level of parallelism (the number of partitions) at each stage, e.g. a leading (2) denotes two partitions.
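
A minimal sketch of calling toDebugString on a small word count, assuming a local SparkContext. The object name and the input data are illustrative assumptions; the RDD ids and call-site locations in the printed string depend on where the code runs, so no exact output is shown:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and input data are assumptions for illustration.
object DebugStringSketch {
  // Returns the lineage description of a word-count RDD.
  def run(): String = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("debug-string-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val lines = sc.parallelize(Seq("a b", "b c", "c a"), 2)

      val counts = lines
        .flatMap(_.split("\\s+"))
        .map((_, 1))
        .reduceByKey(_ + _, 2) // shuffle boundary, 2 partitions after the shuffle

      // No action is needed: toDebugString only describes the lineage.
      counts.toDebugString
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In the printed string the first line describes the ShuffledRDD created by reduceByKey, prefixed with (2). The RDDs before the shuffle appear indented under a +-(2) marker, which is the shuffle boundary mentioned above.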

With the spark.logLineage property enabled, the output of toDebugString is also written out whenever an action is executed.

Settings

Table 1. Spark Properties
Spark Property      Default Value   Description
spark.logLineage    false           When enabled (i.e. true), executing an action (and hence running a job) will also print out the RDD lineage graph using RDD.toDebugString.
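
A minimal sketch of enabling the property programmatically, assuming a local SparkContext created by the application itself. The object name is hypothetical. To the best of my knowledge the lineage is written through Spark's logger at INFO level, so it only shows up if the logging configuration lets INFO messages from SparkContext through:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name is an assumption for illustration.
object LogLineageSketch {
  // Runs one job with spark.logLineage enabled and returns the effective setting.
  def run(): Boolean = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("log-lineage-sketch")
      .set("spark.logLineage", "true")

    // getOrCreate reuses an already running context, in which case this conf
    // is not applied; the sketch assumes no other context is active.
    val sc = SparkContext.getOrCreate(conf)
    try {
      // count() is an action: with the property enabled, the lineage of the
      // RDD is reported when the job is submitted.
      sc.parallelize(1 to 10).map(_ + 1).count()

      sc.getConf.getBoolean("spark.logLineage", false)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The same property can be passed on the command line, e.g. --conf spark.logLineage=true with spark-shell or spark-submit, instead of setting it in code.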
