关注 spark技术分享,
撸spark源码 玩spark最佳实践

Checkpointing

Checkpointing

Checkpointing is a process of truncating RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system.

There are two types of checkpointing:

  • reliable – in Spark (core), RDD checkpointing that saves the actual intermediate RDD data to a reliable distributed file system, e.g. HDFS.

  • local – in Spark Streaming or GraphX – RDD checkpointing that truncates RDD lineage graph.

It’s up to a Spark application developer to decide when and how to checkpoint using RDD.checkpoint() method.

Before checkpointing is used, a Spark developer has to set the checkpoint directory using SparkContext.setCheckpointDir(directory: String) method.

Reliable Checkpointing

You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory – the directory where RDDs are checkpointed. The directory must be a HDFS path if running on a cluster. The reason is that the driver may attempt to reconstruct the checkpointed RDD from its own local file system, which is incorrect because the checkpoint files are actually on the executor machines.

You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.

Note
It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it on a file will require recomputation.

When an action is called on a checkpointed RDD, the following INFO message is printed out in the logs:

ReliableRDDCheckpointData

When RDD.checkpoint() operation is called, all the information related to RDD checkpointing are in ReliableRDDCheckpointData.

ReliableCheckpointRDD

After RDD.checkpoint the RDD has ReliableCheckpointRDD as the new parent with the exact number of partitions as the RDD.

Marking RDD for Local Checkpointing — localCheckpoint Method

localCheckpoint marks a RDD for local checkpointing using Spark’s caching layer.

localCheckpoint is for users who wish to truncate RDD lineage graph while skipping the expensive step of replicating the materialized data in a reliable distributed file system. This is useful for RDDs with long lineages that need to be truncated periodically, e.g. GraphX.

Local checkpointing trades fault-tolerance for performance.

Note
The checkpoint directory set through SparkContext.setCheckpointDir is not used.

LocalRDDCheckpointData

LocalCheckpointRDD

doCheckpoint Method

Caution
FIXME
赞(0) 打赏
未经允许不得转载:spark技术分享 » Checkpointing
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏