
Dataset Checkpointing

Dataset Checkpointing is a feature of Spark SQL that truncates a logical query plan, which can be particularly useful for highly iterative data algorithms (e.g. Spark MLlib, which uses Spark SQL's Dataset API for data manipulation).

Note

Checkpointing is actually a feature of Spark Core (which Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with the previously computed state of a distributed computation described as an RDD. It has been successfully used in Spark Streaming, the now-obsolete Spark module for stream processing based on the RDD API.

Checkpointing truncates the lineage of the RDD being checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.

Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.

Checkpointing can be eager or lazy, per the eager flag of the checkpoint operator. Eager checkpointing is the default and happens immediately when requested. Lazy checkpointing does not; the checkpoint happens only when an action is executed.
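For example (a minimal sketch; spark is an active SparkSession, and a checkpoint directory has to be set beforehand, as described next):

```scala
val nums = spark.range(5)

// Eager (default): the checkpoint happens right away
val eagerlyCheckpointed = nums.checkpoint()

// Lazy: merely marks the Dataset; the checkpoint happens at the first action
val lazilyCheckpointed = nums.checkpoint(eager = false)
lazilyCheckpointed.count() // the checkpoint is materialized here
```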

Using Dataset checkpointing requires that you specify the checkpoint directory. The directory stores the checkpoint files for RDDs to be checkpointed. Use SparkContext.setCheckpointDir to set the path to a checkpoint directory.
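For example (a sketch; the path below is only a sample, and on a real cluster it should be a fault-tolerant location such as an HDFS directory):

```scala
// Has to be called before the first (reliable) checkpoint is requested
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
```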

Checkpointing can be local or reliable, which defines how reliable the checkpoint directory is. Local checkpointing writes checkpoint files to executor storage and, due to the executor lifecycle, is considered unreliable. Reliable checkpointing writes checkpoint files to reliable data storage such as Hadoop HDFS.

Table 1. Dataset Checkpointing Types

|          | Eager           | Lazy                           |
|----------|-----------------|--------------------------------|
| Reliable | checkpoint      | checkpoint(eager = false)      |
| Local    | localCheckpoint | localCheckpoint(eager = false) |

An RDD can be recovered from checkpoint files using SparkContext.checkpointFile. You can then use the SparkSession.internalCreateDataFrame method to (re)create the DataFrame from the RDD of internal binary rows.
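A minimal sketch of that recovery path (hypothetical code: the object and method names are made up; because of the access modifiers involved it must be compiled inside the org.apache.spark.sql package; the directory should be the per-RDD checkpoint directory, e.g. as reported by RDD.getCheckpointFile, and the schema is that of the original Dataset):

```scala
// Hypothetical recovery sketch. SparkContext.checkpointFile is protected[spark]
// and SparkSession.internalCreateDataFrame is private[sql], so this object has
// to live in the org.apache.spark.sql package to compile.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

object RecoverFromCheckpoint {
  // checkpointDir: directory with the checkpoint files of a single RDD
  //   (e.g. as reported by RDD.getCheckpointFile)
  // schema: schema of the original Dataset (checkpoint files hold binary rows only)
  def recover(spark: SparkSession, checkpointDir: String, schema: StructType): DataFrame = {
    // Recover the RDD of internal binary rows from the checkpoint files
    val rows: RDD[InternalRow] = spark.sparkContext.checkpointFile[InternalRow](checkpointDir)
    // Re-create the DataFrame on top of the recovered RDD
    spark.internalCreateDataFrame(rows, schema)
  }
}
```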

Tip

Enable INFO logging level for org.apache.spark.rdd.ReliableRDDCheckpointData logger to see what happens while an RDD is checkpointed.

Add the following line to conf/log4j.properties:
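```
log4j.logger.org.apache.spark.rdd.ReliableRDDCheckpointData=INFO
```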

Refer to Logging.

Specifying Checkpoint Directory — SparkContext.setCheckpointDir Method

setCheckpointDir sets the checkpoint directory.

Internally, setCheckpointDir…​FIXME
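For reference, the method has the following signature on SparkContext:

```scala
def setCheckpointDir(directory: String): Unit
```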

Recovering RDD From Checkpoint Files — SparkContext.checkpointFile Method

checkpointFile reads (recovers) an RDD from a checkpoint directory.

Note
SparkContext.checkpointFile is a protected[spark] method so the code to access it has to be in org.apache.spark package.

Internally, checkpointFile creates a ReliableCheckpointRDD (within an RDD operation scope).
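For reference, the method signature (as of recent Spark versions) is:

```scala
protected[spark] def checkpointFile[T: ClassTag](path: String): RDD[T]
```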
