
Porting Existing Workloads to Delta Lake

When you port existing workloads to Delta Lake, you should be aware of the following simplifications and differences compared with the data sources provided by Apache Spark and Apache Hive.

Delta Lake handles the following operations automatically, which you should never perform manually:

Load a single partition

As an optimization, you may sometimes directly load the partition of data you are interested in, for example spark.read.parquet("/data/date=2017-01-01"). This is unnecessary with Delta Lake, since it can quickly scan the file list in the transaction log to find the relevant files. If you are interested in a single partition, specify it with a WHERE clause instead, for example spark.read.format("delta").load("/data").where("date = '2017-01-01'").
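As a side-by-side sketch of the two read patterns (the /data path is only a placeholder, and it assumes /data is the root of a Delta table with the delta-core library on the classpath):

import org.apache.spark.sql.SparkSession

object SinglePartitionRead extends App {
  val spark = SparkSession.builder().appName("delta-single-partition").getOrCreate()

  // Plain Parquet: the partition is selected by loading its directory path.
  val parquetPartition = spark.read.parquet("/data/date=2017-01-01")

  // Delta Lake: read the table root and filter with a WHERE clause;
  // the transaction log lets Delta skip files outside the partition.
  val deltaPartition = spark.read
    .format("delta")
    .load("/data")
    .where("date = '2017-01-01'")

  deltaPartition.show()
  spark.stop()
}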

When you port an existing application to Delta Lake, you should avoid the following operations, which bypass the transaction log:

Manually modify data

Delta Lake uses the transaction log to atomically commit changes to the table. Because the log is the source of truth, files that are written out but not added to the transaction log are not read by Spark. Similarly, even if you manually delete a file, a pointer to the file is still present in the transaction log.
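In practice this means row-level changes should go through Delta's APIs rather than through the file system. A minimal sketch, assuming the delta-core library is available and using a placeholder path and predicate:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object DeleteThroughDelta extends App {
  val spark = SparkSession.builder().appName("delta-delete").getOrCreate()

  // Deleting rows through the DeltaTable API records the change in the
  // transaction log; the underlying Parquet files are never removed by hand.
  DeltaTable.forPath(spark, "/data").delete("date < '2017-01-01'")

  spark.stop()
}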

External readers

The data stored in Delta Lake is encoded as Parquet files. However, accessing these files with an external reader is not safe: a reader that does not consult the transaction log may pick up files that were never committed or that have already been logically removed from the table.
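To illustrate the difference, the rough sketch below contrasts a raw Parquet scan of the table directory with a read through the delta format (the /data path is a placeholder for a Delta table root):

import org.apache.spark.sql.SparkSession

object ExternalReaderContrast extends App {
  val spark = SparkSession.builder().appName("delta-external-reader").getOrCreate()

  // A raw Parquet scan reads every data file under the directory,
  // regardless of what the transaction log says about them.
  val rawCount = spark.read.parquet("/data").count()

  // The delta format consults the transaction log and reads only the
  // files that belong to the current committed table version.
  val committedCount = spark.read.format("delta").load("/data").count()

  println(s"raw parquet rows: $rawCount, committed delta rows: $committedCount")
  spark.stop()
}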
