
Data Locality

Spark relies on data locality, also known as data placement or proximity to the data source; it makes Spark jobs sensitive to where the data is located. It is therefore important to run Spark on a Hadoop YARN cluster if the data comes from HDFS.

In Spark on YARN, Spark tries to place tasks alongside HDFS blocks.

With HDFS, the Spark driver contacts the NameNode to find the DataNodes (ideally local) that hold the various blocks of a file or directory, along with the blocks' locations (represented as InputSplits), and then schedules the work to the Spark workers.
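
As a concrete illustration, you can ask an HDFS-backed RDD for the preferred (data-local) hosts of each of its partitions via RDD.preferredLocations. This is a minimal sketch; the HDFS path is a placeholder.

    import org.apache.spark.sql.SparkSession

    object PreferredLocationsDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("preferred-locations-demo")
          .getOrCreate()

        // The path is a placeholder; any HDFS file or directory works.
        val rdd = spark.sparkContext.textFile("hdfs:///data/events.log")

        // preferredLocations reports the hosts (derived from HDFS block
        // locations) where each partition's data resides.
        rdd.partitions.foreach { p =>
          println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
        }

        spark.stop()
      }
    }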

Spark’s compute nodes (workers) should therefore run on the storage nodes.

This is the concept of locality-aware scheduling.

Spark tries to execute tasks as close to the data as possible to minimize data transfer (over the wire).
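
How long the scheduler holds out for a more local slot before falling back to the next locality level is tunable. Below is a minimal sketch of the standard spark.locality.wait settings; the values shown are Spark's documented defaults, not recommendations.

    import org.apache.spark.sql.SparkSession

    // How long to wait for a free slot at each locality level before
    // degrading to the next one (PROCESS_LOCAL -> NODE_LOCAL ->
    // RACK_LOCAL -> ANY). 3s is the documented default.
    val spark = SparkSession.builder()
      .appName("locality-wait-demo")
      .config("spark.locality.wait", "3s")
      .config("spark.locality.wait.process", "3s")  // wait for PROCESS_LOCAL
      .config("spark.locality.wait.node", "3s")     // wait for NODE_LOCAL
      .config("spark.locality.wait.rack", "3s")     // wait for RACK_LOCAL
      .getOrCreate()

Setting these to 0 makes the scheduler take any available slot immediately, trading locality for lower scheduling latency.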

[Figure 1. Locality Level in the Spark UI]

There are the following task localities, listed in the order of the org.apache.spark.scheduler.TaskLocality enumeration, from most to least local (a sketch of the object follows the list):

  • PROCESS_LOCAL: the data is in the same JVM as the running task, e.g. cached in the executor

  • NODE_LOCAL: the data is on the same node, e.g. in HDFS on the same machine

  • NO_PREF: the data has no locality preference and is accessed equally quickly from anywhere

  • RACK_LOCAL: the data is on a different node in the same rack

  • ANY: the data is elsewhere on the network, not in the same rack
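
The enumeration itself is small; the following is a simplified sketch modeled on org.apache.spark.scheduler.TaskLocality (the real object is private[spark]):

    // Simplified sketch modeled on org.apache.spark.scheduler.TaskLocality.
    // Values are declared from most to least local, so the ordering of the
    // enumeration can be compared directly.
    object TaskLocality extends Enumeration {
      type TaskLocality = Value
      val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

      // A task constrained to `constraint` may run under any `condition`
      // that is at least as local (i.e. smaller or equal in the ordering).
      def isAllowed(constraint: TaskLocality, condition: TaskLocality): Boolean =
        condition <= constraint
    }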

A task location can be either a host or a pair of a host and an executor, as the sketch below illustrates.
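
In code, this distinction shows up as separate location types; a simplified sketch modeled on org.apache.spark.scheduler.TaskLocation (also private[spark]) could look like this:

    // Simplified sketch modeled on org.apache.spark.scheduler.TaskLocation.
    sealed trait TaskLocation {
      def host: String
    }

    // Only a host is known, e.g. the node holding an HDFS block.
    case class HostTaskLocation(host: String) extends TaskLocation

    // The data is cached by a specific executor on that host, so running
    // the task in that executor's JVM would be PROCESS_LOCAL.
    case class ExecutorCacheTaskLocation(host: String, executorId: String)
      extends TaskLocation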
