
Data Locality

Spark relies on data locality, also known as data placement or proximity to the data source; it makes Spark jobs sensitive to where the data is located. It is therefore important to run Spark on a Hadoop YARN cluster if the data comes from HDFS.

In Spark on YARN, Spark tries to place tasks alongside HDFS blocks.

With HDFS, the Spark driver contacts the NameNode to find the DataNodes (ideally local) that hold the various blocks of a file or directory, along with the blocks' locations (represented as InputSplits), and then schedules the work to the Spark workers.
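
As a concrete illustration, you can ask an HDFS-backed RDD for the preferred (data-local) hosts of each of its partitions via RDD.preferredLocations. This is a minimal sketch; the HDFS path is a placeholder.

    import org.apache.spark.sql.SparkSession

    object PreferredLocationsDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("preferred-locations-demo")
          .getOrCreate()

        // The path is a placeholder; any HDFS file or directory works.
        val rdd = spark.sparkContext.textFile("hdfs:///data/events.log")

        // preferredLocations reports the hosts (derived from HDFS block
        // locations) where each partition's data resides.
        rdd.partitions.foreach { p =>
          println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
        }

        spark.stop()
      }
    }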

Spark’s compute nodes (workers) should therefore run on the storage nodes.

This is the concept of locality-aware scheduling.

Spark tries to execute tasks as close to the data as possible to minimize data transfer (over the wire).
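
How long the scheduler holds out for a more local slot before falling back to the next locality level is tunable. Below is a minimal sketch of the standard spark.locality.wait settings; the values shown are Spark's documented defaults, not recommendations.

    import org.apache.spark.sql.SparkSession

    // How long to wait for a free slot at each locality level before
    // degrading to the next one (PROCESS_LOCAL -> NODE_LOCAL ->
    // RACK_LOCAL -> ANY). 3s is the documented default.
    val spark = SparkSession.builder()
      .appName("locality-wait-demo")
      .config("spark.locality.wait", "3s")
      .config("spark.locality.wait.process", "3s")  // wait for PROCESS_LOCAL
      .config("spark.locality.wait.node", "3s")     // wait for NODE_LOCAL
      .config("spark.locality.wait.rack", "3s")     // wait for RACK_LOCAL
      .getOrCreate()

Setting these to 0 makes the scheduler take any available slot immediately, trading locality for lower scheduling latency.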

[Figure 1. Locality Level in the Spark UI]

There are the following task localities, listed in the order of the org.apache.spark.scheduler.TaskLocality enumeration, from most to least local (a sketch of the object follows the list):

  • PROCESS_LOCAL: the data is in the same JVM as the running task, e.g. cached in the executor

  • NODE_LOCAL: the data is on the same node, e.g. in HDFS on the same machine

  • NO_PREF: the data has no locality preference and is accessed equally quickly from anywhere

  • RACK_LOCAL: the data is on a different node in the same rack

  • ANY: the data is elsewhere on the network, not in the same rack
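
The enumeration itself is small; the following is a simplified sketch modeled on org.apache.spark.scheduler.TaskLocality (the real object is private[spark]):

    // Simplified sketch modeled on org.apache.spark.scheduler.TaskLocality.
    // Values are declared from most to least local, so the ordering of the
    // enumeration can be compared directly.
    object TaskLocality extends Enumeration {
      type TaskLocality = Value
      val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

      // A task constrained to `constraint` may run under any `condition`
      // that is at least as local (i.e. smaller or equal in the ordering).
      def isAllowed(constraint: TaskLocality, condition: TaskLocality): Boolean =
        condition <= constraint
    }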

A task location can be either a host or a pair of a host and an executor, as the sketch below illustrates.
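
In code, this distinction shows up as separate location types; a simplified sketch modeled on org.apache.spark.scheduler.TaskLocation (also private[spark]) could look like this:

    // Simplified sketch modeled on org.apache.spark.scheduler.TaskLocation.
    sealed trait TaskLocation {
      def host: String
    }

    // Only a host is known, e.g. the node holding an HDFS block.
    case class HostTaskLocation(host: String) extends TaskLocation

    // The data is cached by a specific executor on that host, so running
    // the task in that executor's JVM would be PROCESS_LOCAL.
    case class ExecutorCacheTaskLocation(host: String, executorId: String)
      extends TaskLocation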
