
Spark on YARN

You can submit Spark applications to a Hadoop YARN cluster using the yarn master URL.

Note
Since Spark 2.0.0, yarn is the only proper master URL, and you use --deploy-mode to choose between client (default) and cluster deploy modes.
Figure 1. Submitting Spark Application to YARN Cluster (aka Creating SparkContext with yarn Master URL and client Deploy Mode)

If you do not specify the deploy mode, client is assumed.

There are two deploy modes for YARN — client (default) or cluster.

Tip
Deploy modes are all about where the Spark driver runs.

In client mode, the Spark driver (and SparkContext) runs on a client node outside the YARN cluster, whereas in cluster mode it runs inside the YARN cluster, i.e. inside a YARN container alongside the ApplicationMaster (which represents the Spark application in YARN).
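For example, a minimal submission in each deploy mode might look like this (myapp.jar is a placeholder for your application JAR, not a name from this document):

```shell
# client mode (the default): the driver runs on the machine
# where spark-submit is executed
spark-submit --master yarn --deploy-mode client myapp.jar

# cluster mode: the driver runs inside a YARN container in the cluster,
# next to the ApplicationMaster
spark-submit --master yarn --deploy-mode cluster myapp.jar
```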

In that sense, Spark on YARN is a YARN-compatible execution framework that runs in a YARN cluster alongside other Hadoop workloads. On YARN, a Spark executor maps to a single YARN container.

Note
In order to deploy applications to YARN clusters, you need to use Spark with YARN support.

Spark on YARN supports multiple application attempts and supports data locality for data in HDFS. You can also take advantage of Hadoop’s security and run Spark in a secure Hadoop environment using Kerberos authentication (aka Kerberized clusters).

There are a few settings that are specific to YARN (see Settings). Of particular note is the support for YARN resource queues (to divide cluster resources and allocate shares to different teams and users based on advanced policies).

Tip
You can start spark-submit with the --verbose command-line option to have some settings displayed, including the YARN-specific ones. See spark-submit and YARN options.

The memory in the YARN resource requests is --executor-memory + what’s set for spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory with a minimum of 384m.
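The per-executor request can be sketched as follows (the helper name is ours, not Spark's; Spark performs the equivalent computation internally):

```python
def yarn_executor_request(executor_memory_mb, memory_overhead_mb=None):
    """Sketch of the per-executor YARN memory request:
    --executor-memory plus spark.yarn.executor.memoryOverhead,
    which defaults to max(10% of executor memory, 384 MB)."""
    if memory_overhead_mb is None:
        memory_overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + memory_overhead_mb

print(yarn_executor_request(2048))  # 2048 + 384 = 2432 (10% is below the 384m floor)
print(yarn_executor_request(8192))  # 8192 + 819 = 9011
```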

If YARN has enough resources, it deploys the executors distributed across the cluster; each executor then tries to process its data locally (NODE_LOCAL in the Spark web UI), running as many splits in parallel as you defined in spark.executor.cores.

Multiple Application Attempts

Spark on YARN supports multiple application attempts in cluster mode.

Caution
FIXME

spark-submit and YARN options

When you submit your Spark applications using spark-submit you can use the following YARN-specific command-line options:

  • --archives

  • --executor-cores

  • --keytab

  • --num-executors

  • --principal

  • --queue

Tip
Read about the corresponding settings in Settings in this document.
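Putting several of these options together, a hypothetical submission to a YARN resource queue could look as follows (the JAR name, queue name, archive, principal, and keytab path are placeholders, not values from this document):

```shell
# Hypothetical spark-submit invocation using the YARN-specific options
# listed above; adjust the placeholder values to your environment.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --queue analytics \
  --archives extra-libs.zip \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --verbose \
  myapp.jar
```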

Memory Requirements

The memory for ApplicationMaster is controlled by custom settings per deploy mode.

For client deploy mode it is a sum of spark.yarn.am.memory (default: 512m) with an optional overhead as spark.yarn.am.memoryOverhead.

For cluster deploy mode it is a sum of spark.driver.memory (default: 1g) with an optional overhead as spark.yarn.driver.memoryOverhead.

If the optional overhead is not set, it is computed as 10% of the main memory (spark.yarn.am.memory for client mode or spark.driver.memory for cluster mode) or 384m, whichever is larger.
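The ApplicationMaster memory computation described above can be sketched like this (the function name is ours; the defaults mirror the text: 512m for spark.yarn.am.memory, 1g for spark.driver.memory, overhead = max(10%, 384m)):

```python
def am_memory_mb(deploy_mode, main_memory_mb=None, overhead_mb=None):
    """Sketch of the ApplicationMaster memory requirement per deploy mode:
    main memory (spark.yarn.am.memory or spark.driver.memory) plus an
    overhead defaulting to max(10% of main memory, 384 MB)."""
    if main_memory_mb is None:
        main_memory_mb = 512 if deploy_mode == "client" else 1024
    if overhead_mb is None:
        overhead_mb = max(int(main_memory_mb * 0.10), 384)
    return main_memory_mb + overhead_mb

print(am_memory_mb("client"))   # 512 + 384 = 896
print(am_memory_mb("cluster"))  # 1024 + 384 = 1408
```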

Spark with YARN support

You need to have Spark that has been compiled with YARN support, i.e. the class org.apache.spark.deploy.yarn.Client must be on the CLASSPATH.

Otherwise, Spark reports an error in the logs and exits.

Master URL

Since Spark 2.0.0, the only proper master URL is yarn.

Before Spark 2.0.0, you could use yarn-client or yarn-cluster, but they are now deprecated. When you use a deprecated master URL, a deprecation warning is printed in the logs.

Keytab

Caution
FIXME

When a principal is specified a keytab must be specified, too.

The settings spark.yarn.principal and spark.yarn.keytab will be set to the respective values, and UserGroupInformation.loginUserFromKeytab will be called with them as input arguments.

Environment Variables

SPARK_DIST_CLASSPATH

SPARK_DIST_CLASSPATH is a distribution-defined CLASSPATH to add to processes.

Settings

Caution
FIXME Where and how are they used?