YarnClusterSchedulerBackend - Spark Technical Notes

YarnClusterSchedulerBackend – SchedulerBackend for YARN in Cluster Deploy Mode

YarnClusterSchedulerBackend is a custom YarnSchedulerBackend for Spark on YARN in cluster deploy mode.

It is a scheduler backend that supports multiple application attempts and provides the URLs of the driver’s logs, which are displayed as links for the driver in the Executors tab of the web UI.

It uses the spark.yarn.app.attemptId property under the covers (possibly set by the YARN resource manager; this is unconfirmed).
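As a rough illustration of what reading such a property might look like, here is a minimal sketch in plain Java. It is not Spark's code: the class `AttemptIdSketch`, the method `applicationAttemptId`, and the use of a plain `Map` in place of a Spark configuration are all hypothetical; only the property name comes from the text above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in, not Spark code: models looking up the
// application attempt id in a key-value configuration.
final class AttemptIdSketch {
    // The property name is taken from the text above.
    static final String ATTEMPT_ID_KEY = "spark.yarn.app.attemptId";

    // Returns the attempt id if the property is present, otherwise empty.
    static Optional<String> applicationAttemptId(Map<String, String> conf) {
        return Optional.ofNullable(conf.get(ATTEMPT_ID_KEY));
    }

    public static void main(String[] args) {
        // A configuration in which the property has been set (sample value).
        Map<String, String> withAttempt = Collections.singletonMap(ATTEMPT_ID_KEY, "2");
        // A configuration in which nothing has been set.
        Map<String, String> withoutAttempt = Collections.emptyMap();

        System.out.println(applicationAttemptId(withAttempt).orElse("(not set)"));
        System.out.println(applicationAttemptId(withoutAttempt).orElse("(not set)"));
    }
}
```

Returning an `Optional` rather than a bare string keeps the "property not set" case explicit for callers.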

Note	`YarnClusterSchedulerBackend` is a `private[spark]` Scala class. You can find the sources in org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.

Tip

Enable DEBUG logging level for org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend logger to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=DEBUG


Refer to Logging.

Creating a YarnClusterSchedulerBackend object requires a TaskSchedulerImpl and a SparkContext.

YarnClusterSchedulerBackend comes with a custom start method.

Note	`start` is part of the SchedulerBackend Contract.

getDriverLogUrls in YarnClusterSchedulerBackend calculates the URLs for the driver’s logs – standard output (stdout) and standard error (stderr).

Note	`getDriverLogUrls` is part of the SchedulerBackend Contract.

Internally, it retrieves the container ID and computes the base URL from environment variables.

You should see the following DEBUG message in the logs:



DEBUG Base URL for logs: [baseUrl]
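The computation described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the actual implementation: the class `DriverLogUrlsSketch` and its methods are hypothetical, the `/node/containerlogs/...` path layout is assumed to follow YARN NodeManager container-log links, and the inputs are passed in as arguments rather than read from the environment (on a real node they would presumably come from variables such as `NM_HOST`, `NM_HTTP_PORT` and `CONTAINER_ID`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Spark code: builds driver log URLs
// from a container id and the node's HTTP address.
final class DriverLogUrlsSketch {
    // Assumed layout: <scheme><host:port>/node/containerlogs/<containerId>/<user>
    static String baseUrl(String scheme, String nodeHttpAddress,
                          String containerId, String user) {
        return scheme + nodeHttpAddress + "/node/containerlogs/"
                + containerId + "/" + user;
    }

    // One entry per log stream named in the text: stdout and stderr.
    static Map<String, String> driverLogUrls(String baseUrl) {
        Map<String, String> urls = new LinkedHashMap<>();
        urls.put("stdout", baseUrl + "/stdout");
        urls.put("stderr", baseUrl + "/stderr");
        return urls;
    }

    public static void main(String[] args) {
        // Sample values only; none of them come from a real cluster.
        String base = baseUrl("http://", "node1:8042",
                "container_1_0001_01_000001", "alice");
        // Mirrors the DEBUG message quoted above.
        System.out.println("Base URL for logs: " + base);
        driverLogUrls(base).forEach((name, url) ->
                System.out.println(name + " -> " + url));
    }
}
```

Separating the base URL from the per-stream URLs mirrors the text: the base URL is what gets logged at DEBUG level, and the stdout and stderr links are derived from it.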
