关注 spark技术分享,
撸spark源码 玩spark最佳实践

YarnShuffleService — ExternalShuffleService on YARN

YarnShuffleService — ExternalShuffleService on YARN

YarnShuffleService is an external shuffle service for Spark on YARN. It is YARN NodeManager’s auxiliary service that implements org.apache.hadoop.yarn.server.api.AuxiliaryService.

Note
There is the ExternalShuffleService for Spark and despite their names they don’t share code.
Caution
FIXME What happens when the spark.shuffle.service.enabled flag is enabled?

After the external shuffle service is configured in YARN you enable it in a Spark application using spark.shuffle.service.enabled flag.

Note
YarnShuffleService was introduced in SPARK-3797.
Tip

Enable INFO logging level for org.apache.spark.network.yarn.YarnShuffleService logger in YARN logging system to see what happens inside.

YARN saves logs in /usr/local/Cellar/hadoop/2.7.2/libexec/logs directory on Mac OS X with brew, e.g. /usr/local/Cellar/hadoop/2.7.2/libexec/logs/yarn-jacek-nodemanager-japila.local.log.

Advantages

The advantages of using the YARN Shuffle Service:

  • With dynamic allocation enabled executors can be discarded and a Spark application could still get at the shuffle data the executors wrote out.

  • It allows individual executors to go into GC pause (or even crash) and still allow other Executors to read shuffle data and make progress.

Creating YarnShuffleService Instance

When YarnShuffleService is created, it calls YARN’s AuxiliaryService with spark_shuffle service name.

You should see the following INFO message in the logs:

getRecoveryPath

Caution
FIXME

serviceStop

serviceStop is part of YARN’s AuxiliaryService contract and is called when…​FIXME

Caution
FIXME The contract

When called, serviceStop simply closes shuffleServer and blockHandler.

Caution
FIXME What are shuffleServer and blockHandler? What’s their lifecycle?

When an exception occurs, you should see the following ERROR message in the logs:

stopContainer

stopContainer is part of YARN’s AuxiliaryService contract and is called when…​FIXME

Caution
FIXME The contract

When called, stopContainer simply prints out the following INFO message in the logs and exits.

It obtains the containerId from context using getContainerId method.

initializeContainer

initializeContainer is part of YARN’s AuxiliaryService contract and is called when…​FIXME

Caution
FIXME The contract

When called, initializeContainer simply prints out the following INFO message in the logs and exits.

It obtains the containerId from context using getContainerId method.

stopApplication

stopApplication is part of YARN’s AuxiliaryService contract and is called when…​FIXME

Caution
FIXME The contract

stopApplication requests the ShuffleSecretManager to unregisterApp when authentication is enabled and ExternalShuffleBlockHandler to applicationRemoved.

When called, stopApplication obtains YARN’s ApplicationId for the application (using the input context).

You should see the following INFO message in the logs:

If isAuthenticationEnabled, secretManager.unregisterApp is executed for the application id.

It requests ExternalShuffleBlockHandler to applicationRemoved (with cleanupLocalDirs flag disabled).

Caution
FIXME What does ExternalShuffleBlockHandler#applicationRemoved do?

When an exception occurs, you should see the following ERROR message in the logs:

initializeApplication

initializeApplication is part of YARN’s AuxiliaryService contract and is called when…​FIXME

Caution
FIXME The contract

initializeApplication requests the ShuffleSecretManager to registerApp when authentication is enabled.

When called, initializeApplication obtains YARN’s ApplicationId for the application (using the input context) and calls context.getApplicationDataForService for shuffleSecret.

You should see the following INFO message in the logs:

If isAuthenticationEnabled, secretManager.registerApp is executed for the application id and shuffleSecret.

When an exception occurs, you should see the following ERROR message in the logs:

serviceInit

serviceInit is part of YARN’s AuxiliaryService contract and is called when…​FIXME

Caution
FIXME

When called, serviceInit creates a TransportConf for the shuffle module that is used to create ExternalShuffleBlockHandler (as blockHandler).

It checks spark.authenticate key in the configuration (defaults to false) and if only authentication is enabled, it sets up a SaslServerBootstrap with a ShuffleSecretManager and adds it to a collection of TransportServerBootstraps.

It creates a TransportServer as shuffleServer to listen to spark.shuffle.service.port (default: 7337). It reads spark.shuffle.service.port key in the configuration.

You should see the following INFO message in the logs:

Installation

YARN Shuffle Service Plugin

Add the YARN Shuffle Service plugin from the common/network-yarn module to YARN NodeManager’s CLASSPATH.

Tip
Use yarn classpath command to know YARN’s CLASSPATH.

yarn-site.xml — NodeManager Configuration File

If external shuffle service is enabled, you need to add spark_shuffle to yarn.nodemanager.aux-services in the yarn-site.xml file on all nodes.

yarn.nodemanager.aux-services property is for the auxiliary service name being spark_shuffle with yarn.nodemanager.aux-services.spark_shuffle.class property being org.apache.spark.network.yarn.YarnShuffleService.

Exception — Attempting to Use External Shuffle Service in Spark Application in Spark on YARN

When you enable an external shuffle service in a Spark application when using Spark on YARN but do not install YARN Shuffle Service you will see the following exception in the logs:

赞(0) 打赏
未经允许不得转载:spark技术分享 » YarnShuffleService — ExternalShuffleService on YARN
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏