Broadcast variables

From the official Spark documentation about Broadcast Variables:

"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks."

And later in the same document:

"Explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important."

Figure 1. Broadcasting a value to executors

To use a broadcast value in a Spark transformation, you first create it using SparkContext.broadcast and then access the shared value with the value method. See the Introductory Example section.

The Broadcast feature in Spark uses SparkContext to create broadcast values and BroadcastManager and ContextCleaner to manage their lifecycle.

Figure 2. SparkContext to broadcast using BroadcastManager and ContextCleaner
Tip
Not only can Spark developers use broadcast variables for efficient data distribution, but Spark itself uses them quite often. A very notable use case is when Spark distributes tasks to executors for their execution. That does change my perspective on the role of broadcast variables in Spark.

Broadcast Spark Developer-Facing Contract

The developer-facing Broadcast contract allows Spark developers to use it in their applications.

Table 1. Broadcast API

| Method Name | Description |
|---|---|
| id | The unique identifier |
| value | The value |
| unpersist | Asynchronously deletes cached copies of this broadcast on the executors. |
| destroy | Destroys all data and metadata related to this broadcast variable. |
| toString | The string representation |

Lifecycle of Broadcast Variable

You can create a broadcast variable of type T using SparkContext.broadcast method.

Tip

Enable DEBUG logging level for org.apache.spark.storage.BlockManager logger to debug broadcast method.

Read BlockManager to find out how to enable the logging level.

With DEBUG logging level enabled, the logs show the BlockManager messages for this step.

After creating an instance of a broadcast variable, you can then reference the value using value method.

Note
value method is the only way to access the value of a broadcast variable.

With DEBUG logging level enabled, the logs show the BlockManager messages for this step.

When you are done with a broadcast variable, you should destroy it to release memory.

With DEBUG logging level enabled, the logs show the BlockManager messages for this step.

Before destroying a broadcast variable, you may want to unpersist it.
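The lifecycle steps above can be sketched as follows. This is a minimal illustration, not the original listing: the object name BroadcastLifecycle, the sample data, and the local[2] master are assumptions made for the example, and it needs spark-core on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLifecycle {
  // Walks one broadcast variable through create, use, unpersist and destroy.
  def run(sc: SparkContext): Int = {
    // Create: wrap a read-only value in a Broadcast[Seq[Int]] on the driver.
    val numbers = sc.broadcast(Seq(1, 2, 3))

    // Use: tasks read the shared value through the value method.
    val total = sc.parallelize(1 to 4).map(i => i * numbers.value.sum).collect().sum

    // Unpersist: asynchronously delete the cached copies on the executors.
    numbers.unpersist()

    // Destroy: release all data and metadata of the broadcast variable.
    numbers.destroy()
    total
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-lifecycle")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

In spark-shell you could call BroadcastLifecycle.run(sc) with the shell's own SparkContext instead of running main.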

Getting the Value of Broadcast Variable — value Method

value returns the value of a broadcast variable. You can access the value only until the broadcast variable is destroyed; after that, value throws a SparkException.

Internally, value makes sure that the broadcast variable is valid, i.e. destroy was not called, and, if so, calls the abstract getValue method.

Note

getValue is abstract and broadcast variable implementations are supposed to provide a concrete behaviour.

Refer to TorrentBroadcast.

Unpersisting Broadcast Variable — unpersist Methods

Destroying Broadcast Variable — destroy Method

destroy removes a broadcast variable.

Note
Once a broadcast variable has been destroyed, it cannot be used again.

If you try to destroy a broadcast variable more than once, destroy throws a SparkException.

Internally, destroy executes the internal destroy (with blocking enabled).
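A small sketch of what happens after destroy, assuming spark-core on the classpath. The object name DestroyedBroadcast and the helper failsWithSparkException are hypothetical and exist only for this example. It checks that both value and a second destroy are rejected with a SparkException once the variable has been destroyed.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkException}

object DestroyedBroadcast {
  // Hypothetical helper: true when the given action fails with a SparkException.
  private def failsWithSparkException(action: => Any): Boolean =
    try { action; false } catch { case _: SparkException => true }

  // Returns (value before destroy, value rejected?, second destroy rejected?).
  def run(sc: SparkContext): (Int, Boolean, Boolean) = {
    val answer = sc.broadcast(42)
    val before = answer.value // still valid: destroy has not been called yet
    answer.destroy()

    val valueRejected = failsWithSparkException(answer.value)
    val secondDestroyRejected = failsWithSparkException(answer.destroy())
    (before, valueRejected, secondDestroyRejected)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("destroyed-broadcast")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```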

Removing Persisted Data of Broadcast Variable — destroy Internal Method

destroy destroys all data and metadata of a broadcast variable.

Note
destroy is a private[spark] method.

Internally, destroy marks the broadcast variable as destroyed, i.e. sets the internal _isValid flag to false.

It then writes an INFO message about the destruction to the logs.

In the end, it executes the doDestroy method (which broadcast implementations are supposed to provide).

Note
doDestroy is part of the Broadcast contract for broadcast implementations so they can provide their own custom behaviour.
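To make the flow of value and destroy easier to follow, here is a simplified, Spark-free model of it. It is a sketch and not Spark's source code: SimpleBroadcast and InMemoryBroadcast are hypothetical names, println stands in for the INFO log message, and IllegalStateException stands in for SparkException.

```scala
// Simplified model of a broadcast variable; not Spark's actual Broadcast class.
abstract class SimpleBroadcast[T](val id: Long) {
  // Plays the role of the internal _isValid flag.
  private var isValid = true

  // The three methods that implementations are supposed to provide.
  protected def getValue(): T
  protected def doUnpersist(blocking: Boolean): Unit
  protected def doDestroy(blocking: Boolean): Unit

  // value first checks validity, then delegates to the implementation.
  def value: T = { assertValid(); getValue() }

  def unpersist(blocking: Boolean = false): Unit = { assertValid(); doUnpersist(blocking) }

  // destroy marks the variable invalid, reports it, then delegates.
  def destroy(): Unit = {
    assertValid()
    isValid = false
    println(s"Destroying $this") // stand-in for the INFO log message
    doDestroy(blocking = true)
  }

  private def assertValid(): Unit =
    if (!isValid) throw new IllegalStateException(s"$this was already destroyed")

  override def toString: String = s"SimpleBroadcast($id)"
}

// Toy implementation that only counts pretend executor-side copies.
final class InMemoryBroadcast[T](id: Long, data: T) extends SimpleBroadcast[T](id) {
  var executorCopies: Int = 0
  protected def getValue(): T = { executorCopies = 1; data } // reading caches a copy
  protected def doUnpersist(blocking: Boolean): Unit = { executorCopies = 0 }
  protected def doDestroy(blocking: Boolean): Unit = { executorCopies = 0 }
}
```

With this model, new InMemoryBroadcast[String](0L, "shared").value returns "shared", while calling value or destroy after destroy fails, which mirrors the behaviour described above.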

Introductory Example

Let’s start with an introductory example to check out how to use broadcast variables and build your initial understanding.

You’re going to use a static mapping of interesting projects with their websites, i.e. Map[String, String] that the tasks, i.e. closures (anonymous functions) in transformations, use.
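A minimal sketch of that computation without a broadcast variable might look like this. The name pws comes from the text; the object name WithoutBroadcast, the sample projects and URLs, and the local[2] master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WithoutBroadcast {
  def run(sc: SparkContext): Array[String] = {
    // Static mapping of projects to their websites (illustrative data).
    val pws = Map(
      "Apache Spark" -> "http://spark.apache.org/",
      "Scala" -> "http://www.scala-lang.org/")

    // pws is captured by the closure, so it is serialized together with the task code.
    sc.parallelize(Seq("Apache Spark", "Scala")).map(name => pws(name)).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("without-broadcast")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc).mkString(", ")) finally sc.stop()
  }
}
```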

It works, but it is very inefficient, as the pws map is sent over the wire to executors while it could have been there already. If more tasks need the pws map, you can improve their performance by minimizing the number of bytes sent over the network for task execution.

Enter broadcast variables.
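The same computation with the map wrapped in a broadcast variable could be sketched as follows. The names pwsB and WithBroadcast, the sample data, and the local[2] master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WithBroadcast {
  def run(sc: SparkContext): Array[String] = {
    // Static mapping of projects to their websites (illustrative data).
    val pws = Map(
      "Apache Spark" -> "http://spark.apache.org/",
      "Scala" -> "http://www.scala-lang.org/")

    // Create the broadcast variable once on the driver.
    val pwsB = sc.broadcast(pws)

    // The closure captures only the small Broadcast handle; tasks read the map via value.
    sc.parallelize(Seq("Apache Spark", "Scala")).map(name => pwsB.value(name)).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("with-broadcast")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc).mkString(", ")) finally sc.stop()
  }
}
```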

Semantically, the two computations – with and without the broadcast value – are exactly the same, but the broadcast-based one wins performance-wise when there are more executors spawned to execute many tasks that use pws map.

Introduction

Broadcast is part of Spark that is responsible for broadcasting information across nodes in a cluster.

You can use a broadcast variable to implement a map-side join, i.e. a join using a map. For this, lookup tables are distributed across the nodes in a cluster using broadcast and then looked up inside map (to do the join implicitly).
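As an illustration of a map-side join, the sketch below broadcasts a small lookup table and resolves it inside the transformations, so no shuffle is involved. All names and data (MapSideJoin, countries, users) are hypothetical, and dropping unmatched records is one possible choice that gives inner-join semantics.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapSideJoin {
  def run(sc: SparkContext): Array[(String, String)] = {
    // Small lookup table: country code -> country name (hypothetical data).
    val countries = sc.broadcast(Map("PL" -> "Poland", "CA" -> "Canada"))

    // Larger dataset: (user, country code).
    val users = sc.parallelize(Seq(("jacek", "PL"), ("agata", "PL"), ("john", "CA"), ("anon", "XX")))

    // The join is a lookup in the broadcast map; records without a match are dropped.
    users
      .filter { case (_, code) => countries.value.contains(code) }
      .map { case (user, code) => (user, countries.value(code)) }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-side-join")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc).mkString(", ")) finally sc.stop()
  }
}
```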

When you broadcast a value, it is copied to executors only once (while it is copied multiple times for tasks otherwise). It means that broadcast can help to make your Spark application faster if you have a large value to use in tasks or there are more tasks than executors.

It appears that a Spark idiom has emerged that combines collectAsMap with broadcast to create a Map for broadcasting. When an RDD is mapped to a smaller dataset (fewer columns, not fewer records), collected with collectAsMap, and broadcast, mapping the elements of the very big RDD against the broadcast map is computationally faster.

Prefer large broadcast HashMaps over RDDs whenever possible, and keep in the RDDs only the key needed to look up the necessary data.
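The collectAsMap idiom might be sketched like this. The datasets and names (CollectAsMapIdiom, products, orders, names) are made up for the example: a wide RDD is narrowed to the two columns needed, collected to the driver as a Map, broadcast, and then used for lookups from a second RDD that only carries the key.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectAsMapIdiom {
  def run(sc: SparkContext): Array[(Int, String)] = {
    // Hypothetical wide RDD: (product id, name, other columns).
    val products = sc.parallelize(Seq(
      (1, "laptop", "a long description"),
      (2, "phone", "another long description")))

    // Narrow column-wise, collect to the driver as a Map, and broadcast it.
    val names = sc.broadcast(products.map { case (id, name, _) => (id, name) }.collectAsMap())

    // The big RDD keeps only the key and looks up the rest in the broadcast map.
    val orders = sc.parallelize(Seq(2, 1, 2))
    orders.map(id => (id, names.value(id))).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("collect-as-map-idiom")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc).mkString(", ")) finally sc.stop()
  }
}
```

Note that collectAsMap pulls the whole narrowed dataset into driver memory, so the idiom only fits lookup data that is small enough for that.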

Spark comes with a BitTorrent-like implementation, TorrentBroadcast.

It is the only implementation of the Broadcast contract, so it is the one Spark uses.

Broadcast Contract

The Broadcast contract is made up of the following methods that custom Broadcast implementations are supposed to provide:

  1. getValue

  2. doUnpersist

  3. doDestroy

Note
TorrentBroadcast is the only implementation of the Broadcast contract.
Note
The Broadcast Spark Developer-Facing Contract section describes the methods that Spark developers use in their applications.
