Transformations

Transformations are lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.

In other words, transformations are functions that take an RDD as the input and produce one or many RDDs as the output. They do not change the input RDD (since RDDs are immutable and hence cannot be modified), but always produce one or more new RDDs by applying the computations they represent.

By applying transformations you incrementally build an RDD lineage with all the parent RDDs of the final RDD(s).

Transformations are lazy, i.e. they are not executed immediately. Only after calling an action are transformations executed.
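
For example, in spark-shell (where sc is the pre-created SparkContext), nothing below runs until the final action — a minimal sketch:

val nums = sc.parallelize(1 to 10)       // source RDD
val doubled = nums.map(_ * 2)            // transformation: lazy, returns a new RDD
val evens = doubled.filter(_ % 4 == 0)   // transformation: still lazy

// toDebugString prints the lineage built so far, without computing anything
println(evens.toDebugString)

// Only this action triggers execution of the whole lineage
println(evens.count)                     // 5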

After executing a transformation, the result RDD(s) will always be different from their parents and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).

Caution
There are transformations that may trigger jobs, e.g. sortBy, zipWithIndex, etc.
Figure 1. From SparkContext by transformations to the result

Certain transformations can be pipelined, which is an optimization Spark uses to improve the performance of computations.

There are two kinds of transformations:

Narrow Transformations

Narrow transformations are the result of map, filter and similar operations where the data comes from a single partition only, i.e. the computation is self-sustained.

An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.

Spark groups narrow transformations into a single stage, which is known as pipelining.
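
For example, the two narrow transformations below are fused into one stage, so both functions are applied in a single pass over each partition — a minimal sketch:

val rdd = sc.parallelize(1 to 100, 4)
// Both transformations are narrow: every output partition depends on
// exactly one parent partition, so Spark pipelines them in one stage.
val result = rdd.map(_ + 1).filter(_ % 2 == 0)
println(result.toDebugString)  // MapPartitionsRDDs only, no shuffle in the lineage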

Wide Transformations

Wide transformations are the result of groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD.

Note
Wide transformations are also called shuffle transformations because they usually, though not always, depend on a shuffle.

All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
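
For example, reduceByKey below forces a shuffle, visible as a ShuffledRDD in the lineage — a minimal sketch:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))
// reduceByKey is wide: all values for a key must land in one partition,
// so Spark inserts a shuffle and starts a new stage.
val counts = pairs.reduceByKey(_ + _)
println(counts.toDebugString)  // lineage shows a ShuffledRDD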

map

Caution
FIXME
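
Until this section is written, a minimal sketch of map in spark-shell:

val lines = sc.parallelize(Seq("hello world", "hi"))
// map applies the function to every element, producing an RDD of the
// same size: here RDD[Int] with values 11 and 2.
val lengths = lines.map(_.length)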

flatMap

Caution
FIXME
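
In the meantime, a quick flatMap sketch:

val lines = sc.parallelize(Seq("hello world", "hi"))
// flatMap maps each element to zero or more elements and flattens the
// result, so the output may be bigger than the input: here 3 words.
val words = lines.flatMap(_.split(" "))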

filter

Caution
FIXME
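
A minimal filter sketch:

val nums = sc.parallelize(1 to 10)
// filter keeps only elements satisfying the predicate, so the result
// is the same size or smaller: here 2, 4, 6, 8, 10.
val evens = nums.filter(_ % 2 == 0)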

randomSplit

Caution
FIXME
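
A minimal randomSplit sketch (the weights and seed here are illustrative):

val data = sc.parallelize(1 to 1000)
// Splits by weight (normalized to sum to 1); the seed makes the split
// reproducible. The resulting sizes are approximate, not exact.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)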

mapPartitions

Caution
FIXME
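
A minimal mapPartitions sketch:

val nums = sc.parallelize(1 to 10, 2)
// The function runs once per partition over an iterator of its
// elements, rather than once per element: here one sum per partition.
val partitionSums = nums.mapPartitions(it => Iterator(it.sum))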

Using an external key-value store (like HBase, Redis or Cassandra) and performing lookups/updates inside your mappers (creating a connection within a mapPartitions code block to avoid the connection setup/teardown overhead) might be a better solution, as sketched below.

If HBase is used as the external key-value store, row-level atomicity is guaranteed.
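
A sketch of that pattern (KeyValueClient and its connect/lookup/close API are hypothetical stand-ins for a real store client):

val keys = sc.parallelize(Seq("k1", "k2", "k3"))
val enriched = keys.mapPartitions { records =>
  // One connection per partition instead of one per record.
  // KeyValueClient is a hypothetical client for the external store.
  val conn = KeyValueClient.connect("store-host:1234")
  // records is a lazy iterator, so materialize the partition before
  // closing the connection.
  val out = records.map(k => (k, conn.lookup(k))).toList
  conn.close()
  out.iterator
}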

zipWithIndex

zipWithIndex zips this RDD[T] with its element indices.

Caution

If the number of partitions of the source RDD is greater than 1, it will submit an additional job to calculate start indices.

Figure 2. Spark job submitted by zipWithIndex transformation
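
A minimal zipWithIndex sketch:

val letters = sc.parallelize(Seq("a", "b", "c", "d"), 2)
// Pairs each element with its index; with more than one partition this
// first submits a job to compute every partition's start index.
val indexed = letters.zipWithIndex()  // ("a",0), ("b",1), ("c",2), ("d",3)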