PairRDDFunctions-spark技术分享

PairRDDFunctions

Tip	Read up the scaladoc of PairRDDFunctions.

PairRDDFunctions are available in RDDs of key-value pairs via Scala’s implicit conversion.

Tip	Partitioning is an advanced feature that is directly linked to (or inferred by) use of `PairRDDFunctions`. Read up about it in Partitions and Partitioning.

`countApproxDistinctByKey` Transformation

Caution

FIXME

`foldByKey` Transformation

Caution

FIXME

`aggregateByKey` Transformation

Caution

FIXME

`combineByKey` Transformation

Caution

FIXME

`partitionBy` Operator



partitionBy(partitioner: Partitioner): RDD[(K, V)]

partitionBy(partitioner: Partitioner): RDD[(K, V)]

Caution

FIXME

`groupByKey` and `reduceByKey` Transformations

reduceByKey is sort of a particular case of aggregateByKey.

You may want to look at the number of partitions from another angle.

It may often not be important to have a given number of partitions upfront (at RDD creation time upon loading data from data sources), so only “regrouping” the data by key after it is an RDD might be…the key (pun not intended).

You can use groupByKey or another PairRDDFunctions method to have a key in one processing flow.

You could use partitionBy that is available for RDDs to be RDDs of tuples, i.e. PairRDD:



rdd.keyBy(_.kind)
  .partitionBy(new HashPartitioner(PARTITIONS))
  .foreachPartition(...)

rdd.keyBy(_.kind)

.partitionBy(new HashPartitioner(PARTITIONS))

.foreachPartition(...)

Think of situations where kind has low cardinality or highly skewed distribution and using the technique for partitioning might be not an optimal solution.

You could do as follows:



rdd.keyBy(_.kind).reduceByKey(....)

rdd.keyBy(_.kind).reduceByKey(....)

or mapValues or plenty of other solutions. FIXME, man.

mapValues, flatMapValues

Caution

FIXME

`combineByKeyWithClassTag` Transformations



combineByKeyWithClassTag[C](
  createCombiner: V => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] (1)
combineByKeyWithClassTag[C](
  createCombiner: V => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C,
  numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] (2)
combineByKeyWithClassTag[C](
  createCombiner: V => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C,
  partitioner: Partitioner,
  mapSideCombine: Boolean = true,
  serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]

combineByKeyWithClassTag[C](

createCombiner: V => C,

mergeValue: (C, V) => C,

mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] (1)

combineByKeyWithClassTag[C](

createCombiner: V => C,

mergeValue: (C, V) => C,

mergeCombiners: (C, C) => C,

numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] (2)

combineByKeyWithClassTag[C](

createCombiner: V => C,

mergeValue: (C, V) => C,

mergeCombiners: (C, C) => C,

partitioner: Partitioner,

mapSideCombine: Boolean = true,

serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]

FIXME
FIXME too

combineByKeyWithClassTag transformations use mapSideCombine enabled (i.e. true) by default. They create a ShuffledRDD with the value of mapSideCombine when the input partitioner is different from the current one in an RDD.

Note	`combineByKeyWithClassTag` is a base transformation for combineByKey-based transformations, aggregateByKey, foldByKey, reduceByKey, countApproxDistinctByKey, and groupByKey.

PairRDDFunctions