NarrowDependency — Narrow Dependencies
NarrowDependency is a base (abstract) Dependency with narrow (limited) number of partitions of the parent RDD that are required to compute a partition of the child RDD.
|
Note
|
Narrow dependencies allow for pipelined execution. |
| Name | Description |
|---|---|
NarrowDependency Contract
NarrowDependency contract assumes that extensions implement getParents method.
|
1 2 3 4 5 |
def getParents(partitionId: Int): Seq[Int] |
getParents returns the partitions of the parent RDD that the input partitionId depends on.
OneToOneDependency
OneToOneDependency is a narrow dependency that represents a one-to-one dependency between partitions of the parent and child RDDs.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
scala> val r1 = sc.parallelize(0 to 9) r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:18 scala> val r3 = r1.map((_, 1)) r3: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[19] at map at <console>:20 scala> r3.dependencies res32: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@7353a0fb) scala> r3.toDebugString res33: String = (8) MapPartitionsRDD[19] at map at <console>:20 [] | ParallelCollectionRDD[13] at parallelize at <console>:18 [] |
PruneDependency
PruneDependency is a narrow dependency that represents a dependency between the PartitionPruningRDD and its parent RDD.
RangeDependency
RangeDependency is a narrow dependency that represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
It is used in UnionRDD for SparkContext.union, RDD.union transformation to list only a few.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
scala> val r1 = sc.parallelize(0 to 9) r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:18 scala> val r2 = sc.parallelize(10 to 19) r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:18 scala> val unioned = sc.union(r1, r2) unioned: org.apache.spark.rdd.RDD[Int] = UnionRDD[16] at union at <console>:22 scala> unioned.dependencies res19: Seq[org.apache.spark.Dependency[_]] = ArrayBuffer(org.apache.spark.RangeDependency@28408ad7, org.apache.spark.RangeDependency@6e1d2e9f) scala> unioned.toDebugString res18: String = (16) UnionRDD[16] at union at <console>:22 [] | ParallelCollectionRDD[13] at parallelize at <console>:18 [] | ParallelCollectionRDD[14] at parallelize at <console>:18 [] |
spark技术分享