FileScanRDD — Input RDD of FileSourceScanExec Physical Operator-spark技术分享

FileScanRDD — Input RDD of FileSourceScanExec Physical Operator

FileScanRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]) that is the one and only input RDD of FileSourceScanExec physical operator.

FileScanRDD is created exclusively when FileSourceScanExec physical operator is requested to createBucketedReadRDD and createNonBucketedReadRDD (which is when FileSourceScanExec is requested for the input RDD that WholeStageCodegenExec physical operator uses when executed).



val q = spark.read.text("README.md")

val sparkPlan = q.queryExecution.executedPlan
import org.apache.spark.sql.execution.FileSourceScanExec
val scan = sparkPlan.collectFirst { case exec: FileSourceScanExec => exec }.get
val inputRDD = scan.inputRDDs.head

val rdd = q.queryExecution.toRdd
scala> println(rdd.toDebugString)
(1) MapPartitionsRDD[1] at toRdd at <console>:26 []
 |  FileScanRDD[0] at toRdd at <console>:26 []

val fileScanRDD = q.queryExecution.toRdd.dependencies.head.rdd

// What FileSourceScanExec uses for the input RDD is exactly the first RDD in the lineage
assert(inputRDD == fileScanRDD)

val q = spark.read.text("README.md")

val sparkPlan = q.queryExecution.executedPlan

import org.apache.spark.sql.execution.FileSourceScanExec

val scan = sparkPlan.collectFirst { case exec: FileSourceScanExec => exec }.get

val inputRDD = scan.inputRDDs.head

val rdd = q.queryExecution.toRdd

scala> println(rdd.toDebugString)

(1) MapPartitionsRDD[1] at toRdd at <console>:26 []

| FileScanRDD[0] at toRdd at <console>:26 []

val fileScanRDD = q.queryExecution.toRdd.dependencies.head.rdd

// What FileSourceScanExec uses for the input RDD is exactly the first RDD in the lineage

assert(inputRDD == fileScanRDD)

Table 1. FileScanRDD’s Internal Properties (e.g. Registries, Counters and Flags)
Name	Description
`ignoreCorruptFiles`	spark.sql.files.ignoreCorruptFiles Used exclusively when `FileScanRDD` is requested to compute a partition
`ignoreMissingFiles`	spark.sql.files.ignoreMissingFiles Used exclusively when `FileScanRDD` is requested to compute a partition

`getPreferredLocations` Method



getPreferredLocations(split: RDDPartition): Seq[String]

getPreferredLocations(split: RDDPartition): Seq[String]

Note	`getPreferredLocations` is part of the RDD Contract to…FIXME.

getPreferredLocations…FIXME

`getPartitions` Method



getPartitions: Array[RDDPartition]

getPartitions: Array[RDDPartition]

Note	`getPartitions` is part of the RDD Contract to…FIXME.

getPartitions…FIXME

Creating FileScanRDD Instance

FileScanRDD takes the following when created:

SparkSession
Read function that takes a PartitionedFile and gives internal rows back (i.e. (PartitionedFile) ⇒ Iterator[InternalRow])
FilePartitions

Computing Partition (in TaskContext) — `compute` Method



compute(split: RDDPartition, context: TaskContext): Iterator[InternalRow]

compute(split: RDDPartition, context: TaskContext): Iterator[InternalRow]

Note	`compute` is part of Spark Core’s `RDD` Contract to compute a partition (in a `TaskContext`).

compute creates a Scala Iterator (of Java Objects) that…FIXME

compute then requests the input TaskContext to register a completion listener to be executed when a task completes (i.e. addTaskCompletionListener) that simply closes the iterator.

In the end, compute returns the iterator.

FileScanRDD — Input RDD of FileSourceScanExec Physical Operator