FileScanRDD — Input RDD of FileSourceScanExec Physical Operator
FileScanRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]) that is the one and only input RDD of FileSourceScanExec physical operator.
FileScanRDD is created exclusively when FileSourceScanExec physical operator is requested to createBucketedReadRDD or createNonBucketedReadRDD. That happens when FileSourceScanExec is requested for the input RDD, which WholeStageCodegenExec physical operator uses when executed.
    val q = spark.read.text("README.md")

    val sparkPlan = q.queryExecution.executedPlan
    import org.apache.spark.sql.execution.FileSourceScanExec
    val scan = sparkPlan.collectFirst { case exec: FileSourceScanExec => exec }.get
    val inputRDD = scan.inputRDDs.head

    val rdd = q.queryExecution.toRdd
    scala> println(rdd.toDebugString)
    (1) MapPartitionsRDD[1] at toRdd at <console>:26 []
     |  FileScanRDD[0] at toRdd at <console>:26 []

    val fileScanRDD = q.queryExecution.toRdd.dependencies.head.rdd

    // What FileSourceScanExec uses for the input RDD is exactly the first RDD in the lineage
    assert(inputRDD == fileScanRDD)

| Name | Description |
|---|---|
| spark.sql.files.ignoreCorruptFiles | Used exclusively when… |
| spark.sql.files.ignoreMissingFiles | Used exclusively when… |
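Both names in the table are regular Spark SQL configuration properties, so they can be set per session. The snippet below is a minimal sketch that assumes a spark-shell session (where `spark` is the active SparkSession); it only shows how to set and read the two properties and does not describe how FileScanRDD uses them.

```scala
// Assumes a spark-shell session, where `spark` is the active SparkSession.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Read the values back to confirm what the session will use.
println(spark.conf.get("spark.sql.files.ignoreCorruptFiles"))
println(spark.conf.get("spark.sql.files.ignoreMissingFiles"))
```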
getPreferredLocations Method
    getPreferredLocations(split: RDDPartition): Seq[String]

Note: getPreferredLocations is part of the RDD Contract to…FIXME.
getPreferredLocations…FIXME
getPartitions Method
    getPartitions: Array[RDDPartition]

Note: getPartitions is part of the RDD Contract to…FIXME.
getPartitions…FIXME
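Neither getPartitions nor getPreferredLocations is meant to be called directly. They back the public `partitions` and `preferredLocations` methods that every RDD offers, so the results can be inspected from the outside. The following is a sketch only: it assumes a spark-shell session, a README.md in the working directory, and that the FileScanRDD is the first dependency of the RDD returned by `toRdd` (which may differ between Spark versions).

```scala
// Assumes spark-shell (`spark` is available) and a README.md in the working directory.
val rowRDD = spark.read.text("README.md").queryExecution.toRdd

// Assumption: the FileScanRDD is the parent of the RDD returned by toRdd.
val scanRDD = rowRDD.dependencies.head.rdd

// RDD.partitions and RDD.preferredLocations are the public entry points
// that delegate to getPartitions and getPreferredLocations.
scanRDD.partitions.foreach { partition =>
  val locations = scanRDD.preferredLocations(partition)
  println(s"partition ${partition.index}: preferred locations = $locations")
}
```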
Creating FileScanRDD Instance
FileScanRDD takes the following when created:
- Read function that takes a PartitionedFile and gives internal rows back (i.e. (PartitionedFile) ⇒ Iterator[InternalRow])
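To make the shape of the read function concrete, here is a minimal, self-contained sketch in plain Scala. `SimpleFile`, `SimpleRow`, `fileContents` and `readFunction` are hypothetical stand-ins invented for this illustration (Spark's PartitionedFile and InternalRow are not used), and the slicing logic is deliberately crude.

```scala
// Hypothetical, simplified stand-ins for Spark's PartitionedFile and InternalRow.
final case class SimpleFile(path: String, start: Long, length: Long)
type SimpleRow = String

// In-memory "file system" so the sketch runs without any files on disk.
val fileContents = Map("notes.txt" -> "alpha\nbeta\ngamma")

// The read function: one file slice in, an iterator of rows out.
val readFunction: SimpleFile => Iterator[SimpleRow] = { file =>
  val text = fileContents(file.path)
  // Clamp the end of the slice to the end of the content.
  val end = math.min(text.length.toLong, file.start + file.length).toInt
  text.substring(file.start.toInt, end).split("\n").iterator
}

// Read the first 10 characters of the file as rows.
val rowsOfSlice = readFunction(SimpleFile("notes.txt", start = 0L, length = 10L)).toList
```

The point of the sketch is only the type: a function from one file (or slice of a file) to an iterator of rows, which is what FileScanRDD is given when created.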
Computing Partition (in TaskContext) — compute Method
    compute(split: RDDPartition, context: TaskContext): Iterator[InternalRow]

Note: compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).
compute creates a Scala Iterator (of Java Objects) that…FIXME
compute then requests the input TaskContext to register a completion listener to be executed when a task completes (i.e. addTaskCompletionListener) that simply closes the iterator.
In the end, compute returns the iterator.
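The pattern described above (create an iterator, register a completion listener that closes it, return the iterator) can be modelled in plain Scala. This is a simplified sketch and not Spark's implementation: `MiniTaskContext`, `FileRowIterator` and the file-by-file reading are assumptions made for the illustration, and rows are plain strings.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's TaskContext: it only records completion listeners.
final class MiniTaskContext {
  private val listeners = ArrayBuffer.empty[() => Unit]
  def addTaskCompletionListener(listener: () => Unit): Unit = { listeners += listener }
  // Called by the (imaginary) task runner when the task finishes.
  def markTaskCompleted(): Unit = listeners.foreach(listener => listener())
}

// Hypothetical closable iterator over the rows of all files in one partition.
final class FileRowIterator(files: Seq[String], readFile: String => Iterator[String])
    extends Iterator[String] with AutoCloseable {
  private val remainingFiles = files.iterator
  private var current: Iterator[String] = Iterator.empty
  var closed = false

  def hasNext: Boolean = {
    // Move on to the next file whenever the current one is exhausted.
    while (!closed && !current.hasNext && remainingFiles.hasNext) {
      current = readFile(remainingFiles.next())
    }
    !closed && current.hasNext
  }

  def next(): String = {
    if (!hasNext) throw new NoSuchElementException("no more rows")
    current.next()
  }

  def close(): Unit = { closed = true }
}

// Mirrors the three steps: create the iterator, register the listener, return the iterator.
def compute(files: Seq[String], context: MiniTaskContext)(
    readFile: String => Iterator[String]): Iterator[String] = {
  val iterator = new FileRowIterator(files, readFile)
  context.addTaskCompletionListener(() => iterator.close())
  iterator
}

// Usage: two in-memory "files" read within one task.
val contents = Map("a.txt" -> Seq("a1", "a2"), "b.txt" -> Seq("b1"))
val context = new MiniTaskContext
val rows = compute(Seq("a.txt", "b.txt"), context)(file => contents(file).iterator)
val collected = rows.toList
context.markTaskCompleted()
```

Registering the close with the task context, rather than relying on the consumer to exhaust the iterator, means resources are released even when a task stops reading early.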