DataSourceRDD — Input RDD Of DataSourceV2ScanExec Physical Operator
DataSourceRDD is an RDD that is created exclusively when DataSourceV2ScanExec physical operator is requested for the input RDD (when WholeStageCodegenExec physical operator is executed).
DataSourceRDD uses DataSourceRDDPartition partitions.
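A minimal sketch of such a partition, assuming the Spark 2.3 DataSourceV2 package layout (the exact field names and type bounds in Spark's source may differ):

```scala
import org.apache.spark.Partition
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory

// Sketch: a partition that simply carries the DataReaderFactory it was
// created for, together with its positional index within the RDD.
case class DataSourceRDDPartition[T](index: Int, readerFactory: DataReaderFactory[T])
  extends Partition
```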
Requesting Preferred Locations Of DataReaderFactory (For Partition) — getPreferredLocations Method
getPreferredLocations(split: Partition): Seq[String]
Note: getPreferredLocations is part of Spark Core's RDD Contract to specify placement preferences (preferred locations) of a partition.
getPreferredLocations simply requests the preferred locations of the DataReaderFactory of the input DataSourceRDDPartition partition.
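Inside a DataSourceRDD-like class, that delegation could look as follows (a sketch that reuses the DataSourceRDDPartition sketch above and assumes DataReaderFactory.preferredLocations() returns an Array[String], as in Spark 2.3):

```scala
// Sketch: placement preferences come straight from the partition's DataReaderFactory.
override def getPreferredLocations(split: Partition): Seq[String] =
  split.asInstanceOf[DataSourceRDDPartition[T]].readerFactory.preferredLocations().toSeq
```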
getPartitions Method
getPartitions: Array[Partition]
Note: getPartitions is part of Spark Core's RDD Contract to specify the set of partitions of an RDD.
getPartitions simply creates a DataSourceRDDPartition for every DataReaderFactory in the readerFactories.
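A sketch of that mapping, one partition per factory with its position used as the partition index (names follow the DataSourceRDDPartition sketch above):

```scala
// Sketch: wrap every DataReaderFactory in a DataSourceRDDPartition, using its
// position in the readerFactories collection as the partition index.
override protected def getPartitions: Array[Partition] =
  readerFactories.zipWithIndex.map { case (factory, index) =>
    new DataSourceRDDPartition(index, factory)
  }.toArray
```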
Creating DataSourceRDD Instance
DataSourceRDD takes the following when created:
- Collection of DataReaderFactory objects
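Putting this together, the class declaration could be sketched as below. The SparkContext parameter, the Seq type of readerFactories and the @transient marker are assumptions for illustration (the RDD superclass needs a SparkContext and a list of parent dependencies); the method bodies are left as placeholders since they are sketched in their own sections.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory

// Sketch of the constructor only: the collection of DataReaderFactory objects
// plus the SparkContext required by the RDD superclass (no parent dependencies).
class DataSourceRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val readerFactories: Seq[DataReaderFactory[T]])
  extends RDD[T](sc, Nil) {

  // getPartitions and compute are sketched in their dedicated sections
  override protected def getPartitions: Array[Partition] = ???
  override def compute(split: Partition, context: TaskContext): Iterator[T] = ???
}
```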
Computing Partition (in TaskContext) — compute Method
compute(split: Partition, context: TaskContext): Iterator[T]
Note: compute is part of Spark Core's RDD Contract to compute a partition (in a TaskContext).
compute requests the DataReaderFactory (of the DataSourceRDDPartition partition) to createDataReader.
compute registers a Spark Core TaskCompletionListener that requests the DataReader to close at task completion.
In the end, compute returns a Spark Core InterruptibleIterator over an iterator that uses the DataReader's next and get methods to produce records.
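A sketch of that flow, following the steps above; the anonymous iterator bridging the DataReader's next()/get() pair to a Scala Iterator is an illustration, and InterruptibleIterator comes from org.apache.spark:

```scala
import org.apache.spark.{InterruptibleIterator, Partition, TaskContext}

// Sketch of compute: create the DataReader, make sure it is closed when the
// task completes, and expose its next()/get() pair as an interruptible iterator.
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val reader = split.asInstanceOf[DataSourceRDDPartition[T]].readerFactory.createDataReader()
  context.addTaskCompletionListener(_ => reader.close())

  val underlying = new Iterator[T] {
    private[this] var valuePrepared = false

    override def hasNext: Boolean = {
      if (!valuePrepared) {
        valuePrepared = reader.next() // advance the underlying DataReader
      }
      valuePrepared
    }

    override def next(): T = {
      if (!hasNext) throw new java.util.NoSuchElementException("End of stream")
      valuePrepared = false
      reader.get() // fetch the current record from the DataReader
    }
  }
  new InterruptibleIterator(context, underlying)
}
```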