
FileScanRDD — Input RDD of FileSourceScanExec Physical Operator

FileScanRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]) that is the one and only input RDD of FileSourceScanExec physical operator.

FileScanRDD is created exclusively when the FileSourceScanExec physical operator is requested to createBucketedReadRDD or createNonBucketedReadRDD, i.e. when FileSourceScanExec is requested for the input RDD that the WholeStageCodegenExec physical operator uses when executed.
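A quick way to see both operators at work is to read any file-based source and inspect the query execution. The following spark-shell sketch assumes a Parquet file at the hypothetical path /tmp/people.parquet; explain shows the FileSourceScanExec (rendered as "FileScan parquet"), and the RDD lineage of the executed query shows the FileScanRDD it reads from.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("FileScanRDD demo").getOrCreate()

// /tmp/people.parquet is a hypothetical path; any file-based source works.
val df = spark.read.parquet("/tmp/people.parquet")

// The physical plan contains a FileSourceScanExec (shown as "FileScan parquet ...").
df.explain()

// The RDD lineage of the executed query includes the FileScanRDD it scans.
println(df.queryExecution.toRdd.toDebugString)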

Table 1. FileScanRDD’s Internal Properties (e.g. Registries, Counters and Flags)

Name: ignoreCorruptFiles
Description: spark.sql.files.ignoreCorruptFiles configuration property. Used exclusively when FileScanRDD is requested to compute a partition.

Name: ignoreMissingFiles
Description: spark.sql.files.ignoreMissingFiles configuration property. Used exclusively when FileScanRDD is requested to compute a partition.
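Both flags come from ordinary Spark SQL configuration properties, so they can be toggled at runtime. A minimal sketch, assuming a spark-shell session (spark):

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Confirm the current values
println(spark.conf.get("spark.sql.files.ignoreCorruptFiles"))
println(spark.conf.get("spark.sql.files.ignoreMissingFiles"))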

getPreferredLocations Method

Note
getPreferredLocations is part of Spark Core’s RDD Contract to specify placement preferences (preferred locations) of a partition.

getPreferredLocations…​FIXME
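Until the description above is filled in, a hedged way to observe the result from user code is to walk the RDD lineage of a file-based query, locate the FileScanRDD (the class is internal, so it is found by name here), and ask the public RDD API for each partition’s preferred locations. The path /tmp/people.parquet is again hypothetical, and the same spark-shell session (spark) is assumed.

import org.apache.spark.rdd.RDD

// Collect an RDD together with all of its ancestors in the lineage.
def lineage(rdd: RDD[_]): Seq[RDD[_]] =
  rdd +: rdd.dependencies.flatMap(dep => lineage(dep.rdd))

val scan = lineage(spark.read.parquet("/tmp/people.parquet").queryExecution.toRdd)
  .find(_.getClass.getSimpleName == "FileScanRDD")
  .get

// preferredLocations delegates to getPreferredLocations (unless the RDD is checkpointed).
scan.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${scan.preferredLocations(p).mkString(", ")}")
}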

getPartitions Method

Note
getPartitions is part of Spark Core’s RDD Contract to specify the partitions of this RDD.

getPartitions…​FIXME

Creating FileScanRDD Instance

FileScanRDD takes the following when created:

- SparkSession
- Read function that takes a PartitionedFile and gives internal rows back (i.e. (PartitionedFile) => Iterator[InternalRow])
- File partitions (i.e. Seq[FilePartition])

Computing Partition (in TaskContext) — compute Method

Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute creates a Scala Iterator (of Java Objects) that…​FIXME

compute then requests the input TaskContext to register a completion listener (i.e. addTaskCompletionListener) that simply closes the iterator when the task completes.

In the end, compute returns the iterator.
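The following self-contained sketch is not FileScanRDD’s actual code; it only demonstrates the same completion-listener pattern using the public TaskContext API: a per-partition resource is closed by a listener that runs when the task completes. It assumes the same spark-shell session (spark).

import java.io.Closeable
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

val numbers = spark.sparkContext.parallelize(1 to 10, numSlices = 2)

val withCleanup = numbers.mapPartitions { iter =>
  // Stand-in for the file-reading iterator that compute builds.
  val resource = new Closeable {
    override def close(): Unit =
      println(s"closing resource of partition ${TaskContext.getPartitionId()}")
  }
  // Register a completion listener that closes the resource when the task completes.
  TaskContext.get().addTaskCompletionListener(new TaskCompletionListener {
    override def onTaskCompletion(context: TaskContext): Unit = resource.close()
  })
  iter
}

withCleanup.count()  // the listeners fire once per task, after each partition is consumed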
