
DataSourceV2ScanExec

DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator that represents DataSourceV2Relation logical operators at execution time.

Note
A DataSourceV2Relation logical operator is created when…​FIXME

DataSourceV2ScanExec is a ColumnarBatchScan that supports vectorized batch decoding when it is created for a DataSourceReader that supports it, i.e. when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

DataSourceV2ScanExec is also a DataSourceReaderHolder.

DataSourceV2ScanExec is created exclusively when the DataSourceV2Strategy execution planning strategy is executed and finds a DataSourceV2Relation logical operator in a logical query plan.
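The planning step described above can be sketched as a pattern match over logical operators. This is a minimal, self-contained sketch and not Spark's actual code: the case classes below are simplified stand-ins that only borrow the names of the real operators, and their fields (`output`, `reader`) as well as `OtherLogicalPlan` and `plan` are assumptions made for illustration.

```scala
object PlanningSketch {
  // Simplified stand-ins; these are NOT the real Spark classes.
  sealed trait LogicalPlan
  case class DataSourceV2Relation(output: Seq[String], reader: String) extends LogicalPlan
  case class OtherLogicalPlan(name: String) extends LogicalPlan // hypothetical

  sealed trait SparkPlan
  case class DataSourceV2ScanExec(output: Seq[String], reader: String) extends SparkPlan

  // Returns a physical operator only for the logical operator it recognizes.
  def plan(logical: LogicalPlan): Seq[SparkPlan] = logical match {
    case DataSourceV2Relation(output, reader) =>
      DataSourceV2ScanExec(output, reader) :: Nil
    case _ =>
      Nil // nothing to plan here; other strategies handle other operators
  }

  def main(args: Array[String]): Unit = {
    println(plan(DataSourceV2Relation(Seq("id", "name"), "demoReader")))
    println(plan(OtherLogicalPlan("Project")))
  }
}
```

Returning an empty collection for unrecognized operators mirrors the general idea that a planning strategy only claims the plans it knows how to execute.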

When requested for its input RDDs (which happens when the WholeStageCodegenExec physical operator is executed), DataSourceV2ScanExec gives the single inputRDD as the only input RDD of internal rows.

Table 1. DataSourceV2ScanExec’s Internal Properties (e.g. Registries, Counters and Flags)

Name | Description
readerFactories | Collection of DataReaderFactory objects of UnsafeRows. Used when…FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of the SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute…​FIXME

supportsBatch Property

Note
supportsBatch is part of the ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (i.e. true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

Note
enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.
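The rule for supportsBatch can be illustrated with a small type-and-flag check. The sketch below is self-contained and does not depend on Spark: `DataSourceReader` and `SupportsScanColumnarBatch` are modeled here as plain Scala traits that only borrow the names of the real interfaces, and the `supportsBatch` helper is an assumed shape for the check, not a copy of the operator's code.

```scala
object SupportsBatchSketch {
  // Stand-ins for the reader interfaces; not the real Spark API.
  trait DataSourceReader
  trait SupportsScanColumnarBatch extends DataSourceReader {
    // Enabled by default, matching the note about enableBatchRead.
    def enableBatchRead: Boolean = true
  }

  // true only for a batch-capable reader whose flag is enabled
  def supportsBatch(reader: DataSourceReader): Boolean = reader match {
    case r: SupportsScanColumnarBatch if r.enableBatchRead => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val rowOnly = new DataSourceReader {}
    val batchDefault = new SupportsScanColumnarBatch {}
    val batchDisabled = new SupportsScanColumnarBatch {
      override def enableBatchRead: Boolean = false
    }
    println(supportsBatch(rowOnly))       // false
    println(supportsBatch(batchDefault))  // true
    println(supportsBatch(batchDisabled)) // false
  }
}
```

Both conditions have to hold: a reader that mixes in the batch trait but switches the flag off is treated the same as a row-only reader.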

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

DataSourceV2ScanExec initializes the internal registries and counters.

Creating Input RDD of Internal Rows — inputRDD Internal Property

Note
inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.

inputRDD…​FIXME

Note
inputRDD is used when the DataSourceV2ScanExec physical operator is requested for its input RDDs and when it is executed.
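The "computed once, cached afterwards" behaviour of a Scala lazy value, and the idea of handing out that single value as the only input, can be shown without Spark. In this sketch `ScanOperator`, `computations`, and the use of `Seq[String]` in place of `RDD[InternalRow]` are all assumptions for illustration; how the real inputRDD is built is not covered here.

```scala
object LazyInputSketch {
  // Hypothetical stand-in for the operator; Seq[String] stands in for RDD[InternalRow].
  class ScanOperator {
    var computations = 0 // counts how often the lazy value's body runs

    // Evaluated on first access only; later accesses reuse the cached result.
    lazy val inputRDD: Seq[String] = {
      computations += 1
      Seq("row-1", "row-2")
    }

    // The single inputRDD is handed out as the only input.
    def inputRDDs(): Seq[Seq[String]] = Seq(inputRDD)
  }

  def main(args: Array[String]): Unit = {
    val op = new ScanOperator
    println(op.computations) // 0: nothing computed before the first access
    op.inputRDDs()
    op.inputRDDs()
    println(op.computations) // 1: the body ran once despite two requests
  }
}
```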