ColumnarBatchScan Contract — Physical Operators With Vectorized Reader

ColumnarBatchScan is an extension of the CodegenSupport contract for physical operators that support columnar batch scan (aka vectorized reader).

ColumnarBatchScan uses the supportsBatch flag that is enabled (i.e. true) by default. Physical operators (i.e. FileSourceScanExec, InMemoryTableScanExec and DataSourceV2ScanExec) are expected to override it so that vectorized decoding is supported only when specific conditions are met.

ColumnarBatchScan uses the needsUnsafeRowConversion flag to control the name of the variable for an input row (the row term) while generating the Java source code for producing rows. The row term is what is passed on when generating the Java source code to consume generated columns or row from a physical operator. The needsUnsafeRowConversion flag is enabled (i.e. true) by default, which gives no name for the row term.
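The two flags and their defaults can be pictured with a minimal sketch. It is a simplified stand-in and not Spark's actual trait: ColumnarBatchScanLike, ConditionalScan, DefaultScan and the "atomic types only" condition are hypothetical names and rules chosen for illustration, and the flags are left public here only so that they are easy to inspect.

```scala
// Simplified stand-in for the contract: NOT Spark's actual trait,
// only the two flags described above with their defaults.
object FlagsSketch {
  trait ColumnarBatchScanLike {
    // Enabled by default.
    def supportsBatch: Boolean = true
    // Enabled by default: no name is given for the row term.
    def needsUnsafeRowConversion: Boolean = true
  }

  // Hypothetical operator that keeps both defaults.
  object DefaultScan extends ColumnarBatchScanLike

  // Hypothetical operator that supports batches only when every column
  // has one of a few simple types (an assumed condition).
  final class ConditionalScan(columnTypes: Seq[String]) extends ColumnarBatchScanLike {
    private val atomic = Set("int", "long", "double", "string")
    override def supportsBatch: Boolean = columnTypes.forall(atomic.contains)
  }
}
```

With this sketch, new FlagsSketch.ConditionalScan(Seq("int", "array<int>")) reports supportsBatch as off, while DefaultScan keeps both flags on.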

Table 1. ColumnarBatchScan’s Performance Metrics
Key | Name (in web UI) | Description
numOutputRows | number of output rows |
scanTime | scan time |

Table 2. ColumnarBatchScans
ColumnarBatchScan | Description
DataSourceV2ScanExec |
FileSourceScanExec |
InMemoryTableScanExec |

genCodeColumnVector Internal Method

genCodeColumnVector…​FIXME

Note
genCodeColumnVector is used exclusively when ColumnarBatchScan is requested to produceBatches.

Generating Java Source Code to Produce Batches — produceBatches Internal Method

produceBatches gives the Java source code to produce batches…​FIXME

Note
produceBatches is used exclusively when ColumnarBatchScan is requested to generate the Java source code for produce path in whole-stage code generation (when the supportsBatch flag is on).

supportsBatch Method

The supportsBatch flag controls whether a physical operator supports vectorized decoding or not. supportsBatch is enabled (i.e. true) by default.

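Note
supportsBatch is used, among others, when ColumnarBatchScan is requested to generate the Java source code for produce path in whole-stage code generation (doProduce).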

Generating Java Source Code for Produce Path in Whole-Stage Code Generation — doProduce Method

Note
doProduce is part of CodegenSupport Contract to generate the Java source code for produce path in Whole-Stage Code Generation.

doProduce first requests the input CodegenContext to add a mutable state for the first input RDD of a physical operator.

doProduce then generates the code with produceBatches when the supportsBatch flag is enabled, or with produceRows otherwise.

Note
supportsBatch is enabled by default unless overridden by a physical operator.
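The two steps of doProduce can be sketched as follows. This is a simplified model under stated assumptions: MiniContext is a hypothetical stand-in for CodegenContext that only records mutable state, ScanLike is a hypothetical trait, and the bodies of produceBatches and produceRows are placeholders rather than the code Spark generates.

```scala
object DoProduceSketch {
  // Hypothetical, minimal stand-in for CodegenContext: it only records mutable state.
  final class MiniContext {
    private val states = scala.collection.mutable.ArrayBuffer.empty[String]

    // Registers a field declaration plus its initialization and returns the field name.
    def addMutableState(javaType: String, name: String, init: String => String): String = {
      states += s"private $javaType $name; ${init(name)}"
      name
    }

    def mutableStates: Seq[String] = states.toSeq
  }

  trait ScanLike {
    def supportsBatch: Boolean = true

    // Placeholders for the real code generators.
    protected def produceBatches(ctx: MiniContext, input: String): String =
      s"/* batch loop over $input */"
    protected def produceRows(ctx: MiniContext, input: String): String =
      s"/* row loop over $input */"

    def doProduce(ctx: MiniContext): String = {
      // Step 1: mutable state for the first (and only) input.
      val input = ctx.addMutableState(
        "scala.collection.Iterator", "input", v => s"$v = inputs[0];")
      // Step 2: pick the code path per the supportsBatch flag.
      if (supportsBatch) produceBatches(ctx, input) else produceRows(ctx, input)
    }
  }
}
```

An operator that overrides supportsBatch to false takes the produceRows branch, while the mutable state is registered in both cases.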

Generating Java Source Code for Producing Rows — produceRows Internal Method

produceRows creates a new metric term for the numOutputRows metric.

produceRows creates a fresh term name for a row variable and assigns it as the name of the INPUT_ROW.

produceRows resets (nulls) currentVars.

For every output schema attribute, produceRows creates a BoundReference and requests it to generate code for expression evaluation.

produceRows selects the name of the row term per the needsUnsafeRowConversion flag, i.e. no name (null) when the flag is enabled and the name of the row variable otherwise.

produceRows generates the Java source code to consume generated columns or row from the current physical operator and uses it to generate the final Java source code for producing rows.

Note
produceRows is used exclusively when ColumnarBatchScan is requested to generate the Java source code for produce path in whole-stage code generation (when the supportsBatch flag is off).
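The steps above can be sketched as a function that assembles the Java source text for a row loop. Everything here is illustrative: the helper and its parameters are hypothetical, the consume argument merely stands in for the consume step of CodegenSupport, and the exact shape of the emitted loop (including the shouldStop() check) is an assumption rather than a quote of Spark's output.

```scala
object ProduceRowsSketch {
  // Hypothetical helper: builds the shape of a row loop as a Java source string.
  // `consume` receives the row term: None when needsUnsafeRowConversion is on
  // (no name for the row term), Some(rowTerm) otherwise.
  def produceRows(
      input: String,
      rowTerm: String,
      metricTerm: String,
      needsUnsafeRowConversion: Boolean)(consume: Option[String] => String): String = {
    val inputRow = if (needsUnsafeRowConversion) None else Some(rowTerm)
    s"""while ($input.hasNext()) {
       |  InternalRow $rowTerm = (InternalRow) $input.next();
       |  $metricTerm.add(1);
       |  ${consume(inputRow)}
       |  if (shouldStop()) return;
       |}""".stripMargin
  }
}
```

Calling it with needsUnsafeRowConversion = false hands the row variable name to the consume step, while the default (true) hands over no name at all.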

vectorTypes Method

vectorTypes are the class names of concrete ColumnVectors for every column used in a columnar batch.

vectorTypes gives no vector types by default.

Note
vectorTypes is used exclusively when ColumnarBatchScan is requested to produceBatches.
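A minimal sketch of how vectorTypes could be modelled and consumed follows. The Option[Seq[String]] return type, the ScanLike trait, the TwoColumnScan operator and the columnVectorClasses fallback helper are assumptions made for illustration; only the "no vector types by default" behaviour comes from the description above.

```scala
object VectorTypesSketch {
  trait ScanLike {
    // No vector types by default.
    def vectorTypes: Option[Seq[String]] = None
  }

  // Hypothetical operator with two columns, both backed by the same vector class.
  final class TwoColumnScan extends ScanLike {
    override def vectorTypes: Option[Seq[String]] =
      Some(Seq.fill(2)("org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"))
  }

  // Hypothetical helper: one class name per column, falling back to a
  // generic class name when the operator gives no vector types.
  def columnVectorClasses(scan: ScanLike, numColumns: Int, fallback: String): Seq[String] =
    scan.vectorTypes.getOrElse(Seq.fill(numColumns)(fallback))
}
```

Knowing the concrete class name per column lets generated code refer to the concrete vector class instead of the generic one, which is the likely reason an operator would override the default.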