VectorizedParquetRecordReader

VectorizedParquetRecordReader is a SpecificParquetRecordReaderBase for the parquet file format that materializes records directly to Java objects.

VectorizedParquetRecordReader is created exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended (when spark.sql.parquet.enableVectorizedReader configuration property is enabled and the result schema uses AtomicType data types only).

Note

spark.sql.parquet.enableVectorizedReader configuration property is on by default.
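
The condition for creating the reader can be sketched as a simple predicate. This is a minimal, self-contained illustration only: SketchType and VectorizedReaderDecision are hypothetical names, and the enum is a simplified stand-in for Spark's data types, not Spark's own API.

```java
import java.util.List;

// Hypothetical, simplified model of a column's data type.
// It only records whether the type counts as atomic; it is not Spark's DataType hierarchy.
enum SketchType {
    INT(true), LONG(true), DOUBLE(true), STRING(true),
    ARRAY(false), MAP(false), STRUCT(false);

    final boolean atomic;

    SketchType(boolean atomic) {
        this.atomic = atomic;
    }
}

class VectorizedReaderDecision {
    // Mirrors the condition described above: the configuration property must be
    // enabled and every field of the result schema must be an atomic type.
    static boolean useVectorizedReader(boolean enableVectorizedReader,
                                       List<SketchType> resultSchema) {
        return enableVectorizedReader
            && resultSchema.stream().allMatch(t -> t.atomic);
    }
}
```

Under these assumptions, a schema with a single ARRAY column would rule out the vectorized reader even with the property enabled.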

VectorizedParquetRecordReader uses OFF_HEAP memory mode when spark.sql.columnVector.offheap.enabled internal configuration property is enabled (it is disabled by default, in which case ON_HEAP is used).

VectorizedParquetRecordReader uses a capacity of 4 * 1024 (i.e. 4096).

Table 1. VectorizedParquetRecordReader’s Internal Properties (e.g. Registries, Counters and Flags)

Name | Description
columnarBatch | ColumnarBatch
columnVectors | Allocated WritableColumnVectors
MEMORY_MODE | Memory mode of the ColumnarBatch. Used exclusively when VectorizedParquetRecordReader is requested to initBatch.
missingColumns | Bitmap of columns (per index) that are missing (or simply the ones that the reader should not read)

nextKeyValue Method

Note
nextKeyValue is part of Hadoop’s RecordReader contract, which reads (key, value) pairs from a Hadoop InputSplit and presents a record-oriented view.

nextKeyValue…​FIXME

Note
nextKeyValue is used when…​FIXME

resultBatch Method

resultBatch returns the internal columnarBatch if it is available. Otherwise, resultBatch first runs initBatch to create it.

Note
resultBatch is used exclusively when VectorizedParquetRecordReader is requested to nextKeyValue.
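
The behaviour of resultBatch amounts to lazy initialization. The following is a minimal sketch of that pattern, not the reader's actual code: SketchBatch and LazyBatchReader are hypothetical stand-ins, and the initBatchCalls counter exists only to make the effect observable.

```java
// Hypothetical stand-in for Spark's ColumnarBatch, used only to keep the sketch self-contained.
class SketchBatch {
    final int capacity;

    SketchBatch(int capacity) {
        this.capacity = capacity;
    }
}

class LazyBatchReader {
    private static final int CAPACITY = 4 * 1024; // the capacity quoted on this page

    private SketchBatch columnarBatch; // stays null until initBatch runs
    int initBatchCalls = 0;            // illustration only: counts initializations

    private void initBatch() {
        initBatchCalls++;
        columnarBatch = new SketchBatch(CAPACITY);
    }

    // Returns the existing batch, creating it on first use.
    SketchBatch resultBatch() {
        if (columnarBatch == null) {
            initBatch();
        }
        return columnarBatch;
    }
}
```

Calling resultBatch repeatedly on the same LazyBatchReader hands back the same batch object, so the allocation cost is paid only once.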

Creating VectorizedParquetRecordReader Instance

VectorizedParquetRecordReader takes the following when created:

VectorizedParquetRecordReader initializes the internal registries and counters.

initialize Method

Note
initialize is part of SpecificParquetRecordReaderBase Contract to…​FIXME.

initialize…​FIXME

enableReturningBatches Method

enableReturningBatches…​FIXME

Note
enableReturningBatches is used when…​FIXME

initBatch Method

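initBatch comes in overloaded variants:

  1. One that uses MEMORY_MODE (with the input partitionColumns and partitionValues)

  2. One that uses MEMORY_MODE and no partitionColumns and no partitionValues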

initBatch creates the batch schema, which consists of the fields of sparkSchema followed by the fields of the input partitionColumns schema.
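
As a rough sketch of how the batch schema is put together, the example below appends the partition columns to the data columns. Plain field names stand in for Spark's schema fields, BatchSchemaSketch is a hypothetical name, and treating a null partitionColumns as "no partition columns" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class BatchSchemaSketch {
    // Field names stand in for Spark's schema fields (a simplification).
    // Data columns come first, partition columns are appended after them.
    static List<String> batchSchema(List<String> sparkSchema,
                                    List<String> partitionColumns) {
        List<String> fields = new ArrayList<>(sparkSchema);
        if (partitionColumns != null) { // assumption: null means no partition columns
            fields.addAll(partitionColumns);
        }
        return fields;
    }
}
```

For example, data columns id and name with a partition column year would give a batch schema of id, name, year in this model.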

initBatch requests OffHeapColumnVector or OnHeapColumnVector to allocate column vectors per the input memMode, i.e. OFF_HEAP or ON_HEAP memory modes, respectively. initBatch records the allocated column vectors as the internal WritableColumnVectors.

Note

spark.sql.columnVector.offheap.enabled configuration property controls OFF_HEAP or ON_HEAP memory modes, i.e. true or false, respectively.

spark.sql.columnVector.offheap.enabled is disabled by default which means that OnHeapColumnVector is used.
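
The mapping from the configuration property to the memory mode and then to the vector class can be sketched as follows. MemoryModeSketch and its Mode enum are hypothetical helpers for illustration. The class names are returned as strings so the example runs without Spark on the classpath.

```java
class MemoryModeSketch {
    // Hypothetical mirror of the two memory modes named in the text.
    enum Mode { ON_HEAP, OFF_HEAP }

    // true selects OFF_HEAP, false (the default) selects ON_HEAP.
    static Mode memoryMode(boolean offHeapEnabled) {
        return offHeapEnabled ? Mode.OFF_HEAP : Mode.ON_HEAP;
    }

    // Name of the column vector class that would be asked to allocate the vectors.
    static String vectorClassFor(Mode mode) {
        return mode == Mode.OFF_HEAP ? "OffHeapColumnVector" : "OnHeapColumnVector";
    }
}
```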

initBatch creates a ColumnarBatch (with the allocated WritableColumnVectors) and records it as the internal ColumnarBatch.

initBatch creates new slots in the allocated WritableColumnVectors for the input partitionColumns and sets the input partitionValues as constants.

initBatch initializes missing columns with nulls.
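
The last two steps of initBatch (partition values as constants, missing columns as nulls) can be modelled roughly as below. SketchVector and InitBatchTailSketch are hypothetical names, and the sketch assumes that partition columns occupy the slots after the data columns and that missingColumns is indexed by data column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a column vector that holds one value for every row.
class SketchVector {
    Object constant;     // value shared by all rows, if any
    boolean allNull;     // true when every row is null
    boolean constantSet; // true once the vector is marked as constant

    void setConstant(Object value) {
        constant = value;
        allNull = false;
        constantSet = true;
    }

    void setAllNull() {
        constant = null;
        allNull = true;
        constantSet = true;
    }
}

class InitBatchTailSketch {
    static List<SketchVector> populate(int dataColumns,
                                       Object[] partitionValues,
                                       boolean[] missingColumns) {
        List<SketchVector> vectors = new ArrayList<>();
        for (int i = 0; i < dataColumns + partitionValues.length; i++) {
            vectors.add(new SketchVector());
        }
        // Assumption: partition columns occupy the slots after the data columns.
        for (int i = 0; i < partitionValues.length; i++) {
            vectors.get(dataColumns + i).setConstant(partitionValues[i]);
        }
        // Columns flagged as missing are filled with nulls.
        for (int i = 0; i < missingColumns.length; i++) {
            if (missingColumns[i]) {
                vectors.get(i).setAllNull();
            }
        }
        return vectors;
    }
}
```

Both kinds of columns end up constant for the whole batch in this model, which is why they do not need to be read from the parquet file row by row.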

Note

initBatch is used when:
