VectorizedParquetRecordReader - Spark Technical Sharing

VectorizedParquetRecordReader

VectorizedParquetRecordReader is a SpecificParquetRecordReaderBase for the Parquet file format that materializes records directly to Java objects.

VectorizedParquetRecordReader is created exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended (when spark.sql.parquet.enableVectorizedReader configuration property is enabled and the result schema uses AtomicType data types only).
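The condition in the paragraph above can be sketched as a small predicate. This is a self-contained model and not Spark code: the `DataType` enum and the `useVectorizedReader` method are hypothetical stand-ins for Spark's data types and for the check that ParquetFileFormat makes.

```java
import java.util.List;

public class VectorizedReaderCheck {
    // Hypothetical stand-in for Spark SQL data types.
    // The flag marks the types that would be AtomicType in Spark.
    enum DataType {
        INT(true), STRING(true), ARRAY(false), STRUCT(false);

        final boolean atomic;

        DataType(boolean atomic) {
            this.atomic = atomic;
        }
    }

    // The vectorized reader is chosen only when the property is enabled
    // AND every field of the result schema has an atomic type.
    static boolean useVectorizedReader(boolean enabled, List<DataType> resultSchema) {
        return enabled && resultSchema.stream().allMatch(t -> t.atomic);
    }

    public static void main(String[] args) {
        System.out.println(useVectorizedReader(true, List.of(DataType.INT, DataType.STRING)));
        System.out.println(useVectorizedReader(true, List.of(DataType.INT, DataType.ARRAY)));
    }
}
```

A single non-atomic field (an array or a struct in this model) is enough to rule the vectorized reader out, even when the property is enabled.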

Note	spark.sql.parquet.enableVectorizedReader configuration property is enabled by default.



val isParquetVectorizedReaderEnabled = spark.conf.get("spark.sql.parquet.enableVectorizedReader").toBoolean
assert(isParquetVectorizedReaderEnabled, "spark.sql.parquet.enableVectorizedReader should be enabled by default")

VectorizedParquetRecordReader uses OFF_HEAP memory mode when spark.sql.columnVector.offheap.enabled internal configuration property is enabled (it is disabled by default), and ON_HEAP memory mode otherwise.
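As a minimal sketch, the mapping from the flag to the memory mode could look as follows. The `MemoryMode` enum here is a local stand-in for Spark's enum of the same name, and the `memoryMode` helper is hypothetical.

```java
public class MemoryModeSketch {
    // Local stand-in for Spark's MemoryMode enum.
    enum MemoryMode { ON_HEAP, OFF_HEAP }

    // useOffHeap mirrors the value of spark.sql.columnVector.offheap.enabled.
    static MemoryMode memoryMode(boolean useOffHeap) {
        return useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
    }

    public static void main(String[] args) {
        // The property is disabled by default, so the default mode is ON_HEAP.
        System.out.println(memoryMode(false));
    }
}
```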

VectorizedParquetRecordReader uses 4 * 1024 (4096) as the capacity, i.e. the number of rows that the column vectors of a batch are allocated for.
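To get a feel for the number, the arithmetic below assumes that the capacity is the maximum number of rows in a single batch. Both helper methods are hypothetical and only illustrate how a row count splits into batches of at most 4096 rows.

```java
public class BatchCapacitySketch {
    // 4 * 1024 = 4096, the capacity stated in the text.
    static final int CAPACITY = 4 * 1024;

    // Number of batches needed for totalRows (ceiling division),
    // assuming every batch holds at most CAPACITY rows.
    static long batchCount(long totalRows) {
        return (totalRows + CAPACITY - 1) / CAPACITY;
    }

    // Size of the next batch once rowsReturned rows have been handed out.
    static int nextBatchSize(long totalRows, long rowsReturned) {
        return (int) Math.min((long) CAPACITY, totalRows - rowsReturned);
    }

    public static void main(String[] args) {
        System.out.println(batchCount(10_000));           // 3
        System.out.println(nextBatchSize(10_000, 8_192)); // 1808
    }
}
```

With 10,000 rows, the first two batches are full (4096 rows each) and the third carries the remaining 1808 rows.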

Table 1. VectorizedParquetRecordReader’s Internal Properties (e.g. Registries, Counters and Flags)
Name	Description
`columnarBatch`	ColumnarBatch
`columnVectors`	Allocated `WritableColumnVectors`
`MEMORY_MODE`	Memory mode of the ColumnarBatch: `OFF_HEAP` (when useOffHeap is on as per spark.sql.columnVector.offheap.enabled configuration property) or `ON_HEAP` (otherwise). Used exclusively when `VectorizedParquetRecordReader` is requested to initBatch.
`missingColumns`	Bitmap of columns (per index) that are missing (or simply the ones that the reader should not read)

`nextKeyValue` Method



boolean nextKeyValue() throws IOException

Note	`nextKeyValue` is part of Hadoop’s RecordReader to read (key, value) pairs from a Hadoop InputSplit to present a record-oriented view.
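The record-oriented contract mentioned in the note is typically consumed in a loop: advance with `nextKeyValue` and, while it returns true, fetch the current value. The sketch below illustrates only that calling pattern. `SimpleRecordReader` is a stand-in that mirrors two methods of Hadoop's RecordReader, and the in-memory reader returned by `over` is hypothetical; it says nothing about how VectorizedParquetRecordReader implements the method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RecordReaderLoopSketch {
    // Minimal stand-in mirroring two methods of Hadoop's RecordReader.
    interface SimpleRecordReader<V> {
        boolean nextKeyValue() throws IOException;
        V getCurrentValue() throws IOException;
    }

    // Hypothetical in-memory reader, used only to drive the loop.
    static <V> SimpleRecordReader<V> over(List<V> values) {
        Iterator<V> it = values.iterator();
        return new SimpleRecordReader<V>() {
            private V current;

            @Override
            public boolean nextKeyValue() {
                if (!it.hasNext()) {
                    return false;
                }
                current = it.next();
                return true;
            }

            @Override
            public V getCurrentValue() {
                return current;
            }
        };
    }

    // The consumer side: keep advancing until nextKeyValue reports the end.
    static <V> List<V> readAll(SimpleRecordReader<V> reader) {
        List<V> out = new ArrayList<>();
        try {
            while (reader.nextKeyValue()) {
                out.add(reader.getCurrentValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(over(List.of("a", "b", "c"))));
    }
}
```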

nextKeyValue…FIXME

Note	`nextKeyValue` is used when…FIXME

`resultBatch` Method



ColumnarBatch resultBatch()

resultBatch returns the internal columnarBatch if it has already been created. Otherwise, resultBatch first runs initBatch to create it.

Note	`resultBatch` is used exclusively when `VectorizedParquetRecordReader` is requested to nextKeyValue.
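The behaviour of resultBatch is a lazy-initialization pattern, which can be sketched as follows. This is a simplified model and not the Spark source: the batch is a plain `Object` standing in for ColumnarBatch, and the `initBatchCalls` counter is a hypothetical addition that only makes the single initialization observable.

```java
public class LazyBatchSketch {
    // Stand-in for the internal ColumnarBatch; null until first requested.
    private Object columnarBatch;

    // Hypothetical counter, present only to show that initBatch runs once.
    int initBatchCalls = 0;

    void initBatch() {
        initBatchCalls++;
        columnarBatch = new Object();
    }

    // Returns the existing batch, creating it on first use.
    Object resultBatch() {
        if (columnarBatch == null) {
            initBatch();
        }
        return columnarBatch;
    }

    public static void main(String[] args) {
        LazyBatchSketch reader = new LazyBatchSketch();
        Object first = reader.resultBatch();
        Object second = reader.resultBatch();
        System.out.println(first == second);       // true
        System.out.println(reader.initBatchCalls); // 1
    }
}
```

Repeated calls hand back the same batch object, so a caller can request it on every iteration without paying for a new allocation.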

Creating VectorizedParquetRecordReader Instance

VectorizedParquetRecordReader takes the following when created:

TimeZone (null when no timezone conversion is expected)
useOffHeap flag (per spark.sql.columnVector.offheap.enabled configuration property)

VectorizedParquetRecordReader initializes the internal registries and counters.

`initialize` Method



void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)

Note	`initialize` is part of SpecificParquetRecordReaderBase Contract to…FIXME.

initialize…FIXME

`enableReturningBatches` Method



void enableReturningBatches()

enableReturningBatches…FIXME

Note	`enableReturningBatches` is used when…FIXME

`initBatch` Method



void initBatch(StructType partitionColumns, InternalRow partitionValues) // (1)
// private
private void initBatch() // (2)
private void initBatch(
  MemoryMode memMode,
  StructType partitionColumns,
  InternalRow partitionValues)

(1) Uses MEMORY_MODE
(2) Uses MEMORY_MODE and no partitionColumns and no partitionValues

initBatch creates the batch schema, which is sparkSchema followed by the fields of the input partitionColumns schema.

initBatch requests OffHeapColumnVector or OnHeapColumnVector to allocate column vectors per the input memMode, i.e. OFF_HEAP or ON_HEAP memory modes, respectively. initBatch records the allocated column vectors as the internal WritableColumnVectors.

Note	spark.sql.columnVector.offheap.enabled configuration property controls OFF_HEAP or ON_HEAP memory modes, i.e. `true` or `false`, respectively. `spark.sql.columnVector.offheap.enabled` is disabled by default which means that OnHeapColumnVector is used.

initBatch creates a ColumnarBatch (with the allocated WritableColumnVectors) and records it as the internal ColumnarBatch.

initBatch creates new slots in the allocated WritableColumnVectors for the input partitionColumns and sets the input partitionValues as constants.

initBatch initializes missing columns with nulls.

Note	`initBatch` is used when `VectorizedParquetRecordReader` is requested for resultBatch and when `ParquetFileFormat` is requested to build a data reader with partition column values appended.
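The initBatch steps described above can be sketched with plain Java collections. This is a simplified model under stated assumptions, not the Spark implementation: `Column` is a hypothetical stand-in for a WritableColumnVector, schemas are lists of field names, and a "constant" partition column is modelled by filling every slot with the same value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InitBatchSketch {
    // Hypothetical stand-in for a column vector with a fixed row capacity.
    static final class Column {
        final String name;
        final Object[] values;

        Column(String name, int capacity) {
            this.name = name;
            this.values = new Object[capacity];
        }
    }

    static List<Column> initBatch(
            int capacity,
            List<String> sparkSchema,
            List<String> partitionColumns,
            List<?> partitionValues,
            boolean[] missingColumns) {
        // 1. Batch schema: data fields followed by partition fields.
        List<String> batchSchema = new ArrayList<>(sparkSchema);
        batchSchema.addAll(partitionColumns);

        // 2. Allocate one column vector per field of the batch schema.
        List<Column> vectors = new ArrayList<>();
        for (String field : batchSchema) {
            vectors.add(new Column(field, capacity));
        }

        // 3. Partition columns hold the same value in every row.
        int offset = sparkSchema.size();
        for (int i = 0; i < partitionColumns.size(); i++) {
            Arrays.fill(vectors.get(offset + i).values, partitionValues.get(i));
        }

        // 4. Columns flagged as missing are set to nulls.
        for (int i = 0; i < missingColumns.length; i++) {
            if (missingColumns[i]) {
                Arrays.fill(vectors.get(i).values, null);
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<Column> batch = initBatch(
            4, List.of("id", "name"), List.of("year"), List.of(2024),
            new boolean[] {false, true});
        for (Column c : batch) {
            System.out.println(c.name + " -> " + Arrays.toString(c.values));
        }
    }
}
```

In this model the partition column `year` is filled once, up front, which reflects why partition values can be appended to every row of a batch without being read from the Parquet file.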
