VectorizedParquetRecordReader

VectorizedParquetRecordReader is a SpecificParquetRecordReaderBase for the parquet file format that directly materializes to Java objects.
VectorizedParquetRecordReader is created exclusively when ParquetFileFormat is requested to build a data reader with partition column values appended (when the spark.sql.parquet.enableVectorizedReader configuration property is enabled and the result schema uses AtomicType data types only).
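The two creation conditions above can be sketched as a single predicate. The helper and the `DataType` stand-in below are hypothetical illustrations of the rule, not Spark's actual code:

```java
import java.util.List;

// Hypothetical sketch of the decision described above: the vectorized
// reader is used only when the configuration property is enabled AND
// every field of the result schema is an atomic type.
class VectorizedReaderSupport {
    // Stand-in for Spark's AtomicType check (assumption for illustration)
    enum DataType { INT, LONG, DOUBLE, STRING, ARRAY, MAP, STRUCT }

    static boolean isAtomic(DataType t) {
        return t == DataType.INT || t == DataType.LONG
            || t == DataType.DOUBLE || t == DataType.STRING;
    }

    static boolean useVectorizedReader(boolean enableVectorizedReader,
                                       List<DataType> resultSchema) {
        return enableVectorizedReader
            && resultSchema.stream().allMatch(VectorizedReaderSupport::isAtomic);
    }
}
```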
Note: The spark.sql.parquet.enableVectorizedReader configuration property is on by default.
VectorizedParquetRecordReader uses the OFF_HEAP memory mode when the spark.sql.columnVector.offheap.enabled internal configuration property is enabled (which is not by default).
VectorizedParquetRecordReader uses 4 * 1024 (4096) rows for capacity.
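The memory-mode choice and the default capacity can be sketched together. The class and method names below are hypothetical; only the OFF_HEAP/ON_HEAP choice and the 4 * 1024 capacity come from the description above:

```java
// Sketch of the memory-mode selection and default batch capacity
// described above (hypothetical helper, not Spark's implementation).
class ReaderDefaults {
    enum MemoryMode { ON_HEAP, OFF_HEAP }

    // 4 * 1024 rows per batch (the capacity mentioned above)
    static final int CAPACITY = 4 * 1024;

    // OFF_HEAP only when spark.sql.columnVector.offheap.enabled is true
    static MemoryMode memoryMode(boolean offHeapEnabled) {
        return offHeapEnabled ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
    }
}
```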
Name | Description
---|---
columnVectors | Allocated WritableColumnVectors
MEMORY_MODE | Memory mode of the ColumnarBatch. Used exclusively when VectorizedParquetRecordReader is requested to initBatch.
missingColumns | Bitmap of columns (per index) that are missing (or simply the ones that the reader should not read)
nextKeyValue Method

```java
boolean nextKeyValue() throws IOException
```
Note: nextKeyValue is part of Hadoop’s RecordReader to read (key, value) pairs from a Hadoop InputSplit to present a record-oriented view.
nextKeyValue…FIXME
Note: nextKeyValue is used when…FIXME
resultBatch Method

```java
ColumnarBatch resultBatch()
```

resultBatch gives the columnarBatch if it is already available, or creates one with initBatch first.
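The gives-or-initializes behaviour is a plain lazy-initialization pattern, sketched below with a stand-in type (not Spark's actual ColumnarBatch or implementation):

```java
// Lazy-initialization sketch of resultBatch: return the cached batch
// if present, otherwise create it first (as initBatch does).
class LazyBatch {
    static final class ColumnarBatch {}  // stand-in for Spark's ColumnarBatch

    private ColumnarBatch columnarBatch;

    private void initBatch() {
        columnarBatch = new ColumnarBatch();
    }

    ColumnarBatch resultBatch() {
        if (columnarBatch == null) initBatch();
        return columnarBatch;
    }
}
```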
Note: resultBatch is used exclusively when VectorizedParquetRecordReader is requested to nextKeyValue.
Creating VectorizedParquetRecordReader Instance

VectorizedParquetRecordReader takes the following when created:

- useOffHeap flag (per the spark.sql.columnVector.offheap.enabled configuration property)

VectorizedParquetRecordReader initializes the internal registries and counters.
initialize Method

```java
void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
```
Note: initialize is part of the SpecificParquetRecordReaderBase Contract to…FIXME
initialize…FIXME
enableReturningBatches Method

```java
void enableReturningBatches()
```

enableReturningBatches…FIXME
Note: enableReturningBatches is used when…FIXME
initBatch Method

```java
void initBatch(
  StructType partitionColumns,
  InternalRow partitionValues)  // (1)

// private
private void initBatch()  // (2)
private void initBatch(
  MemoryMode memMode,
  StructType partitionColumns,
  InternalRow partitionValues)
```

1. Uses MEMORY_MODE
2. Uses MEMORY_MODE and no partitionColumns and no partitionValues
initBatch creates the batch schema, i.e. the sparkSchema together with the input partitionColumns schema.
initBatch requests OffHeapColumnVector or OnHeapColumnVector to allocate column vectors per the input memMode, i.e. the OFF_HEAP or ON_HEAP memory mode, respectively. initBatch records the allocated column vectors as the internal WritableColumnVectors.
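The per-memory-mode allocation is effectively a factory choice over the column-vector implementation. A hedged sketch with stand-in types (not Spark's OnHeapColumnVector/OffHeapColumnVector classes):

```java
// Sketch of the initBatch-style allocation: one vector per column,
// with the implementation picked by the memory mode.
class VectorAllocator {
    enum MemoryMode { ON_HEAP, OFF_HEAP }

    // Stand-ins for Spark's WritableColumnVector hierarchy (assumptions)
    interface WritableColumnVector {}
    static final class OnHeapColumnVector implements WritableColumnVector {}
    static final class OffHeapColumnVector implements WritableColumnVector {}

    static WritableColumnVector[] allocateColumns(int numColumns, MemoryMode mode) {
        WritableColumnVector[] vectors = new WritableColumnVector[numColumns];
        for (int i = 0; i < numColumns; i++) {
            vectors[i] = (mode == MemoryMode.OFF_HEAP)
                ? new OffHeapColumnVector()
                : new OnHeapColumnVector();
        }
        return vectors;
    }
}
```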
Note: The spark.sql.columnVector.offheap.enabled configuration property controls whether the OFF_HEAP or ON_HEAP memory mode is used, i.e. whether OffHeapColumnVectors or OnHeapColumnVectors are allocated.
initBatch creates a ColumnarBatch (with the allocated WritableColumnVectors) and records it as the internal ColumnarBatch.

initBatch creates new slots in the allocated WritableColumnVectors for the input partitionColumns and sets the input partitionValues as constants.
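Setting a partition value as a constant means every row of the batch sees the same value. A minimal sketch with a plain array standing in for a column vector (hypothetical helper, not Spark's code):

```java
import java.util.Arrays;

// Sketch of a constant partition column: one partition value repeated
// for every row slot of the batch (as initBatch does with partitionValues).
class ConstantColumn {
    static int[] constantIntColumn(int capacity, int partitionValue) {
        int[] vector = new int[capacity];
        Arrays.fill(vector, partitionValue);  // same value in every row
        return vector;
    }
}
```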
initBatch initializes missing columns with nulls.