HadoopFileLinesReader

HadoopFileLinesReader is a Scala Iterator of Apache Hadoop’s org.apache.hadoop.io.Text.

HadoopFileLinesReader is created to access datasets in the following data sources:

  • SimpleTextSource

  • LibSVMFileFormat

  • TextInputCSVDataSource

  • TextInputJsonDataSource

  • TextFileFormat

HadoopFileLinesReader uses the internal iterator that handles accessing files using Hadoop’s FileSystem API.

Creating HadoopFileLinesReader Instance

HadoopFileLinesReader takes the following when created:

iterator Internal Property

When created, HadoopFileLinesReader creates an internal iterator that builds a Hadoop org.apache.hadoop.mapreduce.lib.input.FileSplit from a Hadoop org.apache.hadoop.fs.Path for the given file.

iterator creates Hadoop’s TaskAttemptID, TaskAttemptContextImpl and LineRecordReader.

iterator initializes LineRecordReader and passes it on to RecordReaderIterator.

Note
iterator backs the Iterator methods (hasNext and next) as well as close.
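
The delegation described above can be sketched without any Hadoop dependency. This is a minimal illustration, not Spark's code: LinesIteratorSketch and fromString are hypothetical names, and a plain java.io.BufferedReader stands in for LineRecordReader. It shows one reader wrapped as an Iterator that also supports close and releases the reader once the input is exhausted.

```scala
import java.io.{BufferedReader, Closeable, StringReader}

// Hypothetical stand-in for the internal iterator: wraps a reader, exposes it
// as an Iterator of lines, and closes the reader when the input runs out.
final class LinesIteratorSketch(reader: BufferedReader)
    extends Iterator[String] with Closeable {

  private var nextLine: String = null // one-line look-ahead buffer
  private var finished = false
  private var closed = false

  override def hasNext: Boolean = {
    if (!finished && nextLine == null) {
      nextLine = reader.readLine()
      if (nextLine == null) close() // end of input: release the reader
    }
    !finished
  }

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    val line = nextLine
    nextLine = null
    line
  }

  // Safe to call more than once; any buffered line is dropped.
  override def close(): Unit = {
    if (!closed) {
      closed = true
      finished = true
      nextLine = null
      reader.close()
    }
  }

  def isClosed: Boolean = closed
}

object LinesIteratorSketch {
  // Convenience constructor for experiments (hypothetical helper).
  def fromString(text: String): LinesIteratorSketch =
    new LinesIteratorSketch(new BufferedReader(new StringReader(text)))
}
```

Callers that stop early should call close themselves; the automatic close only happens when hasNext reaches the end of the input.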

OffHeapColumnVector

OffHeapColumnVector is a concrete WritableColumnVector that…​FIXME

Allocating Column Vectors — allocateColumns Static Method

The allocateColumns variant that takes a StructType simply converts it to StructField[] and calls the other allocateColumns.

allocateColumns creates an array of OffHeapColumnVectors, one per field, each sized to hold capacity elements of the field's data type.

Note
allocateColumns is used when…​FIXME

OnHeapColumnVector

OnHeapColumnVector is a concrete WritableColumnVector that…​FIXME

OnHeapColumnVector is created when:

Allocating Column Vectors — allocateColumns Static Method

The allocateColumns variant that takes a StructType simply converts it to StructField[] and calls the other allocateColumns.

allocateColumns creates an array of OnHeapColumnVectors, one per field, each sized to hold capacity elements of the field's data type.

Note

allocateColumns is used when:

  • AggregateHashMap is created

  • InMemoryTableScanExec is requested to createAndDecompressColumn

  • VectorizedParquetRecordReader is requested to initBatch (with ON_HEAP memory mode)

  • OrcColumnarBatchReader is requested to initBatch (with ON_HEAP memory mode)

  • ColumnVectorUtils is requested to convert an iterator of rows into a single ColumnBatch (aka toBatch)
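
The two allocateColumns variants can be pictured with a small self-contained model. FieldSketch, SchemaSketch and VectorSketch are hypothetical stand-ins for StructField, StructType and OnHeapColumnVector, and data types are plain strings. Only the delegation and the one-vector-per-field shape are modelled.

```scala
// Hypothetical, simplified stand-ins for Spark's StructField, StructType and
// OnHeapColumnVector; only the shape of allocateColumns is modelled.
final case class FieldSketch(name: String, dataType: String)
final case class SchemaSketch(fields: Array[FieldSketch])
final class VectorSketch(val capacity: Int, val dataType: String)

object AllocateColumnsSketch {
  // Schema variant: unwraps the fields and delegates to the other overload.
  def allocateColumns(capacity: Int, schema: SchemaSketch): Array[VectorSketch] =
    allocateColumns(capacity, schema.fields)

  // Fields variant: one vector per field, each sized for `capacity` elements.
  def allocateColumns(capacity: Int, fields: Array[FieldSketch]): Array[VectorSketch] =
    fields.map(field => new VectorSketch(capacity, field.dataType))
}
```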

Creating OnHeapColumnVector Instance

OnHeapColumnVector takes the following when created:

  • Number of elements to hold in a vector (aka capacity)

  • Data type of the elements stored

When created, OnHeapColumnVector calls reserveInternal (for the given capacity) and then reset.
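
The creation steps can be sketched as follows. This is a simplified model under stated assumptions, not Spark's implementation: WritableVectorSketch and OnHeapIntVectorSketch are hypothetical, only integers are stored, and the bodies of reserveInternal and reset are guesses at the minimum needed to make the example run.

```scala
// Hypothetical, simplified model of the writable-vector contract; every name
// other than reserveInternal and reset is illustrative only.
abstract class WritableVectorSketch(initialCapacity: Int) {
  protected var capacity: Int = initialCapacity
  protected var elementsAppended: Int = 0

  // Contract: make room for `newCapacity` elements in the backing storage.
  protected def reserveInternal(newCapacity: Int): Unit

  // Assumed behaviour: clear bookkeeping so the vector can be refilled.
  def reset(): Unit = {
    elementsAppended = 0
  }

  def currentCapacity: Int = capacity
  def size: Int = elementsAppended
}

final class OnHeapIntVectorSketch(initialCapacity: Int)
    extends WritableVectorSketch(initialCapacity) {

  // Backing storage lives on the JVM heap as a plain array.
  // Declared before the calls below so it is initialized when they run.
  private var intData: Array[Int] = Array.empty[Int]

  // Mirrors the creation steps in the text: reserve, then reset.
  reserveInternal(initialCapacity)
  reset()

  override protected def reserveInternal(newCapacity: Int): Unit = {
    if (intData.length < newCapacity) {
      intData = java.util.Arrays.copyOf(intData, newCapacity)
    }
    capacity = newCapacity
  }

  def putInt(rowId: Int, value: Int): Unit = {
    intData(rowId) = value
    elementsAppended = math.max(elementsAppended, rowId + 1)
  }

  def getInt(rowId: Int): Int = intData(rowId)
}
```

Keeping reserveInternal abstract in the base class is what lets an on-heap and an off-heap subclass share the bookkeeping while choosing their own storage.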

reserveInternal Method

Note
reserveInternal is part of WritableColumnVector Contract to…​FIXME.

reserveInternal…​FIXME

reserveNewColumn Method

Note
reserveNewColumn is part of WritableColumnVector Contract to…​FIXME.

reserveNewColumn…​FIXME

WritableColumnVector

WritableColumnVector is the contract for…​FIXME

Table 1. (Subset of) WritableColumnVector Contract
Method Description

reserveInternal

Used when:

reserveNewColumn

Used when WritableColumnVector is created and requested to reserveDictionaryIds

Table 2. WritableColumnVectors
WritableColumnVector Description

OffHeapColumnVector

OnHeapColumnVector

reset Method

reset…​FIXME

Note
reset is used when…​FIXME

reserve Method

reserve…​FIXME

Note
reserve is used when…​FIXME

reserveDictionaryIds Method

reserveDictionaryIds…​FIXME

Note
reserveDictionaryIds is used when…​FIXME

Creating WritableColumnVector Instance

WritableColumnVector takes the following when created:

  • Number of elements to hold in a vector (aka capacity)

  • Data type of the elements stored

WritableColumnVector initializes the internal registries and counters.

WritableColumnVector…​FIXME

Note
WritableColumnVector is a Java abstract class and cannot be created directly.

ColumnVector

ColumnVector is the contract for…​FIXME

ColumnVector has a data type that you can access using dataType method.

Note

ColumnVector is an Evolving contract: it is moving towards a stable API but is not one yet, and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. (Subset of) ColumnVector Contract
Method Description

getChild

Used when…​FIXME

Table 2. ColumnVectors
ColumnVector Description

ArrowColumnVector

OffHeapColumnVector

OnHeapColumnVector

OrcColumnVector

WritableColumnVector

PartitioningUtils

PartitioningUtils Helper Object

PartitioningUtils is…​FIXME

validatePartitionColumn Method

validatePartitionColumn…​FIXME

Note
validatePartitionColumn is used when…​FIXME

parsePartitions Method

The parsePartitions variant that takes a timeZoneId maps it to a TimeZone and calls the other parsePartitions.

parsePartitions…​FIXME

Note
parsePartitions is used when…​FIXME

PartitionedFile — Part of Single File

PartitionedFile is a part (aka block) of a single file that should be read, along with partition column values that need to be appended to each row.

Note
Partition column values are the values of the partition columns. They are encoded in the directory structure of the partitioned dataset, not stored in the data files themselves.
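
To make the directory-structure point concrete, here is a minimal sketch that reads column=value segments out of a file path. PartitionPathSketch is a hypothetical helper and the paths are made up. Unlike Spark's own partition parsing, it does no unescaping and no type inference, and every value stays a string.

```scala
object PartitionPathSketch {
  // Collects `column=value` directory segments from a file path, in order.
  // The file name itself (the last segment) is never treated as a partition.
  def partitionValues(path: String): List[(String, String)] = {
    val segments = path.split('/').toList.filter(_.nonEmpty)
    segments.dropRight(1).flatMap { segment =>
      segment.split("=", 2) match {
        case Array(column, value) if column.nonEmpty => Some(column -> value)
        case _                                        => None
      }
    }
  }
}
```

For a path such as /data/events/year=2024/month=01/part-00000.parquet, the sketch yields the pairs (year, 2024) and (month, 01). The file itself carries neither column.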

PartitionedFile is created exclusively when FileSourceScanExec is requested to create RDDs for bucketed or non-bucketed reads.

PartitionedFile uses the following text representation (i.e. toString):

Creating PartitionedFile Instance

PartitionedFile takes the following when created:

  • Partition column values to be appended to each row (as an internal row)

  • Path of the file to read

  • Beginning offset (in bytes)

  • Number of bytes to read (aka length)

  • Locality information as a list of nodes that have the data (aka locations). Empty by default.
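
The five properties above map naturally onto a small value class. The sketch below is an assumption-laden model, not Spark's class: PartitionedFileSketch, its field names and the split helper are hypothetical, and partition values are a plain Map rather than an internal row. The split helper illustrates how one file can be described as several blocks by offset and length.

```scala
// Hypothetical, simplified stand-in for PartitionedFile; partition values are
// a plain Map here rather than an internal row.
final case class PartitionedFileSketch(
    partitionValues: Map[String, String],
    filePath: String,
    start: Long,
    length: Long,
    locations: Seq[String] = Seq.empty) // empty by default, as in the text

object PartitionedFileSketch {
  // Cuts one file into consecutive blocks of at most `maxBlockBytes` bytes.
  // Every block repeats the same partition values and file path.
  def split(
      partitionValues: Map[String, String],
      filePath: String,
      fileLength: Long,
      maxBlockBytes: Long): Seq[PartitionedFileSketch] = {
    require(maxBlockBytes > 0, "maxBlockBytes must be positive")
    (0L until fileLength by maxBlockBytes).map { offset =>
      val size = math.min(maxBlockBytes, fileLength - offset)
      PartitionedFileSketch(partitionValues, filePath, offset, size)
    }
  }
}
```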

HashedRelationBroadcastMode

HashedRelationBroadcastMode is a BroadcastMode that BroadcastHashJoinExec uses for the required output distribution of child operators.

HashedRelationBroadcastMode takes build-side join keys (as Catalyst expressions) when created.

HashedRelationBroadcastMode gives a copy of itself with keys canonicalized when requested for a canonicalized version.

transform Method

The rows-only transform calls the other transform with the size of rows as sizeHint.

Note
transform is part of BroadcastMode Contract to…​FIXME.

transform…​FIXME
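
The relationship between the two transform variants can be sketched with a simplified model. All names here are hypothetical: rows are plain Seq[Any] and keys are column indices rather than Catalyst expressions. Grouping rows by key is an assumption about what a hashed relation amounts to, not a description of Spark's actual data structure.

```scala
// Hypothetical, simplified model of the BroadcastMode contract.
trait BroadcastModeSketch {
  // Rows-only variant: delegates with the number of rows as the size hint.
  def transform(rows: Array[Seq[Any]]): Any =
    transform(rows.iterator, Some(rows.length.toLong))

  // The size hint is accepted but ignored by the sketches below.
  def transform(rows: Iterator[Seq[Any]], sizeHint: Option[Long]): Any
}

// Keeps the rows as they are.
object IdentityModeSketch extends BroadcastModeSketch {
  override def transform(rows: Iterator[Seq[Any]], sizeHint: Option[Long]): Any =
    rows.toArray
}

// Groups rows by their join-key columns so that lookups by key are direct.
final case class HashedModeSketch(keyColumns: Seq[Int]) extends BroadcastModeSketch {
  override def transform(rows: Iterator[Seq[Any]], sizeHint: Option[Long]): Any =
    build(rows)

  // Typed helper so callers of the sketch do not need a cast.
  def build(rows: Iterator[Seq[Any]]): Map[Seq[Any], Seq[Seq[Any]]] =
    rows.toList.groupBy(row => keyColumns.map(index => row(index)))
}
```

With rows (1, a), (2, b) and (1, c) and key column 0, the hashed sketch maps key 1 to two rows and key 2 to one row, which is the lookup shape a hash join needs on its build side.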

BroadcastMode

BroadcastMode is the contract for…​FIXME

Table 1. BroadcastMode Contract
Method Description

canonicalized

Used when…​FIXME

transform

Used when:

Note
The rows-only variant of transform does not seem to be used at all.

Table 2. BroadcastModes
BroadcastMode Description

HashedRelationBroadcastMode

IdentityBroadcastMode
