

FileFormat — Data Sources to Read and Write Data In Files

FileFormat is the contract for data sources that read and write data stored in files.

Table 1. FileFormat Contract
Method Description

buildReader

Builds a Catalyst data reader, i.e. a function that reads a single file (as a PartitionedFile) into an Iterator of InternalRows.

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work).

Used exclusively when FileFormat is requested to buildReaderWithPartitionValues

buildReaderWithPartitionValues

buildReaderWithPartitionValues builds a data reader with partition column values appended, i.e. a function that reads a single file (as a PartitionedFile) into an Iterator of InternalRows (like buildReader), with the partition values appended to every row.

Used exclusively when FileSourceScanExec physical operator is requested for the inputRDD (when requested for the inputRDDs and execution)

inferSchema

Infers (returns) the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None should be returned.

isSplitable

Controls whether the format (under the given path as a Hadoop Path) can be split or not.

isSplitable is disabled (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested to create an RDD for non-bucketed reads (when requested for the inputRDD and neither the optional bucketing specification of the HadoopFsRelation is defined nor bucketing is enabled)

prepareWrite

Prepares a write job and returns an OutputWriterFactory

Used exclusively when FileFormatWriter is requested to write query result

supportBatch

Controls whether the format supports columnar batches (i.e. vectorized decoding) or not.

supportBatch is off (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the supportsBatch

vectorTypes

vectorTypes defines the concrete column vector class names to use for each column in a columnar batch (when columnar batch support is enabled).

vectorTypes is undefined (None) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the vectorTypes
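Taken together, the contract can be pictured as a trait whose defaults are mostly disabled, with buildReader throwing until overridden. The following is a minimal, self-contained Scala sketch; InternalRow, PartitionedFile, the schema type, and the method signatures are simplified stand-ins (the real methods also take a SparkSession, options, and a Hadoop Configuration), not Spark's actual API:

```scala
// Simplified, self-contained model of the FileFormat contract.
// All names below are toy stand-ins for illustration only.
object FileFormatSketch {
  type InternalRow = Seq[Any]
  case class PartitionedFile(filePath: String, start: Long, length: Long)

  trait FileFormat {
    // Infers the schema of the given files, or None if unsupported.
    def inferSchema(files: Seq[String]): Option[Seq[String]] = None
    // Splitability is disabled (false) by default.
    def isSplitable(path: String): Boolean = false
    // Columnar-batch (vectorized) support is off (false) by default.
    def supportBatch: Boolean = false
    // Concrete column vector class names; undefined (None) by default.
    def vectorTypes: Option[Seq[String]] = None
    // Throws by default and must be overridden to be usable.
    def buildReader(): PartitionedFile => Iterator[InternalRow] =
      throw new UnsupportedOperationException(s"buildReader is not supported for $this")
  }

  // A toy line-oriented format that overrides the defaults.
  object ToyTextFormat extends FileFormat {
    override def isSplitable(path: String): Boolean = true
    override def buildReader(): PartitionedFile => Iterator[InternalRow] =
      file => Iterator(Seq(file.filePath, file.start, file.length))
  }
}
```

The sketch mirrors the default behaviours listed in Table 1: a format that overrides nothing is non-splitable, non-vectorized, and unusable for reading until it provides its own buildReader.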

Table 2. FileFormats (Direct Implementations and Extensions)
FileFormat Description

AvroFileFormat

Avro data source

HiveFileFormat

Writes Hive tables

OrcFileFormat

ORC data source

ParquetFileFormat

Parquet data source

TextBasedFileFormat

Base for text-based FileFormats that can be splitable

Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.

Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Reads the PartitionedFile with the data reader and, for every row produced, applies the converter to the joined row, i.e. the data row with the partition column values appended.

Note
buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.
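The two steps above can be sketched in a few lines of self-contained Scala. Rows are modeled as Seq[Any] and the generated UnsafeProjection is replaced by an identity converter, so this is only an illustration of the shape of the transformation, not Spark's implementation:

```scala
// Toy sketch of buildReaderWithPartitionValues: wrap the plain data reader
// so that the partition column values are appended to every row it produces.
object PartitionValuesSketch {
  type InternalRow = Seq[Any]
  case class PartitionedFile(filePath: String, partitionValues: InternalRow)

  def buildReaderWithPartitionValues(
      dataReader: PartitionedFile => Iterator[InternalRow]
  ): PartitionedFile => Iterator[InternalRow] = { file =>
    // Converter stand-in: Spark generates an UnsafeProjection over
    // requiredSchema ++ partitionSchema; here it is the identity.
    val convert: InternalRow => InternalRow = identity
    // Joined row: data columns first, partition columns appended.
    dataReader(file).map(row => convert(row ++ file.partitionValues))
  }
}
```

For example, a data reader that yields Seq(1, "a") for a file under a year=2018 partition directory would, after wrapping, yield Seq(1, "a", 2018).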