FileFormat — Data Sources to Read and Write Data In Files
FileFormat is the contract for data sources that read and write data stored in files.
Method | Description
---|---
`buildReader` | Builds a Catalyst data reader, i.e. a function that reads a `PartitionedFile` file as `InternalRow`s. Used exclusively when `FileFormat` is requested to `buildReaderWithPartitionValues`.
`buildReaderWithPartitionValues` | Builds a data reader that appends partition column values to the internal rows produced by `buildReader` (described in detail below).
`inferSchema` | Infers (returns) the schema of the given files (as Hadoop `FileStatus`es) if supported; otherwise returns `None`. Used when a `DataSource` is requested to resolve a file-based relation and infer its schema.
`isSplitable` | Controls whether the format (under the given path as a Hadoop `Path`) can be split or not. Used exclusively when `FileSourceScanExec` physical operator is requested for the input RDDs.
`prepareWrite` | Prepares a write job and returns an `OutputWriterFactory`. Used exclusively when `FileFormatWriter` is requested to write out the result of a query.
`supportBatch` | Flag that says whether the format supports columnar batch (i.e. vectorized decoding) or not. Used exclusively when `FileSourceScanExec` physical operator is requested for `supportsBatch`.
`vectorTypes` | Defines the concrete `ColumnVector` class names for the columns of a columnar batch (if supported). Used exclusively when `FileSourceScanExec` physical operator is requested for `vectorTypes`.
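As an illustration of the kind of decision `isSplitable` makes, here is a plain-Scala sketch. This is not Spark's actual implementation (real formats also consult the Hadoop compression codec for the file); the suffix list and function name are assumptions for this example.

```scala
// Illustrative sketch only: text-based formats are typically splittable
// unless the file is compressed with a non-splittable codec.
object SplitableSketch {
  // Assumed list of suffixes of non-splittable compressed files
  private val nonSplittableSuffixes = Seq(".gz", ".zip", ".snappy")

  // Mirrors the shape of FileFormat.isSplitable: a path in, a Boolean out
  def isSplitableByName(path: String): Boolean =
    !nonSplittableSuffixes.exists(path.endsWith)
}

println(SplitableSketch.isSplitableByName("logs/2024/data.csv"))    // splittable
println(SplitableSketch.isSplitableByName("logs/2024/data.csv.gz")) // not splittable
```

A non-splittable file must be read by a single task end to end, which is why gzip-compressed text files often become a single large partition.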
FileFormat | Description
---|---
`CSVFileFormat` | The `csv` data source
`JsonFileFormat` | The `json` data source
`OrcFileFormat` | The `orc` data source
`ParquetFileFormat` | The `parquet` data source (the default)
`TextFileFormat` | The `text` data source
`HiveFileFormat` | Writes rows to Hive-serde tables
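For example, writing and reading Parquet files goes through `ParquetFileFormat` under the covers. A minimal sketch, assuming a local Spark environment (the application name and output path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fileformat-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// ParquetFileFormat's prepareWrite / OutputWriterFactory handle the write
df.write.mode("overwrite").parquet("/tmp/fileformat-demo.parquet")

// inferSchema and buildReaderWithPartitionValues handle the read
val back = spark.read.parquet("/tmp/fileformat-demo.parquet")
println(back.schema.simpleString)

spark.stop()
```

The short name `parquet` is resolved to `ParquetFileFormat` by the data source registration mechanism, so end users never reference the class directly.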
Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

```scala
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```
buildReaderWithPartitionValues is an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.

Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and returns a data reader function (from a PartitionedFile to an Iterator[InternalRow]) that does the following:
1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

2. Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended
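The two steps above can be sketched in plain Scala. This is a deliberate simplification: Spark uses InternalRow, JoinedRow and a generated UnsafeProjection, while the names below are stand-ins invented for this example.

```scala
// Plain-Scala stand-in for InternalRow
type Row = Seq[Any]

// Hypothetical base reader, as buildReader would return for one file
def baseReader(file: String): Iterator[Row] =
  Iterator(Seq(1, "a"), Seq(2, "b"))

// Partition column values for the file, e.g. a single `date` partition column
val partitionValues: Row = Seq("2024-01-01")

// Step 1: the "converter" -- here a simple concatenation standing in for the
// UnsafeProjection over the attributes of requiredSchema and partitionSchema
def convert(dataRow: Row): Row = dataRow ++ partitionValues

// Step 2: apply the base reader to a file and convert every row,
// appending the partition column values
def readerWithPartitionValues(file: String): Iterator[Row] =
  baseReader(file).map(convert)

readerWithPartitionValues("part-00000").foreach(println)
```

The real implementation avoids copying by projecting a JoinedRow of the data row and the partition values into a single UnsafeRow.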
Note: buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.