FileFormat — Data Sources to Read and Write Data In Files
FileFormat is the contract for data sources that read and write data stored in files.
| Method | Description |
|---|---|
| buildReader | Builds a Catalyst data reader, i.e. a function that reads a PartitionedFile file as InternalRows. Used exclusively when FileFormat is requested to buildReaderWithPartitionValues |
| buildReaderWithPartitionValues | Builds a data reader that appends partition column values to the internal rows produced by buildReader (see below). Used exclusively when FileSourceScanExec physical operator is requested for the input RDDs |
| inferSchema | Infers (returns) the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None. Used when the schema of a file-based data source has to be inferred, e.g. when a DataSource resolves a relation |
| isSplitable | Controls whether the format (under the given path as Hadoop Path) can be split or not. Used exclusively when FileSourceScanExec physical operator creates an RDD for a non-bucketed read |
| prepareWrite | Prepares a write job and returns an OutputWriterFactory. Used exclusively when FileFormatWriter is requested to write out the result of a query |
| supportBatch | Flag that says whether the format supports columnar batch (i.e. vectorized decoding) or not. Used exclusively when FileSourceScanExec physical operator is requested for the supportsBatch flag |
| vectorTypes | Defines the concrete ColumnVector class names to use for vectorized decoding (undefined by default). Used exclusively when FileSourceScanExec physical operator is requested for vectorTypes |
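For reference, the sketch below spells out the shape of these methods (Spark 2.3-era signatures, abridged; defaults shown where the contract defines them). It mirrors org.apache.spark.sql.execution.datasources.FileFormat but is illustrative, not a verbatim copy of the Spark source.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{OutputWriterFactory, PartitionedFile}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Illustrative sketch of the FileFormat contract (the trait name is changed on purpose)
trait FileFormatSketch {
  // Schema inference over the files to read (None if unsupported)
  def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  // Prepares a write job and gives the factory of output writers
  def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory

  // Columnar batch (vectorized decoding) support; disabled by default
  def supportBatch(sparkSession: SparkSession, dataSchema: StructType): Boolean = false

  // Concrete ColumnVector class names per column; undefined by default
  def vectorTypes(
      requiredSchema: StructType,
      partitionSchema: StructType,
      sqlConf: SQLConf): Option[Seq[String]] = None

  // Whether a file under the given path can be split; disabled by default
  def isSplitable(
      sparkSession: SparkSession,
      options: Map[String, String],
      path: Path): Boolean = false

  // Reader function for a single PartitionedFile (no partition values appended)
  protected def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] =
    throw new UnsupportedOperationException(s"buildReader is not supported for $this")

  // Reader function that also appends partition column values (see below)
  def buildReaderWithPartitionValues(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
}
```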
| FileFormat | Description |
|---|---|
| CSVFileFormat | Text-based data source for CSV files |
| JsonFileFormat | Text-based data source for JSON files |
| OrcFileFormat | Data source for ORC files |
| ParquetFileFormat | Data source for Parquet files |
| TextFileFormat | Data source for plain-text files |
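As a quick illustration, the short format names used with the DataFrame API resolve to the implementations above. The paths and the local SparkSession below are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fileformat-demo").master("local[*]").getOrCreate()

// "parquet" resolves to ParquetFileFormat (hypothetical path)
val events = spark.read.format("parquet").load("/tmp/events.parquet")

// "csv" resolves to CSVFileFormat (hypothetical path)
events.write
  .format("csv")
  .option("header", "true")
  .save("/tmp/events_csv")
```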
Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method
```scala
buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```
buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.
Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:
- Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema
- Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended (a sketch of this pattern follows)
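The snippet below is a minimal sketch of that "append partition values" pattern, assuming Spark 2.x-era Catalyst APIs (StructType.toAttributes, GenerateUnsafeProjection, JoinedRow). The helper appendPartitionValues is hypothetical; only the imported classes are real Spark types, and this is not a copy of Spark's own implementation.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Hypothetical helper: wraps a plain data reader so that every row it produces
// gets the file's partition column values appended.
def appendPartitionValues(
    dataReader: PartitionedFile => Iterator[InternalRow],
    requiredSchema: StructType,
    partitionSchema: StructType): PartitionedFile => Iterator[InternalRow] = {
  // Data attributes followed by partition attributes -- the layout of the joined row
  val fullSchema = requiredSchema.toAttributes ++ partitionSchema.toAttributes
  // Converter that turns the joined row into an UnsafeRow with that layout
  val converter = GenerateUnsafeProjection.generate(fullSchema, fullSchema)

  (file: PartitionedFile) => {
    val joinedRow = new JoinedRow()
    // Apply the plain data reader, then append the file's partition column values
    dataReader(file).map { dataRow =>
      converter(joinedRow(dataRow, file.partitionValues))
    }
  }
}
```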
Note: buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.
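Putting it together, the sketch below builds the reader function once and applies it to a single PartitionedFile, which is essentially what FileSourceScanExec does for every file in its input RDD. The helper readOneFile and its parameters are hypothetical; only the Spark types are real.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{FileFormat, PartitionedFile}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical helper: builds the reader function and reads one PartitionedFile with it.
def readOneFile(
    format: FileFormat,
    file: PartitionedFile,
    spark: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): Iterator[InternalRow] = {
  val readFile = format.buildReaderWithPartitionValues(
    spark, dataSchema, partitionSchema, requiredSchema, filters, options, hadoopConf)
  readFile(file)
}
```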