PartitionedFile — Part of Single File
PartitionedFile is a part (aka block) of a single file that should be read, along with the partition column values that need to be appended to each row.
Note: Partition column values are the values of the partition columns. They are part of the directory structure, not of the partitioned files themselves, which together make up the partitioned dataset.
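As a quick illustration of the note, writing out a dataset with partitionBy encodes the partition column values in directory names rather than in the data files. A minimal sketch, assuming a spark-shell session with the spark object available (the output path /tmp/partitioned-demo and the group column are made up for the example):

```scala
import org.apache.spark.sql.functions.col

// Write a tiny dataset partitioned by the group column.
// The values of group end up in directory names, not in the parquet files.
spark.range(4)
  .withColumn("group", col("id") % 2)
  .write
  .partitionBy("group")
  .parquet("/tmp/partitioned-demo")

// Resulting layout (file names abbreviated):
// /tmp/partitioned-demo/group=0/part-...snappy.parquet
// /tmp/partitioned-demo/group=1/part-...snappy.parquet
```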
PartitionedFile is created exclusively when FileSourceScanExec is requested to create RDDs for bucketed or non-bucketed reads.
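For non-bucketed reads, splittable files are broken up into byte ranges, one PartitionedFile per range. The following is a minimal sketch of that idea, not Spark's actual code; splitFile and maxSplitBytes are hypothetical names for the example:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Split a file of fileSize bytes into PartitionedFiles of at most maxSplitBytes each
def splitFile(path: String, fileSize: Long, maxSplitBytes: Long): Seq[PartitionedFile] =
  (0L until fileSize by maxSplitBytes).map { start =>
    val length = math.min(maxSplitBytes, fileSize - start)
    PartitionedFile(InternalRow.empty, path, start, length)
  }

scala> splitFile("fakePath0", 25, 10).foreach(println)
path: fakePath0, range: 0-10, partition values: [empty row]
path: fakePath0, range: 10-20, partition values: [empty row]
path: fakePath0, range: 20-25, partition values: [empty row]
```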
PartitionedFile uses the following text representation (i.e. toString):

```
path: [filePath], range: [start]-[end], partition values: [partitionValues]
```
```scala
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))

scala> println(partFile)
path: fakePath0, range: 0-10, partition values: [empty row]
```
Creating PartitionedFile Instance
PartitionedFile takes the following when created:

- Partition column values to be appended to each row (as an InternalRow)
- Path of the file to read
- Beginning offset (in bytes) of the block in the file
- Number of bytes to read
- Locality information as a list of nodes that have the data (aka locations). Empty by default (see the sketch after this list).
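Since locations has a default value, it can be omitted. A small check of the default, with the path fakePath1 and byte range made up for the example:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// No locality information given, so locations falls back to an empty array
val partFile = PartitionedFile(InternalRow.empty, "fakePath1", 0, 100)

scala> partFile.locations.isEmpty
res0: Boolean = true
```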