FileSourceScanExec Leaf Physical Operator
FileSourceScanExec is a leaf physical operator (as a DataSourceScanExec) that represents a scan over collections of files (incl. Hive tables).
FileSourceScanExec is created exclusively for a LogicalRelation logical operator with a HadoopFsRelation when FileSourceStrategy execution planning strategy is executed.
```scala
// Create a bucketed data source table
// It is one of the most complex examples of a LogicalRelation with a HadoopFsRelation
val tableName = "bucketed_4_id"
spark
  .range(100)
  .withColumn("part", $"id" % 2)
  .write
  .partitionBy("part")
  .bucketBy(4, "id")
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable(tableName)

val q = spark.table(tableName)
val sparkPlan = q.queryExecution.executedPlan

scala> println(sparkPlan.numberedTreeString)
00 *(1) FileScan parquet default.bucketed_4_id[id#7L,part#8L] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 4 out of 4

import org.apache.spark.sql.execution.FileSourceScanExec
val scan = sparkPlan.collectFirst { case exec: FileSourceScanExec => exec }.get

scala> :type scan
org.apache.spark.sql.execution.FileSourceScanExec

scala> scan.metadata.toSeq.sortBy(_._1).map { case (k, v) => s"$k -> $v" }.foreach(println)
Batched -> true
Format -> Parquet
Location -> CatalogFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id]
PartitionCount -> 2
PartitionFilters -> []
PushedFilters -> []
ReadSchema -> struct<id:bigint>
SelectedBucketsCount -> 4 out of 4
```
FileSourceScanExec supports bucket pruning so it only scans the bucket files required for a query.
```scala
scala> :type scan
org.apache.spark.sql.execution.FileSourceScanExec

import org.apache.spark.sql.execution.datasources.FileScanRDD
val rdd = scan.inputRDDs.head.asInstanceOf[FileScanRDD]

import org.apache.spark.sql.execution.datasources.FilePartition
val bucketFiles = for {
  FilePartition(bucketId, files) <- rdd.filePartitions
  f <- files
} yield s"Bucket $bucketId => $f"

scala> println(bucketFiles.size)
51

scala> bucketFiles.foreach(println)
Bucket 0 => path: file:///Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id/part=0/part-00004-5301d371-01c3-47d4-bb6b-76c3c94f3699_00000.c000.snappy.parquet, range: 0-423, partition values: [0]
Bucket 0 => path: file:///Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id/part=0/part-00001-5301d371-01c3-47d4-bb6b-76c3c94f3699_00000.c000.snappy.parquet, range: 0-423, partition values: [0]
...
Bucket 3 => path: file:///Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id/part=1/part-00005-5301d371-01c3-47d4-bb6b-76c3c94f3699_00003.c000.snappy.parquet, range: 0-423, partition values: [1]
Bucket 3 => path: file:///Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id/part=1/part-00000-5301d371-01c3-47d4-bb6b-76c3c94f3699_00003.c000.snappy.parquet, range: 0-431, partition values: [1]
Bucket 3 => path: file:///Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id/part=1/part-00007-5301d371-01c3-47d4-bb6b-76c3c94f3699_00003.c000.snappy.parquet, range: 0-423, partition values: [1]
```
FileSourceScanExec uses a HashPartitioning or the default UnknownPartitioning as the output partitioning scheme.
FileSourceScanExec is a ColumnarBatchScan and supports batch decoding only when the FileFormat (of the HadoopFsRelation) supports it.
FileSourceScanExec always gives the single inputRDD as the only RDD of internal rows (in Whole-Stage Java Code Generation).
FileSourceScanExec supports data source filters that are printed out to the console (at INFO logging level) and available as metadata (e.g. in web UI or explain).
```text
Pushed Filters: [pushedDownFilters]
```
FileSourceScanExec uses the following performance metrics (displayed in web UI):

| Key | Name (in web UI) | Description |
|---|---|---|
| metadataTime | metadata time (ms) | |
| numFiles | number of files | |
| numOutputRows | number of output rows | |
| scanTime | scan time | |
As a DataSourceScanExec, FileSourceScanExec uses Scan for the prefix of the node name.
```scala
val fileScanExec: FileSourceScanExec = ... // see the example earlier
assert(fileScanExec.nodeName startsWith "Scan")
```
FileSourceScanExec uses File for nodeNamePrefix (that is used for the simple node description in query plans).
```scala
val fileScanExec: FileSourceScanExec = ... // see the example earlier
assert(fileScanExec.nodeNamePrefix == "File")

scala> println(fileScanExec.simpleString)
FileScan csv [id#20,name#21,city#22] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/datasets/people.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string,city:string>
```
FileSourceScanExec uses the following internal properties (e.g. registries and counters):

| Name | Description |
|---|---|
| metadata | Metadata of the scan as key-value pairs (e.g. Batched, Format, Location, PushedFilters, ReadSchema; see the first example above) |
| pushedDownFilters | Data source filters that are dataFilters expressions converted to their respective filters |
Tip: Enable INFO logging level for the org.apache.spark.sql.execution.FileSourceScanExec logger to see what happens inside. Add the following line to conf/log4j.properties: log4j.logger.org.apache.spark.sql.execution.FileSourceScanExec=INFO. Refer to Logging.
Creating RDD for Non-Bucketed Reads — createNonBucketedReadRDD Internal Method
```scala
createNonBucketedReadRDD(
  readFile: (PartitionedFile) => Iterator[InternalRow],
  selectedPartitions: Seq[PartitionDirectory],
  fsRelation: HadoopFsRelation): RDD[InternalRow]
```
createNonBucketedReadRDD splits the files of the input selectedPartitions into PartitionedFile splits no larger than the maximum split size (derived from the spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes configuration properties and the default parallelism), bin-packs the splits (largest first) into FilePartitions and, in the end, creates a FileScanRDD (with the input readFile for the read function and the FilePartitions for partitions).
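A minimal sketch of how the maximum split size could be derived; the concrete values and the fileSizes input below are assumptions for illustration only (the actual computation happens inside createNonBucketedReadRDD):

```scala
// Illustration only: deriving a maximum split size from the file-based
// configuration properties (defaults shown) and the total size of the scan.
val defaultMaxSplitBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes (default: 128MB)
val openCostInBytes = 4L * 1024 * 1024        // spark.sql.files.openCostInBytes (default: 4MB)
val defaultParallelism = 8                    // e.g. spark.sparkContext.defaultParallelism

val fileSizes = Seq(400L * 1024 * 1024, 60L * 1024 * 1024) // assumed file lengths in bytes
val totalBytes = fileSizes.map(_ + openCostInBytes).sum
val bytesPerCore = totalBytes / defaultParallelism

// Every FilePartition is then filled up (greedily, largest splits first) to maxSplitBytes.
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```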
Note: createNonBucketedReadRDD is used exclusively when the FileSourceScanExec physical operator is requested for the inputRDD (and the optional bucketing specification of the HadoopFsRelation is not defined or bucketing is not enabled).
selectedPartitions Internal Lazy-Initialized Property
```scala
selectedPartitions: Seq[PartitionDirectory]
```
selectedPartitions requests the FileIndex (of the HadoopFsRelation) to list the files matching the partitionFilters and dataFilters expressions (tracking the time taken as the metadata time metric).
Note: selectedPartitions is a Scala lazy value which is computed once when accessed and cached afterwards.
Creating FileSourceScanExec Instance
FileSourceScanExec takes the following when created:
- HadoopFsRelation
- Output schema attributes
- requiredSchema
- partitionFilters expressions
- Bucket IDs for bucket pruning (optional)
- dataFilters expressions
- Optional TableIdentifier
FileSourceScanExec initializes the internal registries and counters.
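Continuing the bucketed table example from the beginning of this page, a couple of the constructor arguments can be inspected directly on the collected operator (a quick sanity check, not an exhaustive listing):

```scala
// Continues the bucketed_4_id example above.
// The relation is bucketed and the output matches the scanned columns.
assert(scan.relation.bucketSpec.isDefined)
assert(scan.output.map(_.name) == Seq("id", "part"))
```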
Output Partitioning Scheme — outputPartitioning Attribute
```scala
outputPartitioning: Partitioning
```
Note: outputPartitioning is part of the SparkPlan Contract to specify output data partitioning.
outputPartitioning can be one of the following:
- HashPartitioning (with the bucket column names and the number of buckets of the bucketing specification of the HadoopFsRelation) when bucketing is enabled and the HadoopFsRelation has a bucketing specification defined
- UnknownPartitioning (with 0 partitions) otherwise
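For the bucketed table from the first example above (with bucketing enabled, which is the default), the output partitioning should therefore be a HashPartitioning with as many partitions as buckets; a quick check:

```scala
// Continues the bucketed_4_id example above (4 buckets on the id column).
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning
assert(scan.outputPartitioning.isInstanceOf[HashPartitioning])
assert(scan.outputPartitioning.numPartitions == 4)
```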
Creating FileScanRDD with Bucketing Support — createBucketedReadRDD Internal Method
```scala
createBucketedReadRDD(
  bucketSpec: BucketSpec,
  readFile: (PartitionedFile) => Iterator[InternalRow],
  selectedPartitions: Seq[PartitionDirectory],
  fsRelation: HadoopFsRelation): RDD[InternalRow]
```
createBucketedReadRDD prints the following INFO message to the logs:
```text
Planning with [numBuckets] buckets
```
createBucketedReadRDD maps the available files of the input selectedPartitions into PartitionedFiles. For every file, createBucketedReadRDD requests the block locations (getBlockLocations) and the block hosts (getBlockHosts).
createBucketedReadRDD then groups the PartitionedFiles by bucket ID.
Note: Bucket IDs are of the format _0000n, i.e. the bucket ID prefixed with up to four 0s.
createBucketedReadRDD prunes (filters out) the bucket files for the bucket IDs that are not listed in the bucket IDs for bucket pruning.
createBucketedReadRDD creates a FilePartition for every bucket ID and the (pruned) bucket PartitionedFiles.
In the end, createBucketedReadRDD creates a FileScanRDD (with the input readFile for the read function and the FilePartitions for every bucket ID for partitions).
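Continuing the bucketed table example from the beginning of this page, the resulting FileScanRDD should have one FilePartition per bucket ID (4 buckets, hence 4 partitions), which can be verified as follows:

```scala
// Continues the bucketed_4_id example above: one FilePartition per bucket ID.
import org.apache.spark.sql.execution.datasources.FileScanRDD
val fileScanRDD = scan.inputRDDs.head.asInstanceOf[FileScanRDD]
assert(fileScanRDD.filePartitions.size == 4)
assert(fileScanRDD.filePartitions.map(_.index).sorted == Seq(0, 1, 2, 3))
```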
Note: createBucketedReadRDD is used exclusively when FileSourceScanExec physical operator is requested for the inputRDD (and the optional bucketing specification of the HadoopFsRelation is defined and bucketing is enabled).
supportsBatch Attribute
```scala
supportsBatch: Boolean
```
Note: supportsBatch is part of the ColumnarBatchScan Contract to enable vectorized decoding.
supportsBatch is enabled (i.e. true) only when the FileFormat (of the HadoopFsRelation) supports vectorized decoding.
Otherwise, supportsBatch is disabled (i.e. false).
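For the Parquet-based bucketed table from the first example above, the physical plan already reports Batched: true, so the flag should be on:

```scala
// Continues the bucketed_4_id (parquet) example above: ParquetFileFormat
// supports vectorized decoding for this flat schema, hence Batched: true.
assert(scan.supportsBatch)
```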
FileSourceScanExec As ColumnarBatchScan
FileSourceScanExec is a ColumnarBatchScan and supports batch decoding only when the FileFormat (of the HadoopFsRelation) supports it.
FileSourceScanExec has needsUnsafeRowConversion flag enabled for ParquetFileFormat data sources exclusively.
FileSourceScanExec delegates vectorTypes to the FileFormat of the HadoopFsRelation (see vectorTypes below).
needsUnsafeRowConversion Flag
```scala
needsUnsafeRowConversion: Boolean
```
Note: needsUnsafeRowConversion is part of ColumnarBatchScan Contract to control the name of the variable for an input row while generating the Java source code to consume generated columns or row from a physical operator.
needsUnsafeRowConversion is enabled (i.e. true) when the following conditions all hold:
- FileFormat of the HadoopFsRelation is ParquetFileFormat
- spark.sql.parquet.enableVectorizedReader configuration property is enabled (default: true)
Otherwise, needsUnsafeRowConversion is disabled (i.e. false).
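Both conditions hold for the Parquet-based example at the beginning of this page (the vectorized reader is enabled by default), so a quick check:

```scala
// Continues the bucketed_4_id (parquet) example above.
assert(spark.conf.get("spark.sql.parquet.enableVectorizedReader") == "true")
assert(scan.needsUnsafeRowConversion)
```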
|
Note
|
needsUnsafeRowConversion is used when FileSourceScanExec is executed (and supportsBatch flag is off).
|
Requesting Concrete ColumnVector Class Names — vectorTypes Method
```scala
vectorTypes: Option[Seq[String]]
```
Note: vectorTypes is part of the ColumnarBatchScan Contract to specify the concrete (possibly specialized) ColumnVector class names per column (used while generating the Java source code for vectorized decoding).
vectorTypes simply requests the FileFormat of the HadoopFsRelation for vectorTypes.
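For the Parquet example above, ParquetFileFormat reports one concrete ColumnVector class name per column of the scan (on-heap by default; the exact class name depends on the Spark version and whether off-heap column vectors are enabled):

```scala
// Continues the bucketed_4_id (parquet) example above.
scala> scan.vectorTypes.foreach(_.foreach(println))
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
```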
Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method
```scala
doExecute(): RDD[InternalRow]
```
Note: doExecute is part of the SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).
doExecute branches off per supportsBatch flag.
If supportsBatch is on, doExecute creates a WholeStageCodegenExec (with codegenStageId as 0) and executes it right after.
If supportsBatch is off, doExecute creates an unsafeRows RDD to scan over, which differs depending on the needsUnsafeRowConversion flag.
If needsUnsafeRowConversion flag is on, doExecute takes the inputRDD and creates a new RDD by applying a function to each partition (using RDD.mapPartitionsWithIndexInternal):
- Creates an UnsafeProjection for the schema
- Initializes the UnsafeProjection
- Maps over the rows in a partition iterator using the UnsafeProjection projection
Otherwise, doExecute simply takes the inputRDD as the unsafeRows RDD (with no changes).
doExecute takes the numOutputRows metric and creates a new RDD by mapping every element in the unsafeRows and incrementing the numOutputRows metric.
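A minimal sketch of that conversion, following the shape described above; the real code uses the internal RDD.mapPartitionsWithIndexInternal, and the helper below is illustrative only:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.StructType

// Illustrative helper: convert the rows of an input RDD to UnsafeRows,
// creating one UnsafeProjection per partition (initialized with the partition index).
def toUnsafeRows(inputRDD: RDD[InternalRow], schema: StructType): RDD[InternalRow] =
  inputRDD.mapPartitionsWithIndex { (index, iter) =>
    val proj = UnsafeProjection.create(schema)
    proj.initialize(index)
    iter.map(proj)
  }
```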
Tip: Use RDD.toDebugString (on the RDD that doExecute generates) to review the RDD lineage. With supportsBatch off and needsUnsafeRowConversion on you should see two more RDDs in the RDD lineage.
Creating Input RDD of Internal Rows — inputRDD Internal Property
```scala
inputRDD: RDD[InternalRow]
```
Note: inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.
inputRDD is an input RDD of internal binary rows (i.e. InternalRow) that is used when FileSourceScanExec physical operator is requested for inputRDDs and execution.
When created, inputRDD requests HadoopFsRelation to get the underlying FileFormat that is in turn requested to build a data reader with partition column values appended (with the input parameters from the properties of HadoopFsRelation and pushedDownFilters).
When the HadoopFsRelation has a bucketing specification defined and bucketing support is enabled, inputRDD creates a FileScanRDD with bucketing support using createBucketedReadRDD (with the bucketing specification, the reader, selectedPartitions and the HadoopFsRelation itself). Otherwise, inputRDD uses createNonBucketedReadRDD instead.
Note: createBucketedReadRDD accepts a bucketing specification while createNonBucketedReadRDD does not.
Output Data Ordering — outputOrdering Attribute
```scala
outputOrdering: Seq[SortOrder]
```
Note: outputOrdering is part of the SparkPlan Contract to specify output data ordering.
outputOrdering is a SortOrder expression for every sort column in Ascending order only when all the following hold:
- The HadoopFsRelation has a bucketing specification defined
- Every bucket has exactly one file
Otherwise, outputOrdering is simply empty (Nil).
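In the bucketed table example from the beginning of this page the buckets span multiple files (51 files across 4 buckets), so no output ordering is reported even though the table was written with sortBy:

```scala
// Continues the bucketed_4_id example above: multiple files per bucket,
// so the sort order of the bucket files cannot be guaranteed.
assert(scan.outputOrdering.isEmpty)
```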