DataSourceScanExec Contract — Leaf Physical Operators to Scan Over BaseRelation
DataSourceScanExec is the contract of leaf physical operators that represent scans over BaseRelation.
Note
There are two DataSourceScanExecs: FileSourceScanExec, which scans data in a HadoopFsRelation, and RowDataSourceScanExec, which scans data in a generic BaseRelation.
DataSourceScanExec supports Java code generation (aka codegen).

```scala
package org.apache.spark.sql.execution

trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
  // only required vals and methods that have no implementation
  // the others follow
  def metadata: Map[String, String]
  val relation: BaseRelation
  val tableIdentifier: Option[TableIdentifier]
}
```
Property | Description |
---|---|
metadata | Metadata (as a collection of key-value pairs) that describes the scan when requested for the simple text representation. |
relation | BaseRelation that is used in the node name and…FIXME |
Note
The prefix for variable names of DataSourceScanExec operators in the generated Java source code is scan.
The default node name prefix is an empty string (that is used in the simple node description).
DataSourceScanExec uses the BaseRelation and the TableIdentifier as the node name in the following format:

```
Scan [relation] [tableIdentifier]
```
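The node name format can be illustrated with a minimal sketch in plain Scala (no Spark dependency). `TableId`, `unquoted`, and `nodeName` are hypothetical stand-ins introduced only for illustration. The sketch assumes that the identifier is rendered unquoted and that a missing identifier contributes an empty string.

```scala
object NodeNameSketch {
  // Hypothetical stand-in for Spark's TableIdentifier (an assumption for illustration).
  final case class TableId(table: String, database: Option[String] = None) {
    // "db.table" when a database is given, otherwise just "table".
    def unquoted: String = database.map(db => s"$db.$table").getOrElse(table)
  }

  // Builds "Scan [relation] [tableIdentifier]".
  // A missing identifier renders as an empty string, which leaves a trailing space.
  def nodeName(relation: Any, tableIdentifier: Option[TableId]): String =
    s"Scan $relation ${tableIdentifier.map(_.unquoted).getOrElse("")}"
}
```

With no table identifier, the name keeps the trailing space after the relation. This is consistent with the space before the output brackets in the simpleString sample in this section.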
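DataSourceScanExec | Description |
---|---|
FileSourceScanExec | Scan over data in HadoopFsRelation |
RowDataSourceScanExec | Scan over data in a generic BaseRelation |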
Simple (Basic) Text Node Description (in Query Plan Tree) — simpleString Method

```scala
simpleString: String
```
Note
simpleString is part of the QueryPlan Contract to give the simple text description of a TreeNode in a query plan tree.
simpleString creates a text representation of every key-value entry in the metadata…FIXME
Internally, simpleString sorts the metadata and concatenates the keys and the values (separated by a colon). While doing so, simpleString redacts sensitive information in every value and abbreviates it to the first 100 characters.

simpleString uses truncatedString from Spark Core’s Utils to join the entries into a single string.
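The sort, redact, and abbreviate steps above can be sketched in plain Scala without Spark. Everything here is an assumption made for illustration: the `password=` pattern, the replacement text, the `...` suffix on abbreviated values, and the use of `mkString` as a simplified stand-in for truncatedString.

```scala
object MetadataEntriesSketch {
  // Hypothetical redaction rule: mask anything that looks like "password=<value>".
  private val sensitive = "(?i)password=\\S+".r

  def redact(text: String): String =
    sensitive.replaceAllIn(text, "*********(redacted)")

  // Values of up to maxLen characters are kept as-is; longer ones are cut
  // so that the result, including the "..." suffix, is exactly maxLen long.
  def abbreviate(text: String, maxLen: Int = 100): String =
    if (text.length <= maxLen) text else text.take(maxLen - 3) + "..."

  // Sort entries by key, then render each as "key: value"
  // with the value redacted first and abbreviated second.
  def metadataEntries(metadata: Map[String, String]): Seq[String] =
    metadata.toSeq.sorted.map { case (key, value) =>
      s"$key: ${abbreviate(redact(value))}"
    }

  // Simplified stand-in for truncatedString: joins all entries with ", ".
  def render(metadata: Map[String, String]): String =
    metadataEntries(metadata).mkString(", ")
}
```

Redacting before abbreviating matters in this sketch: a secret that straddles the 100-character boundary would otherwise be cut in half and might no longer match the redaction pattern.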
In the end, simpleString returns a text representation that is made up of the nodeNamePrefix, the nodeName, the output (schema attributes) and the metadata, in the following format:

```
[nodeNamePrefix][nodeName][[output]][metadata]
```
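As a rough illustration of this format, the following plain-Scala sketch assembles the four parts. The separators (a comma between output attributes, a leading space and `, ` between metadata entries) are assumptions chosen to agree with the sample output in this section. The `simpleString` helper here is hypothetical and takes the parts as plain strings.

```scala
object SimpleStringSketch {
  // Concatenates prefix, node name, bracketed output attributes and metadata entries.
  def simpleString(
      nodeNamePrefix: String,
      nodeName: String,
      output: Seq[String],
      metadataEntries: Seq[String]): String = {
    // Output attributes are wrapped in square brackets, e.g. "[id,name]" or "[]".
    val outputStr = output.mkString("[", ",", "]")
    // Metadata entries follow after a single space, separated by ", ".
    val metadataStr = metadataEntries.mkString(" ", ", ", "")
    s"$nodeNamePrefix$nodeName$outputStr$metadataStr"
  }
}
```

For example, an empty prefix, the node name `Scan rel ` (with its trailing space), no output attributes, and the entries `PushedFilters: []` and `ReadSchema: struct<>` give `Scan rel [] PushedFilters: [], ReadSchema: struct<>`.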
```scala
// Run in spark-shell, where sc and spark are already in scope
def basicDataSourceScanExec = {
  // RDD is needed for the rdd value and inside the relation
  import org.apache.spark.rdd.RDD

  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  val output = Seq.empty[AttributeReference]
  val requiredColumnsIndex = output.indices

  import org.apache.spark.sql.sources.Filter
  val filters, handledFilters = Set.empty[Filter]

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeRow
  val row: InternalRow = new UnsafeRow(0)
  val rdd: RDD[InternalRow] = sc.parallelize(row :: Nil)

  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  val baseRelation: BaseRelation = new BaseRelation with TableScan {
    import org.apache.spark.sql.SQLContext
    val sqlContext: SQLContext = spark.sqlContext

    import org.apache.spark.sql.types.StructType
    val schema: StructType = new StructType()

    import org.apache.spark.sql.Row
    // Never called in this example
    def buildScan(): RDD[Row] = ???
  }

  val tableIdentifier = None
  import org.apache.spark.sql.execution.RowDataSourceScanExec
  RowDataSourceScanExec(
    output, requiredColumnsIndex, filters, handledFilters, rdd, baseRelation, tableIdentifier)
}

val scanExec = basicDataSourceScanExec
scala> println(scanExec.simpleString)
Scan $line143.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1@57d94b26 [] PushedFilters: [], ReadSchema: struct<>
```
verboseString Method

```scala
verboseString: String
```
Note
verboseString is part of the QueryPlan Contract to…FIXME.
verboseString simply takes the verboseString of the parent QueryPlan, redacts sensitive information in it, and returns the result.
Text Representation of All Nodes in Tree — treeString Method

```scala
treeString(verbose: Boolean, addSuffix: Boolean): String
```
Note
treeString is part of the TreeNode Contract to…FIXME.
treeString simply takes the text representation of all nodes in the query plan tree (from the parent TreeNode), redacts sensitive information in it, and returns the result.
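Both verboseString and treeString follow the same pattern: ask the parent for its text and redact the result. The plain-Scala sketch below illustrates that pattern only. `BasePlan`, `ScanPlan`, the `password=` pattern, and the replacement text are hypothetical and are not Spark's actual classes or redaction rules.

```scala
object RedactedStringsSketch {
  // Hypothetical parent that stands in for QueryPlan / TreeNode.
  class BasePlan(description: String) {
    def verboseString: String = description
    def treeString(verbose: Boolean, addSuffix: Boolean): String = s"+- $description"
  }

  // Hypothetical redaction rule: mask anything that looks like "password=<value>".
  private def redact(text: String): String =
    text.replaceAll("(?i)password=\\S+", "*********(redacted)")

  // The scan operator overrides both methods and only post-processes the parent's text.
  class ScanPlan(description: String) extends BasePlan(description) {
    override def verboseString: String = redact(super.verboseString)

    override def treeString(verbose: Boolean, addSuffix: Boolean): String =
      redact(super.treeString(verbose, addSuffix))
  }
}
```

Keeping the redaction in the overriding methods means that the parent's formatting logic stays untouched and any caller that prints the plan gets the redacted text.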