BaseRelation — Collection of Tuples with Schema
Note: “Data source”, “relation” and “table” are often used as synonyms.
```scala
package org.apache.spark.sql.sources

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}
```
| Method | Description |
|---|---|
| `schema` | `StructType` that describes the schema of tuples |
| `sqlContext` | `SQLContext` the relation is associated with |
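To illustrate the contract, here is a minimal plain-Scala sketch. `StructField`, `StructType`, `SQLContext` and `PeopleRelation` below are simplified stand-ins for illustration only, not Spark's actual classes.

```scala
// Simplified stand-ins for Spark's StructType / SQLContext (illustration only)
case class StructField(name: String, dataType: String)
case class StructType(fields: Seq[StructField])
class SQLContext

// The BaseRelation contract: two abstract members, nothing else is required
abstract class BaseRelation {
  def schema: StructType
  def sqlContext: SQLContext
}

// A hypothetical concrete relation only has to provide both members
class PeopleRelation(val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType =
    StructType(Seq(StructField("id", "long"), StructField("name", "string")))
}
```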
BaseRelation is “created” when DataSource is requested to resolve a relation.
BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.
BaseRelation uses needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.
Note: It is recommended that custom data sources (outside Spark SQL) leave the needConversion flag enabled, i.e. true.
BaseRelation can optionally give an estimated size (in bytes).
Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method
```scala
needConversion: Boolean
```
needConversion flag is enabled (true) by default.
Note: It is recommended to leave needConversion enabled for data sources outside Spark SQL.
Note: needConversion is used exclusively when the DataSourceStrategy execution planning strategy is executed (and does the RDD conversion from `RDD[Row]` to `RDD[InternalRow]`).
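As a rough plain-Scala model of that conversion (`Utf8Str`, `toCatalyst` and `convertRow` are hypothetical stand-ins, not Spark's API), enabling needConversion means every external value in a row gets mapped to its internal Catalyst representation:

```scala
// Stand-in for Spark's internal UTF8String (illustration only)
final case class Utf8Str(bytes: Array[Byte])

// Convert one external value to its (modeled) Catalyst representation
def toCatalyst(value: Any): Any = value match {
  case s: String => Utf8Str(s.getBytes("UTF-8")) // java.lang.String -> UTF8String
  case other     => other                        // primitives pass through
}

// What the planner effectively does per row when needConversion is true
def convertRow(row: Seq[Any], needConversion: Boolean): Seq[Any] =
  if (needConversion) row.map(toCatalyst) else row
```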
Finding Unhandled Filter Predicates — unhandledFilters Method
```scala
unhandledFilters(filters: Array[Filter]): Array[Filter]
```
unhandledFilters returns Filter predicates that the data source does not support (handle) natively.
Note: unhandledFilters returns the input filters by default as it is considered safe to double-evaluate filters, regardless of whether they could be supported or not.
Note: unhandledFilters is used exclusively when the DataSourceStrategy execution planning strategy is requested to selectFilters.
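A plain-Scala sketch of this contract (the `Filter` case classes here are simplified stand-ins for `org.apache.spark.sql.sources.Filter`, and `eqSourceUnhandledFilters` is a hypothetical override): the default reports every filter as unhandled, while a source that pushes equality predicates down natively subtracts them from the input.

```scala
// Simplified stand-ins for Spark's Filter hierarchy (illustration only)
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// Default behavior: every filter is unhandled (safe to double-evaluate)
def defaultUnhandledFilters(filters: Array[Filter]): Array[Filter] = filters

// A hypothetical source that handles EqualTo predicates natively
def eqSourceUnhandledFilters(filters: Array[Filter]): Array[Filter] =
  filters.filterNot(_.isInstanceOf[EqualTo])
```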
Requesting Estimated Size — sizeInBytes Method
```scala
sizeInBytes: Long
```
sizeInBytes is the estimated size of a relation (used in query planning).
Note: sizeInBytes defaults to the spark.sql.defaultSizeInBytes internal property (i.e. infinite).
Note: sizeInBytes is used exclusively when LogicalRelation is requested to computeStats (and the statistics are not available in CatalogTable).
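To see why the "infinite" default matters, here is a hedged sketch of how a planner can use sizeInBytes: Spark's spark.sql.autoBroadcastJoinThreshold defaults to 10 MB, while spark.sql.defaultSizeInBytes is effectively infinite (`Long.MaxValue`), so a relation with no size estimate is never considered small enough to broadcast (`canBroadcast` below is a hypothetical helper, not Spark's API).

```scala
// 10 MB: the default value of spark.sql.autoBroadcastJoinThreshold
val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

// Hypothetical planner check: broadcast only relations below the threshold
def canBroadcast(sizeInBytes: Long): Boolean =
  sizeInBytes <= autoBroadcastJoinThreshold

// spark.sql.defaultSizeInBytes is Long.MaxValue ("infinite"), so an
// unestimated relation never passes the broadcast check
val defaultSizeInBytes: Long = Long.MaxValue
```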