BaseRelation — Collection of Tuples with Schema
Note: "Data source", "relation" and "table" are often used as synonyms.
```scala
package org.apache.spark.sql.sources

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}
```
Method | Description
---|---
`schema` | StructType that describes the schema of tuples
`sqlContext` | SQLContext
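A minimal sketch of a custom relation (the `MyBaseRelation` name is hypothetical) only has to define the two required properties; a relation that can actually produce rows would also mix in one of the scan contracts (e.g. `TableScan`).

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical relation with a fixed two-column schema
class MyBaseRelation(val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType = StructType(
    StructField("id", LongType, nullable = false) ::
    StructField("name", StringType) :: Nil)
}
```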
`BaseRelation` is "created" when `DataSource` is requested to resolve a relation.

`BaseRelation` is transformed into a `DataFrame` when `SparkSession` is requested to create a `DataFrame`.
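For example, a relation can be turned into a `DataFrame` directly with `SparkSession.baseRelationToDataFrame`. A sketch, assuming a `SparkSession` in scope as `spark` and the hypothetical `MyBaseRelation` above:

```scala
// Wraps the relation in a LogicalRelation and creates a DataFrame over it
val relation = new MyBaseRelation(spark.sqlContext)
val df = spark.baseRelationToDataFrame(relation)
df.printSchema()
```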
`BaseRelation` uses the `needConversion` flag to control type conversion of objects inside `Row`s to Catalyst types, e.g. `java.lang.String` to `UTF8String`.
Note: It is recommended that custom data sources (outside Spark SQL) leave the `needConversion` flag enabled, i.e. `true`.
`BaseRelation` can optionally give an estimated size (in bytes).
BaseRelation | Description
---|---
`HadoopFsRelation` | Relation over data stored in files (e.g. the parquet, json and csv data sources)
`JDBCRelation` | Relation over a table in an external database accessed over JDBC
`KafkaRelation` | Relation with records from Apache Kafka
Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method
```scala
needConversion: Boolean
```
The `needConversion` flag is enabled (`true`) by default.
Note: It is recommended to leave `needConversion` enabled for data sources outside Spark SQL.
Note: `needConversion` is used exclusively when the `DataSourceStrategy` execution planning strategy is executed (and does the RDD conversion from `RDD[Row]` to `RDD[InternalRow]`).
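A sketch of opting out, building on the hypothetical `MyBaseRelation` above, for a relation that already produces values in the Catalyst internal representation:

```scala
// Skip the Row-to-InternalRow conversion since the rows are
// already in the Catalyst internal representation
val internalRowRelation = new MyBaseRelation(spark.sqlContext) {
  override def needConversion: Boolean = false
}
```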
Finding Unhandled Filter Predicates — unhandledFilters Method
```scala
unhandledFilters(filters: Array[Filter]): Array[Filter]
```
`unhandledFilters` returns the `Filter` predicates that the data source does not support (handle) natively.
Note: `unhandledFilters` returns the input `filters` by default, as it is considered safe to double-evaluate filters regardless of whether they could be supported or not.
Note: `unhandledFilters` is used exclusively when the `DataSourceStrategy` execution planning strategy is requested to `selectFilters`.
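A sketch of overriding it, again on the hypothetical `MyBaseRelation`, for a relation that handles equality predicates natively and leaves everything else to Spark:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Claim EqualTo predicates as handled by the data source;
// report all other filters back to Spark for re-evaluation
val filteringRelation = new MyBaseRelation(spark.sqlContext) {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
```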
Requesting Estimated Size — sizeInBytes Method
```scala
sizeInBytes: Long
```
`sizeInBytes` is the estimated size of a relation (used in query planning).
Note: `sizeInBytes` defaults to the `spark.sql.defaultSizeInBytes` internal property (i.e. infinite).
Note: `sizeInBytes` is used exclusively when `LogicalRelation` is requested to `computeStats` (and they are not available in `CatalogTable`).
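A sketch of overriding it, on the hypothetical `MyBaseRelation`, for a relation whose size is known up front:

```scala
// Report a 10 MB estimate instead of the "infinite" default, so the
// planner can consider this relation small enough for a broadcast join
val sizedRelation = new MyBaseRelation(spark.sqlContext) {
  override def sizeInBytes: Long = 10L * 1024 * 1024
}
```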