
BaseRelation — Collection of Tuples with Schema

BaseRelation is the contract of relations (aka collections of tuples) with a known schema.

Note
“Data source”, “relation” and “table” are often used as synonyms.

Table 1. (Subset of) BaseRelation Contract

Method       Description
schema       StructType that describes the schema of tuples
sqlContext   SQLContext
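
The contract boils down to these two abstract methods. A minimal Scala sketch of a custom relation, assuming a hypothetical DemoRelation with a hard-coded two-column schema:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical relation that only declares its schema and SQLContext
class DemoRelation(override val sqlContext: SQLContext) extends BaseRelation {

  // schema: StructType describing the tuples of this relation
  override def schema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)))
}
```

On its own such a relation cannot be scanned; a real data source also mixes in one of the scan traits (e.g. TableScan or PrunedFilteredScan).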

BaseRelation is “created” when DataSource is requested to resolve a relation.

BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.
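
A short sketch of that entry point, SparkSession.baseRelationToDataFrame, using a hypothetical anonymous relation that also mixes in TableScan so it can actually produce rows:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical relation with two hard-coded rows
val relation = new BaseRelation with TableScan {
  override val sqlContext: SQLContext = spark.sqlContext
  override def schema: StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1L), Row(2L)))
}

// Wraps the relation in a LogicalRelation and gives back a DataFrame
val df = spark.baseRelationToDataFrame(relation)
df.show()
```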

BaseRelation uses the needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.

Note
Custom data sources (outside Spark SQL) should leave the needConversion flag enabled, i.e. true.

BaseRelation can optionally give an estimated size (in bytes).

Table 2. BaseRelations

BaseRelation       Description
ConsoleRelation    Used in Spark Structured Streaming
HadoopFsRelation
JDBCRelation
KafkaRelation      Datasets with records from Apache Kafka
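
As an illustration, reading a Parquet directory resolves to a HadoopFsRelation under the covers. The sketch below (the path is a placeholder, and LogicalRelation is an internal Spark SQL class) prints the concrete BaseRelation backing the DataFrame:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.LogicalRelation

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write a tiny Parquet dataset and read it back (placeholder path)
Seq((1L, "one"), (2L, "two")).toDF("id", "name")
  .write.mode("overwrite").parquet("/tmp/demo.parquet")
val df = spark.read.parquet("/tmp/demo.parquet")

// Collect the BaseRelation(s) behind the analyzed logical plan
val relations = df.queryExecution.analyzed.collect {
  case lr: LogicalRelation => lr.relation.getClass.getName
}
println(relations.mkString(", "))  // expect ...HadoopFsRelation
```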

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method

needConversion flag is enabled (true) by default.

Note
It is recommended to leave needConversion enabled for data sources outside Spark SQL.
Note
needConversion is used exclusively when DataSourceStrategy execution planning strategy is executed (and does the RDD conversion from RDD[Row] to RDD[InternalRow]).
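
A minimal sketch of how this looks in a custom relation (the class, column and data below are hypothetical); an external data source simply keeps the default and returns plain JVM objects from its scan:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class NamesRelation(override val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("name", StringType, nullable = true)))

  // Keep the default (true): buildScan returns Rows holding regular JVM objects
  // (java.lang.String here) and Spark converts them to Catalyst types (UTF8String).
  override def needConversion: Boolean = true

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("alice"), Row("bob")))
}
```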

Finding Unhandled Filter Predicates — unhandledFilters Method

unhandledFilters returns Filter predicates that the data source does not support (handle) natively.

Note
unhandledFilters returns the input filters by default, as it is considered safe to double-evaluate filters regardless of whether they could be supported or not.
Note
unhandledFilters is used exclusively when DataSourceStrategy execution planning strategy is requested to selectFilters.
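
A sketch of a relation that handles equality filters on one column natively and reports everything else as unhandled (the relation, column names and data are hypothetical):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

class KeyValueRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  private val data = Seq(1L -> "one", 2L -> "two", 3L -> "three")

  override def schema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("value", StringType, nullable = true)))

  // This source can only evaluate EqualTo on the id column natively;
  // every other filter must be (re-)evaluated by Spark.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("id", _) => true
      case _                => false
    }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Apply the pushed-down EqualTo("id", ...) filters, if any
    val wanted = filters.collect { case EqualTo("id", v: Long) => v }.toSet
    val rows = data
      .filter { case (id, _) => wanted.isEmpty || wanted(id) }
      .map { case (id, value) =>
        // Only the requested columns, in the requested order
        Row.fromSeq(requiredColumns.map {
          case "id"    => id
          case "value" => value
        })
      }
    sqlContext.sparkContext.parallelize(rows)
  }
}
```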

Requesting Estimated Size — sizeInBytes Method

sizeInBytes is the estimated size of a relation (used in query planning).

Note
sizeInBytes defaults to the spark.sql.defaultSizeInBytes internal property (i.e. infinite).
Note
sizeInBytes is used exclusively when LogicalRelation is requested to computeStats (and they are not available in CatalogTable).
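
A sketch of overriding the estimate in a custom relation (the class and the 1 MB figure are made up for illustration); reporting a small size lets the optimizer consider a broadcast join for this side:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SmallDimensionRelation(override val sqlContext: SQLContext) extends BaseRelation {

  override def schema: StructType =
    StructType(Seq(StructField("code", StringType, nullable = false)))

  // Report roughly 1 MB instead of the default spark.sql.defaultSizeInBytes
  // (effectively "infinite") so the planner can pick a broadcast join when
  // this relation is joined with a larger one.
  override def sizeInBytes: Long = 1L * 1024 * 1024
}
```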