BaseRelation — Collection of Tuples with Schema
Note: “Data source”, “relation” and “table” are often used as synonyms.
```scala
package org.apache.spark.sql.sources

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}
```
| Method | Description |
|---|---|
| `schema` | `StructType` that describes the schema of tuples |
| `sqlContext` | `SQLContext` the relation is associated with |
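To illustrate the contract, here is a minimal plain-Scala sketch. `StructField`, `StructType`, `SQLContext` and `PeopleRelation` below are simplified stand-ins for illustration only, not Spark's actual classes.

```scala
// Simplified stand-ins for Spark's StructType / SQLContext (illustration only)
case class StructField(name: String, dataType: String)
case class StructType(fields: Seq[StructField])
class SQLContext

// The BaseRelation contract: two abstract members, nothing else is required
abstract class BaseRelation {
  def schema: StructType
  def sqlContext: SQLContext
}

// A hypothetical concrete relation only has to provide both members
class PeopleRelation(val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType =
    StructType(Seq(StructField("id", "long"), StructField("name", "string")))
}
```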
BaseRelation is “created” when DataSource is requested to resolve a relation.
BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.
BaseRelation uses needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.
Note: It is recommended that custom data sources (outside Spark SQL) leave the needConversion flag enabled, i.e. true.
BaseRelation can optionally give an estimated size (in bytes).
Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method
```scala
needConversion: Boolean
```
needConversion flag is enabled (true) by default.
Note: It is recommended to leave needConversion enabled for data sources outside Spark SQL.
Note: needConversion is used exclusively when the DataSourceStrategy execution planning strategy is executed (and does the RDD conversion from `RDD[Row]` to `RDD[InternalRow]`).
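As a rough plain-Scala model of that conversion (`Utf8Str`, `toCatalyst` and `convertRow` are hypothetical stand-ins, not Spark's API), enabling needConversion means every external value in a row gets mapped to its internal Catalyst representation:

```scala
// Stand-in for Spark's internal UTF8String (illustration only)
final case class Utf8Str(bytes: Array[Byte])

// Convert one external value to its (modeled) Catalyst representation
def toCatalyst(value: Any): Any = value match {
  case s: String => Utf8Str(s.getBytes("UTF-8")) // java.lang.String -> UTF8String
  case other     => other                        // primitives pass through
}

// What the planner effectively does per row when needConversion is true
def convertRow(row: Seq[Any], needConversion: Boolean): Seq[Any] =
  if (needConversion) row.map(toCatalyst) else row
```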
Finding Unhandled Filter Predicates — unhandledFilters Method
```scala
unhandledFilters(filters: Array[Filter]): Array[Filter]
```
unhandledFilters returns Filter predicates that the data source does not support (handle) natively.
Note: unhandledFilters returns the input filters by default as it is considered safe to double-evaluate filters, regardless of whether they could be supported or not.
Note: unhandledFilters is used exclusively when the DataSourceStrategy execution planning strategy is requested to selectFilters.
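A plain-Scala sketch of this contract (the `Filter` case classes here are simplified stand-ins for `org.apache.spark.sql.sources.Filter`, and `eqSourceUnhandledFilters` is a hypothetical override): the default reports every filter as unhandled, while a source that pushes equality predicates down natively subtracts them from the input.

```scala
// Simplified stand-ins for Spark's Filter hierarchy (illustration only)
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// Default behavior: every filter is unhandled (safe to double-evaluate)
def defaultUnhandledFilters(filters: Array[Filter]): Array[Filter] = filters

// A hypothetical source that handles EqualTo predicates natively
def eqSourceUnhandledFilters(filters: Array[Filter]): Array[Filter] =
  filters.filterNot(_.isInstanceOf[EqualTo])
```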
Requesting Estimated Size — sizeInBytes Method
```scala
sizeInBytes: Long
```
sizeInBytes is the estimated size of a relation (used in query planning).
Note: sizeInBytes defaults to the spark.sql.defaultSizeInBytes internal property (i.e. infinite).
Note: sizeInBytes is used exclusively when LogicalRelation is requested to computeStats (and the statistics are not available in CatalogTable).
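To see why the "infinite" default matters, here is a hedged sketch of how a planner can use sizeInBytes: Spark's spark.sql.autoBroadcastJoinThreshold defaults to 10 MB, while spark.sql.defaultSizeInBytes is effectively infinite (`Long.MaxValue`), so a relation with no size estimate is never considered small enough to broadcast (`canBroadcast` below is a hypothetical helper, not Spark's API).

```scala
// 10 MB: the default value of spark.sql.autoBroadcastJoinThreshold
val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

// Hypothetical planner check: broadcast only relations below the threshold
def canBroadcast(sizeInBytes: Long): Boolean =
  sizeInBytes <= autoBroadcastJoinThreshold

// spark.sql.defaultSizeInBytes is Long.MaxValue ("infinite"), so an
// unestimated relation never passes the broadcast check
val defaultSizeInBytes: Long = Long.MaxValue
```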