PrunedFilteredScan Contract — Relations with Column Pruning and Filter Pushdown

PrunedFilteredScan is the contract of BaseRelations with support for column pruning (i.e. eliminating unneeded columns) and filter pushdown (i.e. filtering using selected predicates only).

Table 1. PrunedFilteredScan Contract
Property Description

buildScan

Builds a distributed data scan with column pruning and filter pushdown

In other words, buildScan creates an RDD[Row] to represent a distributed data scan (i.e. scanning over data in a relation)

Used exclusively when DataSourceStrategy execution planning strategy is requested to plan a LogicalRelation with a PrunedFilteredScan.

Note
PrunedFilteredScan is a “lighter” and more stable version of the CatalystScan Contract.
Note
JDBCRelation is the one and only known implementation of the PrunedFilteredScan Contract in Spark SQL.
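
To make the contract concrete, here is a minimal sketch of a custom PrunedFilteredScan relation (not how the built-in JDBCRelation works): an in-memory key-value relation that prunes columns down to requiredColumns and evaluates EqualTo predicates itself. The class name and the data are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical in-memory relation with column pruning and filter pushdown.
class KeyValueRelation(val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  private val data = Seq(("a", 1), ("b", 2), ("c", 3))

  override def schema: StructType = StructType(Seq(
    StructField("key", StringType),
    StructField("value", IntegerType)))

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    // Filter pushdown: evaluate the predicates Spark passed in; any filter
    // this relation does not recognize is simply kept (Spark re-evaluates
    // all filters on top of the scan anyway, which is why this is safe).
    val kept = data.filter { case (k, v) =>
      filters.forall {
        case EqualTo("key", other)   => k == other
        case EqualTo("value", other) => v == other
        case _                       => true
      }
    }
    // Column pruning: emit only the requested columns, in the requested order.
    val rows = kept.map { case (k, v) =>
      Row.fromSeq(requiredColumns.map {
        case "key"   => k
        case "value" => v
      })
    }
    sqlContext.sparkContext.parallelize(rows)
  }
}
```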

InsertableRelation Contract — Non-File-Based Relations with Inserting or Overwriting Data Support

InsertableRelation is the contract of non-file-based BaseRelations that support inserting or overwriting data.

Table 1. InsertableRelation Contract
Property Description

insert

Inserts or overwrites data (as a DataFrame) in a relation

Used exclusively when InsertIntoDataSourceCommand logical command is executed

Note
JDBCRelation is the one and only known direct implementation of the InsertableRelation Contract in Spark SQL.
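
As a minimal sketch of the contract: a relation backed by an in-memory buffer (the buffer is purely for illustration; a real implementation such as JDBCRelation writes to an external store instead).

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, InsertableRelation}
import org.apache.spark.sql.types.StructType

import scala.collection.mutable.ArrayBuffer

// Hypothetical relation that supports both insert (append) and overwrite.
class BufferedRelation(val sqlContext: SQLContext, override val schema: StructType)
    extends BaseRelation with InsertableRelation {

  private val buffer = ArrayBuffer.empty[Row]

  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    if (overwrite) buffer.clear() // overwrite semantics: replace existing data
    buffer ++= data.collect()     // append the rows of the incoming DataFrame
  }
}
```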

HadoopFsRelation — Relation for File-Based Data Source

HadoopFsRelation is a BaseRelation and FileRelation.

HadoopFsRelation is created when DataSource is requested to resolve a relation for a file-based data source.

The optional BucketSpec is defined exclusively for a non-streaming file-based data source.

Creating HadoopFsRelation Instance

HadoopFsRelation takes the following when created:

  • location: FileIndex

  • partitionSchema: StructType

  • dataSchema: StructType

  • bucketSpec: Option[BucketSpec]

  • fileFormat: FileFormat

  • options: Map[String, String]

  • sparkSession: SparkSession

HadoopFsRelation initializes the internal registries and counters.
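
A quick way to see a HadoopFsRelation in action: every file-based built-in source (parquet, csv, json, orc, text) is planned as a LogicalRelation over a HadoopFsRelation. A sketch, assuming Spark 2.3+ (where LogicalRelation carries four fields) and a hypothetical parquet path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

// /tmp/demo.parquet is a hypothetical path to an existing parquet dataset.
val df = spark.read.parquet("/tmp/demo.parquet")

// Dig the HadoopFsRelation out of the analyzed logical plan.
val relation = df.queryExecution.analyzed.collectFirst {
  case LogicalRelation(rel: HadoopFsRelation, _, _, _) => rel
}
relation.foreach { r =>
  println(r.fileFormat) // e.g. ParquetFileFormat
  println(r.schema)     // data schema plus partition schema
}
```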

BaseRelation — Collection of Tuples with Schema

BaseRelation is the contract of relations (aka collections of tuples) with a known schema.

Note
“Data source”, “relation” and “table” are often used as synonyms.

Table 1. (Subset of) BaseRelation Contract
Method Description

schema

StructType that describes the schema of tuples

sqlContext

SQLContext

BaseRelation is “created” when DataSource is requested to resolve a relation.

BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.

BaseRelation uses needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.

Note
It is recommended that custom data sources (outside Spark SQL) leave the needConversion flag enabled, i.e. true.

BaseRelation can optionally give an estimated size (in bytes).
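
As a sketch of the contract: a minimal custom BaseRelation only has to provide schema and sqlContext; mixing in a scan contract such as TableScan makes the relation actually readable. Names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// schema and sqlContext are the two abstract members of BaseRelation;
// TableScan adds a full-scan buildScan so the relation can be read.
class OneColumnRelation(val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(StructField("id", LongType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.range(0, 5).map(Row(_))
}
```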

Table 2. BaseRelations
BaseRelation Description

ConsoleRelation

Used in Spark Structured Streaming

HadoopFsRelation

JDBCRelation

KafkaRelation

Datasets with records from Apache Kafka

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method

needConversion flag is enabled (true) by default.

Note
It is recommended to leave needConversion enabled for data sources outside Spark SQL.
Note
needConversion is used exclusively when DataSourceStrategy execution planning strategy is executed (and does the RDD conversion from RDD[Row] to RDD[InternalRow]).

Finding Unhandled Filter Predicates — unhandledFilters Method

unhandledFilters returns Filter predicates that the data source does not support (handle) natively.

Note
unhandledFilters returns the input filters by default, as it is considered safe to double-evaluate filters regardless of whether they could be handled natively or not.
Note
unhandledFilters is used exclusively when DataSourceStrategy execution planning strategy is requested to selectFilters.

Requesting Estimated Size — sizeInBytes Method

sizeInBytes is the estimated size of a relation (used in query planning).

Note
sizeInBytes defaults to the spark.sql.defaultSizeInBytes internal property (i.e. effectively infinite).
Note
sizeInBytes is used exclusively when LogicalRelation is requested to computeStats (and the statistics are not available in the CatalogTable).
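
Putting the three optional methods together on the OneColumnRelation sketch from the BaseRelation section above (the handled-filter choice and the size estimate are purely illustrative):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{EqualTo, Filter}

class TunedRelation(sqlContext: SQLContext) extends OneColumnRelation(sqlContext) {

  // Keep the default (true) for sources outside Spark SQL, as recommended above.
  override def needConversion: Boolean = true

  // Report the filters this relation cannot evaluate natively; Spark
  // re-evaluates those on top of the scan. Here only EqualTo on "id" is handled.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("id", _) => true
      case _                => false
    }

  // Estimated size in bytes for query planning (e.g. broadcast-join decisions).
  override def sizeInBytes: Long = 5 * 8L // 5 rows of 8-byte longs (illustrative)
}
```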

SchemaRelationProvider Contract — Relation Providers With Mandatory User-Defined Schema

The requirement of specifying a user-defined schema is enforced when DataSource is requested for a BaseRelation for a given data source format. If no schema is specified, DataSource throws an AnalysisException.

Table 1. SchemaRelationProvider Contract
Method Description

createRelation

Creates a BaseRelation for the user-defined schema

Used exclusively when DataSource is requested for a BaseRelation for a given data source format

Note
There are no known direct implementations of the SchemaRelationProvider Contract in Spark SQL.
Tip
Use RelationProvider for data source providers with schema inference.
Tip
Use both SchemaRelationProvider and RelationProvider if a data source should support both schema inference and user-defined schemas.
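
A minimal sketch of a provider that requires the user-defined schema (class and package names are hypothetical; DefaultSource is the conventional name Spark resolves inside a data source package):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends SchemaRelationProvider {

  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String],
      userSchema: StructType): BaseRelation =
    // The relation is built from the schema the user passed in;
    // no schema inference happens here.
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = userSchema
    }
}
```

With such a provider, loading requires .schema(...) on the DataFrameReader, e.g. spark.read.format("com.example.datasource").schema(mySchema).load() (format name hypothetical); without it, DataSource fails with the AnalysisException mentioned above.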

RelationProvider Contract — Relation Providers With Schema Inference

Note
Schema inference is also called schema discovery.

A user-defined schema must either be omitted or match the schema of the relation. The requirement is enforced when DataSource is requested for a BaseRelation for a given data source format; if a schema is specified and does not match, DataSource throws an AnalysisException.

Table 1. RelationProvider Contract
Method Description

createRelation

Creates a BaseRelation for loading data from an external data source

Used exclusively when DataSource is requested for a BaseRelation for a given data source format (and there is no user-defined schema, or the user-defined schema matches the schema of the BaseRelation)

Table 2. RelationProviders
RelationProvider Description

JdbcRelationProvider

KafkaSourceProvider

Tip
Use SchemaRelationProvider for relation providers that require a user-defined schema.
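
A minimal sketch of a provider that infers the schema itself. Here the "inference" merely derives column names from a hypothetical columns option; a real source would inspect the underlying data (e.g. sample files) instead.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSource extends RelationProvider {

  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // Schema inference stand-in: derive the schema from a "columns" option.
    val inferred = StructType(
      parameters("columns").split(",").map(name => StructField(name.trim, StringType)))
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = inferred
    }
  }
}
```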

DataSourceRegister Contract — Registering Data Source Format

DataSourceRegister is a contract to register a DataSource provider under a shortName alias (so it can be looked up by the alias instead of its fully-qualified class name).

Data Source Format Discovery — Registering Data Source By Short Name (Alias)

Any DataSourceRegister has to register itself in the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file so that Java’s ServiceLoader can discover it. When resolving a data source format, DataSource uses ServiceLoader to load all DataSourceRegister provider classes on the CLASSPATH and matches the requested alias against each provider’s shortName.
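
A sketch of both halves of the registration (names hypothetical): the provider overrides shortName, and a service file on the CLASSPATH points ServiceLoader at the class.

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// Users can now write spark.read.format("myformat") instead of
// spark.read.format("com.example.myformat.DefaultSource").
class DefaultSource extends DataSourceRegister /* plus RelationProvider etc. */ {
  override def shortName(): String = "myformat"
}
```

```
# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
com.example.myformat.DefaultSource
```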

CreatableRelationProvider Contract — Data Sources That Write Rows Per Save Mode

Table 1. CreatableRelationProvider Contract
Method Description

createRelation

Saving the rows of a structured query (a DataFrame) to an external data source

createRelation saves the rows of a structured query (a DataFrame) to a target relation per the given save mode and parameters. In the end, createRelation creates a BaseRelation that represents the relation just written.

The save mode specifies what happens when the destination already exists and can be one of the following:

  • Append

  • ErrorIfExists

  • Ignore

  • Overwrite
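
How a provider typically branches on the save mode, as a minimal sketch mirroring the list above (the path option and the storage helpers are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends CreatableRelationProvider {

  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val path = parameters("path") // hypothetical destination option
    val exists = destinationExists(path)
    mode match {
      case SaveMode.ErrorIfExists if exists =>
        throw new IllegalStateException(s"Destination $path already exists")
      case SaveMode.Ignore if exists =>
        () // leave the existing data untouched
      case SaveMode.Overwrite =>
        deleteDestination(path)
        writeRows(path, data)
      case _ => // Append, or the destination does not exist yet
        writeRows(path, data)
    }
    // Represent the relation just written (or left as-is).
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = data.schema
    }
  }

  // Hypothetical storage helpers, left unimplemented in this sketch.
  private def destinationExists(path: String): Boolean = ???
  private def deleteDestination(path: String): Unit = ???
  private def writeRows(path: String, data: DataFrame): Unit = ???
}
```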

CreatableRelationProvider is used when DataSource is requested to write the result of a structured query to an external data source per save mode (e.g. when DataFrameWriter is requested to save).

Table 2. CreatableRelationProviders
CreatableRelationProvider Description

JdbcRelationProvider

Data source provider for JDBC data source

KafkaSourceProvider

Data source provider for Kafka data source
