
JdbcDialect

JdbcDialect is the base of JDBC dialects that handle a specific JDBC URL (and the type-related conversions necessary to properly load data from a table into a DataFrame).
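For illustration, a custom dialect only has to implement canHandle and may override the type-related and quoting methods. A minimal sketch (the SQLiteDialect name and the jdbc:sqlite URL prefix are assumptions for the example, not part of Spark):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for an assumed "jdbc:sqlite" JDBC URL prefix.
object SQLiteDialect extends JdbcDialect {
  // canHandle decides whether this dialect applies to a given JDBC URL.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlite")

  // Quote column identifiers the way the database expects them.
  override def quoteIdentifier(colName: String): String = s""""$colName""""
}

// Register the dialect so it is picked up for matching URLs.
JdbcDialects.registerDialect(SQLiteDialect)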

Table 1. (Subset of) JdbcDialect Contract
Property Description

canHandle

Used when…​FIXME

Table 2. JdbcDialects
JdbcDialect Description

AggregatedDialect

DB2Dialect

DerbyDialect

MsSqlServerDialect

MySQLDialect

NoopDialect

OracleDialect

PostgresDialect

TeradataDialect

getCatalystType Method

getCatalystType…​FIXME

Note
getCatalystType is used when…​FIXME

getJDBCType Method

getJDBCType…​FIXME

Note
getJDBCType is used when…​FIXME

quoteIdentifier Method

quoteIdentifier…​FIXME

Note
quoteIdentifier is used when…​FIXME

getTableExistsQuery Method

getTableExistsQuery…​FIXME

Note
getTableExistsQuery is used when…​FIXME

getSchemaQuery Method

getSchemaQuery…​FIXME

Note
getSchemaQuery is used when…​FIXME

getTruncateQuery Method

getTruncateQuery…​FIXME

Note
getTruncateQuery is used when…​FIXME

beforeFetch Method

beforeFetch…​FIXME

Note
beforeFetch is used when…​FIXME

escapeSql Internal Method

escapeSql…​FIXME

Note
escapeSql is used when…​FIXME

compileValue Method

compileValue…​FIXME

Note
compileValue is used when…​FIXME

isCascadingTruncateTable Method

isCascadingTruncateTable…​FIXME

Note
isCascadingTruncateTable is used when…​FIXME

JDBCRDD

JDBCRDD is an RDD of internal binary rows that represents a structured query over a table in a database accessed via JDBC.

Note
JDBCRDD represents a “SELECT requiredColumns FROM table” query.

JDBCRDD is created exclusively when JDBCRDD is requested to scanTable (when JDBCRelation is requested to build a scan).

Table 1. JDBCRDD’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

columnList

Column names

Used when…​FIXME

filterWhereClause

Filters as a SQL WHERE clause

Used when…​FIXME

Computing Partition (in TaskContext) — compute Method

Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute…​FIXME

resolveTable Method

resolveTable…​FIXME

Note
resolveTable is used exclusively when JDBCRelation is requested for the schema.

Creating RDD for Distributed Data Scan — scanTable Object Method

scanTable takes the url option.

scanTable finds the corresponding JDBC dialect (per the url option) and requests it to quote the column identifiers in the input requiredColumns.

scanTable uses the JdbcUtils object to createConnectionFactory and prune columns from the input schema to include the input requiredColumns only.

In the end, scanTable creates a new JDBCRDD.

Note
scanTable is used exclusively when JDBCRelation is requested to build a distributed data scan with column pruning and filter pushdown.

Creating JDBCRDD Instance

JDBCRDD takes the following when created:

  • SparkContext

  • Function to create a Connection (() ⇒ Connection)

  • Schema (StructType)

  • Array of column names

  • Array of Filter predicates

  • Array of Spark Core’s Partitions

  • Connection URL

  • JDBCOptions

JDBCRDD initializes the internal registries and counters.

getPartitions Method

Note
getPartitions is part of Spark Core’s RDD Contract to…​FIXME

getPartitions simply returns the partitions (this JDBCRDD was created with).

pruneSchema Internal Method

pruneSchema…​FIXME

Note
pruneSchema is used when…​FIXME

Converting Filter Predicate to SQL Expression — compileFilter Object Method

compileFilter…​FIXME

Note

compileFilter is used when:

JDBCRelation — Relation with Inserting or Overwriting Data, Column Pruning and Filter Pushdown

As a BaseRelation, JDBCRelation defines the schema of tuples (data) and the SQLContext.

As an InsertableRelation, JDBCRelation supports inserting or overwriting data.

JDBCRelation is created when:

When requested for a human-friendly text representation, JDBCRelation requests the JDBCOptions for the name of the table and the number of partitions (if defined).

Figure 1. JDBCRelation in web UI (Details for Query)

JDBCRelation uses the SparkSession to return a SQLContext.

JDBCRelation turns the needConversion flag off (to announce that buildScan returns an RDD[InternalRow] already and DataSourceStrategy execution planning strategy does not have to do the RDD conversion).

Creating JDBCRelation Instance

JDBCRelation takes the following when created:

Finding Unhandled Filter Predicates — unhandledFilters Method

Note
unhandledFilters is part of BaseRelation Contract to find unhandled Filter predicates.

unhandledFilters returns the Filter predicates in the input filters that could not be converted to a SQL expression (and are therefore unhandled by the JDBC data source natively).

Schema of Tuples (Data) — schema Property

Note
schema is part of BaseRelation Contract to return the schema of the tuples in a relation.

schema uses JDBCRDD to resolveTable given the JDBCOptions (that simply returns the Catalyst schema of the table, also known as the default table schema).

If customSchema JDBC option was defined, schema uses JdbcUtils to replace the data types in the default table schema.
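For example, the customSchema option can be set at load time. A minimal sketch (the URL and table are placeholders, and an active SparkSession named spark is assumed):

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/demo")         // placeholder JDBC URL
  .option("dbtable", "people")                                // placeholder table name
  .option("customSchema", "id DECIMAL(38, 0), name STRING")   // overrides the inferred data types
  .load()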

Inserting or Overwriting Data to JDBC Table — insert Method

Note
insert is part of InsertableRelation Contract that inserts or overwrites data in a relation.

insert simply requests the input DataFrame for a DataFrameWriter that in turn is requested to save the data to a table using the JDBC data source (itself!) with the url, table and all options.

insert also requests the DataFrameWriter to set the save mode as Overwrite or Append per the input overwrite flag.

Note
insert uses a “trick” to reuse the code that is responsible for saving data to a JDBC table.
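The shape of that call can be sketched as follows (a paraphrase of the behaviour described above, not the literal source; the parameters stand in for the values JDBCRelation takes from its JDBCOptions):

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// Reuse DataFrameWriter.jdbc so the jdbc data source itself does the saving,
// with the save mode chosen per the overwrite flag.
def insert(data: DataFrame, overwrite: Boolean,
           url: String, table: String, properties: Properties): Unit = {
  data.write
    .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
    .jdbc(url, table, properties)
}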

Building Distributed Data Scan with Column Pruning and Filter Pushdown — buildScan Method

Note
buildScan is part of PrunedFilteredScan Contract to build a distributed data scan (as an RDD[Row]) with support for column pruning and filter pushdown.

buildScan uses the JDBCRDD object to create an RDD[Row] for a distributed data scan.

JdbcRelationProvider

JdbcRelationProvider is a DataSourceRegister and registers itself to handle the jdbc data source format.

Note
JdbcRelationProvider uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file for the registration which is available in the source code of Apache Spark.

JdbcRelationProvider is a RelationProvider and a CreatableRelationProvider.

JdbcRelationProvider is used when DataFrameReader is requested to load data from jdbc data source.
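A typical way this kicks in is the jdbc format of DataFrameReader (a sketch; the URL, table and credentials are placeholders, and an active SparkSession named spark is assumed):

val jdbcDF = spark.read
  .format("jdbc")                                      // resolved to JdbcRelationProvider
  .option("url", "jdbc:postgresql://localhost/demo")   // placeholder JDBC URL
  .option("dbtable", "public.people")                  // placeholder table
  .option("user", "demo")                              // placeholder credentials
  .option("password", "demo")
  .load()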

Loading Data from Table Using JDBC — createRelation Method (from RelationProvider)

Note
createRelation is part of RelationProvider Contract to create a BaseRelation for reading.

createRelation creates a JDBCPartitioningInfo (using JDBCOptions and the input parameters that correspond to the Options for JDBC Data Source).

Note
createRelation uses partitionColumn, lowerBound, upperBound and numPartitions.

In the end, createRelation creates a JDBCRelation with column partitions (and JDBCOptions).

Writing Rows of Structured Query (DataFrame) to Table Using JDBC — createRelation Method (from CreatableRelationProvider)

Note
createRelation is part of the CreatableRelationProvider Contract to write the rows of a structured query (a DataFrame) to an external data source.

Internally, createRelation creates a JDBCOptions (from the input parameters).

createRelation reads caseSensitiveAnalysis (using the input sqlContext).

createRelation checks whether the table (given dbtable and url options in the input parameters) exists.

Note
createRelation uses a database-specific JdbcDialect to check whether a table exists.

createRelation branches off per whether the table already exists in the database or not.

If the table does not exist, createRelation creates the table (by executing CREATE TABLE with createTableColumnTypes and createTableOptions options from the input parameters) and writes the rows to the database in a single transaction.

If however the table does exist, createRelation branches off per SaveMode (see the following createRelation and SaveMode).

Table 1. createRelation and SaveMode
Name Description

Append

Saves the records to the table.

ErrorIfExists

Throws an AnalysisException with the message:

Ignore

Does nothing.

Overwrite

Truncates or drops the table

Note
createRelation truncates the table only when truncate JDBC option is enabled and isCascadingTruncateTable is disabled.

In the end, createRelation closes the JDBC connection to the database and creates a JDBCRelation.
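The save modes above correspond to the mode chosen at write time, e.g. (a sketch with placeholder connection settings and an assumed SparkSession named spark):

import java.util.Properties

val props = new Properties()
props.put("user", "demo")        // placeholder credentials
props.put("password", "demo")

val df = spark.range(5).toDF("id")

// SaveMode.Overwrite: truncate the table (when the truncate option is enabled and
// isCascadingTruncateTable is disabled) or drop and re-create it before writing.
df.write
  .mode("overwrite")
  .option("truncate", "true")
  .jdbc("jdbc:postgresql://localhost/demo", "public.ids", props)   // placeholder URL and table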

JDBCOptions — JDBC Data Source Options

JDBCOptions represents the options of the JDBC data source.

Table 1. Options for JDBC Data Source
Option / Key Default Value Description

batchsize

1000

The minimum value is 1

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table through JdbcUtils helper object and its saveTable.

createTableColumnTypes

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table through JdbcUtils helper object and its createTable.

createTableOptions

Empty string

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table through JdbcUtils helper object and its createTable.

customSchema

(undefined)

Specifies the custom data types of the read schema (that is used at load time)

customSchema is a comma-separated list of field definitions with column names and their data types in a canonical SQL representation, e.g. id DECIMAL(38, 0), name STRING.

customSchema defines the data types of columns that override the data types inferred from the table schema.

Used exclusively when JDBCRelation is requested for the schema.

dbtable

(required)

Used when:

driver

(recommended) Class name of the JDBC driver to use

Used exclusively when JDBCOptions is created. When the driver option is defined, the JDBC driver class will get registered with Java’s java.sql.DriverManager.

Note
driver takes precedence over the class name of the driver for the url option.

Once the JDBC driver class has been registered, it is used exclusively when the JdbcUtils helper object is requested to createConnectionFactory.

fetchsize

0

Hint to the JDBC driver as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by a Statement

The minimum value is 0 (which tells the JDBC driver to do the estimates)

Used exclusively when JDBCRDD is requested to compute a partition.

isolationLevel

READ_UNCOMMITTED

One of the following:

  • NONE

  • READ_UNCOMMITTED

  • READ_COMMITTED

  • REPEATABLE_READ

  • SERIALIZABLE

Used exclusively when JdbcUtils is requested to saveTable.

lowerBound

Lower bound of partition column

Used exclusively when JdbcRelationProvider is requested to create a BaseRelation for reading

numPartitions

Number of partitions to use for loading or saving data

Used when:

partitionColumn

Name of the column used to partition the dataset (using a JDBCPartitioningInfo).

Used exclusively when JdbcRelationProvider is requested to create a BaseRelation for reading (with proper JDBCPartitions with WHERE clause)

When defined, the lowerBound, upperBound and numPartitions options are also required.

When undefined, lowerBound and upperBound have to be undefined.

truncate

false

(used only for writing) Enables table truncation

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table

sessionInitStatement

A generic SQL statement (or PL/SQL block) executed before reading a table/query

Used exclusively when JDBCRDD is requested to compute a partition.

upperBound

Upper bound of the partition column

Used exclusively when JdbcRelationProvider is requested to create a BaseRelation for reading

url

(required) A JDBC URL to use to connect to a database

Note
The options are case-insensitive.
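For instance, partitionColumn, lowerBound, upperBound and numPartitions work together at load time (a sketch; the URL, table and bounds are placeholders, and an active SparkSession named spark is assumed):

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/demo")   // placeholder JDBC URL
  .option("dbtable", "public.orders")                  // placeholder table
  .option("partitionColumn", "id")                     // requires the three options below
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "8")
  .load()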

JDBCOptions is created when:

Creating JDBCOptions Instance

JDBCOptions takes the following when created:

  • JDBC URL

  • Name of the table

  • Case-insensitive configuration parameters (i.e. Map[String, String])

The input URL and table are set as the current url and dbtable options (overriding the values in the input parameters if defined).

Converting Parameters (Options) to Java Properties — asProperties Property

asProperties…​FIXME

Note

asProperties is used when:

asConnectionProperties Property

asConnectionProperties…​FIXME

Note
asConnectionProperties is used exclusively when JdbcUtils is requested to createConnectionFactory

JDBC Data Source

Spark SQL supports loading data from tables using JDBC.

JDBC

The JDBC API is the Java™ SE standard for database-independent connectivity between the Java™ programming language and a wide range of databases: SQL or NoSQL databases and tabular data sources like spreadsheets or flat files.

Read more on the JDBC API in JDBC Overview and in the official Java SE 8 documentation in Java JDBC API.

As a Spark developer, you use DataFrameReader.jdbc to load data from an external table using JDBC.
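For example (a sketch; the URL, table and properties are placeholders for your own database, and an active SparkSession named spark is assumed):

import java.util.Properties

val url = "jdbc:postgresql://localhost/demo"   // placeholder JDBC URL
val props = new Properties()

// Variant 1: the dedicated DataFrameReader.jdbc operator
val df1 = spark.read.jdbc(url, "public.people", props)

// Variant 2: the generic format("jdbc") with options
val df2 = spark.read.format("jdbc").option("url", url).option("dbtable", "public.people").load()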

These one-liners create a DataFrame that represents the distributed process of loading data from a database and a table (with additional properties).

AvroDataToCatalyst Unary Expression

AvroDataToCatalyst is a unary expression that represents the from_avro function in a structured query.
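from_avro comes with the spark-avro module; a minimal sketch (the org.apache.spark.sql.avro import location corresponds to Spark 2.4, and the DataFrame df with a binary value column is an assumption for the example):

import org.apache.spark.sql.avro.from_avro
import org.apache.spark.sql.functions.col

// Avro schema (in JSON format) that describes the binary payload in the value column.
val personSchema =
  """{"type": "record", "name": "person",
    | "fields": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}]}""".stripMargin

// from_avro creates a Column backed by AvroDataToCatalyst.
val decoded = df.select(from_avro(col("value"), personSchema) as "person")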

AvroDataToCatalyst takes the following when created:

Generating Java Source Code (ExprCode) For Code-Generated Expression Evaluation — doGenCode Method

Note
doGenCode is part of Expression Contract to generate a Java source code (ExprCode) for code-generated expression evaluation.

doGenCode requests the CodegenContext to generate code to reference this AvroDataToCatalyst instance.

In the end, doGenCode uses defineCodeGen with the function f that uses nullSafeEval.

nullSafeEval Method

Note
nullSafeEval is part of the UnaryExpression Contract to…​FIXME.

nullSafeEval…​FIXME

CatalystDataToAvro Unary Expression

CatalystDataToAvro is a unary expression that represents the to_avro function in a structured query.

CatalystDataToAvro takes a single Catalyst expression when created.
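to_avro comes from the same spark-avro module; a minimal sketch (the import location and the DataFrame df with id and name columns are assumptions for the example):

import org.apache.spark.sql.avro.to_avro
import org.apache.spark.sql.functions.{col, struct}

// to_avro creates a Column backed by CatalystDataToAvro.
val encoded = df.select(to_avro(struct(col("id"), col("name"))) as "value")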

Generating Java Source Code (ExprCode) For Code-Generated Expression Evaluation — doGenCode Method

Note
doGenCode is part of Expression Contract to generate a Java source code (ExprCode) for code-generated expression evaluation.

doGenCode requests the CodegenContext to generate code to reference this CatalystDataToAvro instance.

In the end, doGenCode uses defineCodeGen with the function f that uses nullSafeEval.

nullSafeEval Method

Note
nullSafeEval is part of the UnaryExpression Contract to…​FIXME.

nullSafeEval…​FIXME

AvroOptions — Avro Data Source Options

AvroOptions represents the options of the Avro data source.

Table 1. Options for Avro Data Source
Option / Key Default Value Description

avroSchema

(undefined)

Avro schema in JSON format

compression

(undefined)

Specifies the compression codec to use when writing Avro data to disk

Note
If the option is not defined explicitly, Avro data source uses spark.sql.avro.compression.codec configuration property.

ignoreExtension

false

Controls whether Avro data source should read all Avro files regardless of their extension (true) or not (false)

By default, Avro data source reads only files with .avro file extension.

Note
If the option is not defined explicitly, Avro data source uses avro.mapred.ignore.inputs.without.extension Hadoop runtime property.

recordName

topLevelRecord

Top-level record name when writing Avro data to disk

Consult Apache Avro™ 1.8.2 Specification

recordNamespace

(empty)

Record namespace when writing Avro data to disk

Consult Apache Avro™ 1.8.2 Specification

Note
The options are case-insensitive.
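For example (a sketch; the paths are placeholders, format("avro") assumes the spark-avro module is on the classpath, and an active SparkSession named spark is assumed):

// Read: treat all files as Avro regardless of their extension.
val events = spark.read
  .format("avro")
  .option("ignoreExtension", "true")
  .load("/tmp/events")                   // placeholder path

// Write: choose the compression codec and the top-level record name.
events.write
  .format("avro")
  .option("compression", "snappy")
  .option("recordName", "event")
  .save("/tmp/events_out")               // placeholder path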

AvroOptions is created when AvroFileFormat is requested to inferSchema, prepareWrite and buildReader.

Creating AvroOptions Instance

AvroOptions takes the following when created:

  • Case-insensitive configuration parameters (i.e. Map[String, String])

  • Hadoop Configuration

AvroFileFormat — FileFormat For Avro-Encoded Files

AvroFileFormat is a FileFormat for Apache Avro, i.e. a data source format that can read and write Avro-encoded data in files.

AvroFileFormat is a DataSourceRegister and registers itself as the avro data source.

AvroFileFormat is splitable, i.e. FIXME

Building Partitioned Data Reader — buildReader Method

Note
buildReader is part of the FileFormat Contract to build a PartitionedFile reader.

buildReader…​FIXME

Inferring Schema — inferSchema Method

Note
inferSchema is part of the FileFormat Contract to infer (return) the schema of the given files.

inferSchema…​FIXME

Preparing Write Job — prepareWrite Method

Note
prepareWrite is part of the FileFormat Contract to prepare a write job.

prepareWrite…​FIXME
