
JdbcDialect

JdbcDialect is the base of JDBC dialects that handle a specific JDBC URL (and the type-related conversions necessary to properly load data from a table into a DataFrame).
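For illustration, a custom dialect only has to implement canHandle and may override the type-related and quoting methods. A minimal sketch (the SQLiteDialect name and the jdbc:sqlite URL prefix are assumptions for the example, not part of Spark):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for an assumed "jdbc:sqlite" JDBC URL prefix.
object SQLiteDialect extends JdbcDialect {
  // canHandle decides whether this dialect applies to a given JDBC URL.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlite")

  // Quote column identifiers the way the database expects them.
  override def quoteIdentifier(colName: String): String = s""""$colName""""
}

// Register the dialect so it is picked up for matching URLs.
JdbcDialects.registerDialect(SQLiteDialect)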

Table 1. (Subset of) JdbcDialect Contract
Property Description

canHandle

Used when…​FIXME

Table 2. JdbcDialects
JdbcDialect Description

AggregatedDialect

DB2Dialect

DerbyDialect

MsSqlServerDialect

MySQLDialect

NoopDialect

OracleDialect

PostgresDialect

TeradataDialect

getCatalystType Method

getCatalystType…​FIXME

Note
getCatalystType is used when…​FIXME

getJDBCType Method

getJDBCType…​FIXME

Note
getJDBCType is used when…​FIXME

quoteIdentifier Method

quoteIdentifier…​FIXME

Note
quoteIdentifier is used when…​FIXME

getTableExistsQuery Method

getTableExistsQuery…​FIXME

Note
getTableExistsQuery is used when…​FIXME

getSchemaQuery Method

getSchemaQuery…​FIXME

Note
getSchemaQuery is used when…​FIXME

getTruncateQuery Method

getTruncateQuery…​FIXME

Note
getTruncateQuery is used when…​FIXME

beforeFetch Method

beforeFetch…​FIXME

Note
beforeFetch is used when…​FIXME

escapeSql Internal Method

escapeSql…​FIXME

Note
escapeSql is used when…​FIXME

compileValue Method

compileValue…​FIXME

Note
compileValue is used when…​FIXME

isCascadingTruncateTable Method

isCascadingTruncateTable…​FIXME

Note
isCascadingTruncateTable is used when…​FIXME

JDBCRDD

JDBCRDD is an RDD of internal binary rows that represents a structured query over a table in a database accessed via JDBC.

Note
JDBCRDD represents a “SELECT requiredColumns FROM table” query.

JDBCRDD is created exclusively when JDBCRDD is requested to scanTable (when JDBCRelation is requested to build a scan).

Table 1. JDBCRDD’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

columnList

Column names

Used when…​FIXME

filterWhereClause

Filters as a SQL WHERE clause

Used when…​FIXME

Computing Partition (in TaskContext) — compute Method

Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute…​FIXME

resolveTable Method

resolveTable…​FIXME

Note
resolveTable is used exclusively when JDBCRelation is requested for the schema.

Creating RDD for Distributed Data Scan — scanTable Object Method

scanTable takes the url option.

scanTable finds the corresponding JDBC dialect (per the url option) and requests it to quote the column identifiers in the input requiredColumns.

scanTable uses the JdbcUtils object to createConnectionFactory and prune columns from the input schema to include the input requiredColumns only.

In the end, scanTable creates a new JDBCRDD.

Note
scanTable is used exclusively when JDBCRelation is requested to build a distributed data scan with column pruning and filter pushdown.

Creating JDBCRDD Instance

JDBCRDD takes the following when created:

  • SparkContext

  • Function to create a Connection (() ⇒ Connection)

  • Schema (StructType)

  • Array of column names

  • Array of Filter predicates

  • Array of Spark Core’s Partitions

  • Connection URL

  • JDBCOptions

JDBCRDD initializes the internal registries and counters.

getPartitions Method

Note
getPartitions is part of Spark Core’s RDD Contract to…​FIXME

getPartitions simply returns the partitions (this JDBCRDD was created with).

pruneSchema Internal Method

pruneSchema…​FIXME

Note
pruneSchema is used when…​FIXME

Converting Filter Predicate to SQL Expression — compileFilter Object Method

compileFilter…​FIXME

Note

compileFilter is used when:

JDBCRelation — Relation with Inserting or Overwriting Data, Column Pruning and Filter Pushdown

As a BaseRelation, JDBCRelation defines the schema of tuples (data) and the SQLContext.

As an InsertableRelation, JDBCRelation supports inserting or overwriting data.

JDBCRelation is created when:

When requested for a human-friendly text representation, JDBCRelation requests the JDBCOptions for the name of the table and the number of partitions (if defined).

Figure 1. JDBCRelation in web UI (Details for Query)

JDBCRelation uses the SparkSession to return a SQLContext.

JDBCRelation turns the needConversion flag off (to announce that buildScan returns an RDD[InternalRow] already and DataSourceStrategy execution planning strategy does not have to do the RDD conversion).

Creating JDBCRelation Instance

JDBCRelation takes the following when created:

Finding Unhandled Filter Predicates — unhandledFilters Method

Note
unhandledFilters is part of BaseRelation Contract to find unhandled Filter predicates.

unhandledFilters returns the Filter predicates in the input filters that could not be converted to a SQL expression (and are therefore unhandled by the JDBC data source natively).

Schema of Tuples (Data) — schema Property

Note
schema is part of BaseRelation Contract to return the schema of the tuples in a relation.

schema uses JDBCRDD to resolveTable given the JDBCOptions (that simply returns the Catalyst schema of the table, also known as the default table schema).

If customSchema JDBC option was defined, schema uses JdbcUtils to replace the data types in the default table schema.
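For example, the customSchema option can be set at load time. A minimal sketch (the URL and table are placeholders, and an active SparkSession named spark is assumed):

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/demo")         // placeholder JDBC URL
  .option("dbtable", "people")                                // placeholder table name
  .option("customSchema", "id DECIMAL(38, 0), name STRING")   // overrides the inferred data types
  .load()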

Inserting or Overwriting Data to JDBC Table — insert Method

Note
insert is part of InsertableRelation Contract that inserts or overwrites data in a relation.

insert simply requests the input DataFrame for a DataFrameWriter that in turn is requested to save the data to a table using the JDBC data source (itself!) with the url, table and all options.

insert also requests the DataFrameWriter to set the save mode as Overwrite or Append per the input overwrite flag.

Note
insert uses a “trick” to reuse the code that is responsible for saving data to a JDBC table.
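The shape of that call can be sketched as follows (a paraphrase of the behaviour described above, not the literal source; the parameters stand in for the values JDBCRelation takes from its JDBCOptions):

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// Reuse DataFrameWriter.jdbc so the jdbc data source itself does the saving,
// with the save mode chosen per the overwrite flag.
def insert(data: DataFrame, overwrite: Boolean,
           url: String, table: String, properties: Properties): Unit = {
  data.write
    .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
    .jdbc(url, table, properties)
}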

Building Distributed Data Scan with Column Pruning and Filter Pushdown — buildScan Method

Note
buildScan is part of PrunedFilteredScan Contract to build a distributed data scan (as an RDD[Row]) with support for column pruning and filter pushdown.

buildScan uses the JDBCRDD object to create an RDD[Row] for a distributed data scan.

JdbcRelationProvider

JdbcRelationProvider is a DataSourceRegister and registers itself to handle the jdbc data source format.

Note
JdbcRelationProvider uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file for the registration which is available in the source code of Apache Spark.

JdbcRelationProvider is a RelationProvider and a CreatableRelationProvider.

JdbcRelationProvider is used when DataFrameReader is requested to load data from jdbc data source.
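A typical way this kicks in is the jdbc format of DataFrameReader (a sketch; the URL, table and credentials are placeholders, and an active SparkSession named spark is assumed):

val jdbcDF = spark.read
  .format("jdbc")                                      // resolved to JdbcRelationProvider
  .option("url", "jdbc:postgresql://localhost/demo")   // placeholder JDBC URL
  .option("dbtable", "public.people")                  // placeholder table
  .option("user", "demo")                              // placeholder credentials
  .option("password", "demo")
  .load()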

Loading Data from Table Using JDBC — createRelation Method (from RelationProvider)

Note
createRelation is part of RelationProvider Contract to create a BaseRelation for reading.

createRelation creates a JDBCPartitioningInfo (using JDBCOptions and the input parameters that correspond to the Options for JDBC Data Source).

Note
createRelation uses partitionColumn, lowerBound, upperBound and numPartitions.

In the end, createRelation creates a JDBCRelation with column partitions (and JDBCOptions).

Writing Rows of Structured Query (DataFrame) to Table Using JDBC — createRelation Method (from CreatableRelationProvider)

Note
createRelation is part of the CreatableRelationProvider Contract to write the rows of a structured query (a DataFrame) to an external data source.

Internally, createRelation creates a JDBCOptions (from the input parameters).

createRelation reads caseSensitiveAnalysis (using the input sqlContext).

createRelation checks whether the table (given dbtable and url options in the input parameters) exists.

Note
createRelation uses a database-specific JdbcDialect to check whether a table exists.

createRelation branches off per whether the table already exists in the database or not.

If the table does not exist, createRelation creates the table (by executing CREATE TABLE with createTableColumnTypes and createTableOptions options from the input parameters) and writes the rows to the database in a single transaction.

If however the table does exist, createRelation branches off per SaveMode (see the following createRelation and SaveMode).

Table 1. createRelation and SaveMode
Name Description

Append

Saves the records to the table.

ErrorIfExists

Throws an AnalysisException with the message:

Ignore

Does nothing.

Overwrite

Truncates or drops the table

Note
createRelation truncates the table only when truncate JDBC option is enabled and isCascadingTruncateTable is disabled.

In the end, createRelation closes the JDBC connection to the database and creates a JDBCRelation.
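The save modes above correspond to the mode chosen at write time, e.g. (a sketch with placeholder connection settings and an assumed SparkSession named spark):

import java.util.Properties

val props = new Properties()
props.put("user", "demo")        // placeholder credentials
props.put("password", "demo")

val df = spark.range(5).toDF("id")

// SaveMode.Overwrite: truncate the table (when the truncate option is enabled and
// isCascadingTruncateTable is disabled) or drop and re-create it before writing.
df.write
  .mode("overwrite")
  .option("truncate", "true")
  .jdbc("jdbc:postgresql://localhost/demo", "public.ids", props)   // placeholder URL and table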

JDBCOptions — JDBC Data Source Options

JDBCOptions represents the options of the JDBC data source.

Table 1. Options for JDBC Data Source
Option / Key Default Value Description

batchsize

1000

The minimum value is 1

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table through JdbcUtils helper object and its saveTable.

createTableColumnTypes

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table through JdbcUtils helper object and its createTable.

createTableOptions

Empty string

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table through JdbcUtils helper object and its createTable.

customSchema

(undefined)

Specifies the custom data types of the read schema (that is used at load time)

customSchema is a comma-separated list of field definitions with column names and their data types in a canonical SQL representation, e.g. id DECIMAL(38, 0), name STRING.

customSchema defines the data types of columns that override the data types inferred from the table schema.

Used exclusively when JDBCRelation is requested for the schema.

dbtable

(required)

Used when:

driver

(recommended) Class name of the JDBC driver to use

Used exclusively when JDBCOptions is created. When the driver option is defined, the JDBC driver class will get registered with Java’s java.sql.DriverManager.

Note
driver takes precedence over the class name of the driver for the url option.

Once the JDBC driver class has been registered, it is used exclusively when the JdbcUtils helper object is requested to createConnectionFactory.

fetchsize

0

Hint to the JDBC driver as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by a Statement

The minimum value is 0 (which tells the JDBC driver to do the estimates)

Used exclusively when JDBCRDD is requested to compute a partition.

isolationLevel

READ_UNCOMMITTED

One of the following:

  • NONE

  • READ_UNCOMMITTED

  • READ_COMMITTED

  • REPEATABLE_READ

  • SERIALIZABLE

Used exclusively when JdbcUtils is requested to saveTable.

lowerBound

Lower bound of partition column

Used exclusively when JdbcRelationProvider is requested to create a BaseRelation for reading

numPartitions

Number of partitions to use for loading or saving data

Used when:

partitionColumn

Name of the column used to partition the dataset (using a JDBCPartitioningInfo).

Used exclusively when JdbcRelationProvider is requested to create a BaseRelation for reading (with proper JDBCPartitions with WHERE clause)

When defined, the lowerBound, upperBound and numPartitions options are also required.

When undefined, lowerBound and upperBound have to be undefined.

truncate

false

(used only for writing) Enables table truncation

Used exclusively when JdbcRelationProvider is requested to write the rows of a structured query (a DataFrame) to a table

sessionInitStatement

A generic SQL statement (or PL/SQL block) executed before reading a table/query

Used exclusively when JDBCRDD is requested to compute a partition.

upperBound

Upper bound of the partition column

Used exclusively when JdbcRelationProvider is requested to create a BaseRelation for reading

url

(required) A JDBC URL to use to connect to a database

Note
The options are case-insensitive.
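For instance, partitionColumn, lowerBound, upperBound and numPartitions work together at load time (a sketch; the URL, table and bounds are placeholders, and an active SparkSession named spark is assumed):

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/demo")   // placeholder JDBC URL
  .option("dbtable", "public.orders")                  // placeholder table
  .option("partitionColumn", "id")                     // requires the three options below
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "8")
  .load()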

JDBCOptions is created when:

Creating JDBCOptions Instance

JDBCOptions takes the following when created:

  • JDBC URL

  • Name of the table

  • Case-insensitive configuration parameters (i.e. Map[String, String])

The input URL and table are set as the current url and dbtable options (overriding the values in the input parameters if defined).

Converting Parameters (Options) to Java Properties — asProperties Property

asProperties…​FIXME

Note

asProperties is used when:

asConnectionProperties Property

asConnectionProperties…​FIXME

Note
asConnectionProperties is used exclusively when JdbcUtils is requested to createConnectionFactory

JDBC Data Source

Spark SQL supports loading data from tables using JDBC.

JDBC

The JDBC API is the Java™ SE standard for database-independent connectivity between the Java™ programming language and a wide range of databases: SQL or NoSQL databases and tabular data sources like spreadsheets or flat files.

Read more on the JDBC API in JDBC Overview and in the official Java SE 8 documentation in Java JDBC API.

As a Spark developer, you use DataFrameReader.jdbc to load data from an external table using JDBC.
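For example (a sketch; the URL, table and properties are placeholders for your own database, and an active SparkSession named spark is assumed):

import java.util.Properties

val url = "jdbc:postgresql://localhost/demo"   // placeholder JDBC URL
val props = new Properties()

// Variant 1: the dedicated DataFrameReader.jdbc operator
val df1 = spark.read.jdbc(url, "public.people", props)

// Variant 2: the generic format("jdbc") with options
val df2 = spark.read.format("jdbc").option("url", url).option("dbtable", "public.people").load()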

These one-liners create a DataFrame that represents the distributed process of loading data from a database and a table (with additional properties).

AvroDataToCatalyst Unary Expression

AvroDataToCatalyst is a unary expression that represents the from_avro function in a structured query.
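from_avro comes with the spark-avro module; a minimal sketch (the org.apache.spark.sql.avro import location corresponds to Spark 2.4, and the DataFrame df with a binary value column is an assumption for the example):

import org.apache.spark.sql.avro.from_avro
import org.apache.spark.sql.functions.col

// Avro schema (in JSON format) that describes the binary payload in the value column.
val personSchema =
  """{"type": "record", "name": "person",
    | "fields": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}]}""".stripMargin

// from_avro creates a Column backed by AvroDataToCatalyst.
val decoded = df.select(from_avro(col("value"), personSchema) as "person")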

AvroDataToCatalyst takes the following when created:

Generating Java Source Code (ExprCode) For Code-Generated Expression Evaluation — doGenCode Method

Note
doGenCode is part of Expression Contract to generate a Java source code (ExprCode) for code-generated expression evaluation.

doGenCode requests the CodegenContext to generate code to reference this AvroDataToCatalyst instance.

In the end, doGenCode uses defineCodeGen with the function f that uses nullSafeEval.

nullSafeEval Method

Note
nullSafeEval is part of the UnaryExpression Contract to…​FIXME.

nullSafeEval…​FIXME

CatalystDataToAvro Unary Expression

CatalystDataToAvro is a unary expression that represents the to_avro function in a structured query.

CatalystDataToAvro takes a single Catalyst expression when created.
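to_avro comes from the same spark-avro module; a minimal sketch (the import location and the DataFrame df with id and name columns are assumptions for the example):

import org.apache.spark.sql.avro.to_avro
import org.apache.spark.sql.functions.{col, struct}

// to_avro creates a Column backed by CatalystDataToAvro.
val encoded = df.select(to_avro(struct(col("id"), col("name"))) as "value")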

Generating Java Source Code (ExprCode) For Code-Generated Expression Evaluation — doGenCode Method

Note
doGenCode is part of Expression Contract to generate a Java source code (ExprCode) for code-generated expression evaluation.

doGenCode requests the CodegenContext to generate code to reference this CatalystDataToAvro instance.

In the end, doGenCode uses defineCodeGen with the function f that uses nullSafeEval.

nullSafeEval Method

Note
nullSafeEval is part of the UnaryExpression Contract to…​FIXME.

nullSafeEval…​FIXME

AvroOptions — Avro Data Source Options

AvroOptions represents the options of the Avro data source.

Table 1. Options for Avro Data Source
Option / Key Default Value Description

avroSchema

(undefined)

Avro schema in JSON format

compression

(undefined)

Specifies the compression codec to use when writing Avro data to disk

Note
If the option is not defined explicitly, Avro data source uses spark.sql.avro.compression.codec configuration property.

ignoreExtension

false

Controls whether Avro data source should read all Avro files regardless of their extension (true) or not (false)

By default, Avro data source reads only files with .avro file extension.

Note
If the option is not defined explicitly, Avro data source uses avro.mapred.ignore.inputs.without.extension Hadoop runtime property.

recordName

topLevelRecord

Top-level record name when writing Avro data to disk

Consult Apache Avro™ 1.8.2 Specification

recordNamespace

(empty)

Record namespace when writing Avro data to disk

Consult Apache Avro™ 1.8.2 Specification

Note
The options are case-insensitive.
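For example (a sketch; the paths are placeholders, format("avro") assumes the spark-avro module is on the classpath, and an active SparkSession named spark is assumed):

// Read: treat all files as Avro regardless of their extension.
val events = spark.read
  .format("avro")
  .option("ignoreExtension", "true")
  .load("/tmp/events")                   // placeholder path

// Write: choose the compression codec and the top-level record name.
events.write
  .format("avro")
  .option("compression", "snappy")
  .option("recordName", "event")
  .save("/tmp/events_out")               // placeholder path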

AvroOptions is created when AvroFileFormat is requested to inferSchema, prepareWrite and buildReader.

Creating AvroOptions Instance

AvroOptions takes the following when created:

  • Case-insensitive configuration parameters (i.e. Map[String, String])

  • Hadoop Configuration

AvroFileFormat — FileFormat For Avro-Encoded Files

AvroFileFormat is a FileFormat for Apache Avro, i.e. a data source format that can read and write Avro-encoded data in files.

AvroFileFormat is a DataSourceRegister and registers itself as the avro data source.

AvroFileFormat is splitable, i.e. FIXME

Building Partitioned Data Reader — buildReader Method

Note
buildReader is part of the FileFormat Contract to build a PartitionedFile reader.

buildReader…​FIXME

Inferring Schema — inferSchema Method

Note
inferSchema is part of the FileFormat Contract to infer (return) the schema of the given files.

inferSchema…​FIXME

Preparing Write Job — prepareWrite Method

Note
prepareWrite is part of the FileFormat Contract to prepare a write job.

prepareWrite…​FIXME
