
DataFrameReader — Loading Data From External Data Sources

DataFrameReader is the public interface to describe how to load data from an external data source (e.g. files, tables, JDBC or Dataset[String]).

Table 1. DataFrameReader API

Method     Description
csv        Loads CSV files (or a Dataset[String] of CSV lines) into a DataFrame
format     Specifies the format of the input data source
jdbc       Loads data from an external table over JDBC
json       Loads JSON files (or a Dataset[String] of JSON lines) into a DataFrame
load       Loads data from a data source as an untyped DataFrame
option     Sets a single load option
options    Sets multiple load options given as a Map
orc        Loads ORC files into a DataFrame
parquet    Loads Parquet files into a DataFrame
schema     Specifies the user-defined schema of the input data
table      Loads a table into a DataFrame
text       Loads text files into an untyped DataFrame
textFile   Loads text files into a typed Dataset[String]

DataFrameReader is available using SparkSession.read.
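A minimal sketch, assuming a SparkSession named spark is already available (as in spark-shell):

  import org.apache.spark.sql.DataFrameReader

  // SparkSession.read gives you a fresh DataFrameReader
  val reader: DataFrameReader = spark.read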

DataFrameReader supports many file formats natively and offers the interface to define custom formats.

Note
DataFrameReader assumes the parquet data source by default, which you can change using the spark.sql.sources.default Spark property.

After you have described the loading pipeline (i.e. the “Extract” part of ETL in Spark SQL), you eventually “trigger” the loading using format-agnostic load or format-specific (e.g. json, csv, jdbc) operators.
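A sketch of the two styles (people.csv is a hypothetical input path):

  // Describe the pipeline: format and options; nothing is loaded yet
  val reader = spark.read
    .format("csv")
    .option("header", "true")

  // Trigger the loading with the format-agnostic load...
  val df1 = reader.load("people.csv")

  // ...or with the format-specific csv operator
  val df2 = spark.read.option("header", "true").csv("people.csv")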

Note
All methods of DataFrameReader merely describe a process of loading data and do not trigger a Spark job (until an action is called).

DataFrameReader can read text files using textFile methods that return typed Datasets.

Note
Loading datasets using textFile methods allows for additional preprocessing before final processing of the string values as json or csv lines.
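A sketch of such preprocessing (the logs.json file and its line prefix are hypothetical):

  import spark.implicits._
  import org.apache.spark.sql.Dataset

  // Load the raw lines first...
  val lines: Dataset[String] = spark.read.textFile("logs.json")

  // ...preprocess them as plain strings (here: strip everything
  // before the opening brace)...
  val cleaned: Dataset[String] = lines.map(_.dropWhile(_ != '{'))

  // ...and only then parse them as JSON
  val df = spark.read.json(cleaned)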

(New in Spark 2.2) DataFrameReader can load datasets from Dataset[String] (with lines being complete “files”) using format-specific csv and json operators.
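A minimal sketch:

  import spark.implicits._

  // Each element of the Dataset[String] is a complete CSV line
  val csvLines = Seq("id,name", "0,hello", "1,world").toDS
  val df = spark.read.option("header", "true").csv(csvLines)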

Table 2. DataFrameReader’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

extraOptions

Used when…​FIXME

source

Name of the input data source (aka format or provider) with the default format per spark.sql.sources.default configuration property (default: parquet).

source can be changed using format method.

Used exclusively when DataFrameReader is requested to load.

userSpecifiedSchema

Optional user-specified schema (default: None, i.e. undefined)

Set when DataFrameReader is requested to set a schema, load data from an external data source, loadV1Source (when creating a DataSource), and load data using the json and csv file formats

Used when DataFrameReader is requested to assertNoSpecifiedSchema (while loading data using jdbc, table and textFile)

Specifying Format Of Input Data Source — format method

You use format to configure DataFrameReader to use the appropriate source format (see the sketch below).

Supported data formats:

  • json

  • csv (since 2.0.0)

  • parquet (see Parquet)

  • orc

  • text

  • jdbc

  • libsvm — only when used in format("libsvm")

Note
Spark SQL allows for developing custom data source formats.
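For the built-in formats, format is interchangeable with the format-specific operators. A sketch (people.csv is hypothetical):

  // Equivalent ways to describe a CSV source
  val csv1 = spark.read.format("csv").load("people.csv")
  val csv2 = spark.read.csv("people.csv")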

Specifying Schema — schema method

schema allows for specifying the schema of a data source (that the DataFrameReader is about to read a dataset from).

Note
Some formats can infer schema from datasets (e.g. csv or json) using inferSchema option.
Tip
Read up on Schema.
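A sketch with an explicit schema (people.json is a hypothetical input path):

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType)))

  // An explicit schema lets Spark skip (possibly costly) schema inference
  val people = spark.read.schema(schema).json("people.json")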

Specifying Load Options — option and options Methods

You use the option method to set a single load option and the options method to specify multiple options at once in a single Map.
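A sketch:

  // Options set one at a time...
  val r1 = spark.read
    .option("header", "true")
    .option("inferSchema", "true")

  // ...or all at once as a Map
  val r2 = spark.read.options(Map(
    "header" -> "true",
    "inferSchema" -> "true"))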

Loading Datasets from Files (into DataFrames) Using Format-Specific Load Operators

DataFrameReader supports the following file formats:

json method

New in 2.0.0: prefersDecimal
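A sketch of the option (numbers.json is hypothetical):

  // With prefersDecimal enabled, schema inference reads floating-point
  // values as DecimalType instead of the default DoubleType
  val df = spark.read
    .option("prefersDecimal", "true")
    .json("numbers.json")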

csv method

parquet method

The supported options:

New in 2.0.0: snappy is the default Parquet codec. See [SPARK-14482][SQL] Change default Parquet codec from gzip to snappy.

The supported compression codecs (see the sketch after the list):

  • none or uncompressed

  • snappy – the default codec since Spark 2.0.0

  • gzip – the default codec in Spark before 2.0.0

  • lzo
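Note that the codec is a write-side concern; on read Spark detects it from the Parquet file metadata. A sketch, assuming an existing DataFrame df (the paths are hypothetical):

  // Write with an explicit codec (snappy would be used by default)...
  df.write.option("compression", "gzip").parquet("data.gz.parquet")

  // ...and read it back; no codec option is needed
  val df2 = spark.read.parquet("data.gz.parquet")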

orc method

Optimized Row Columnar (ORC) is a highly efficient columnar file format for storing Hive data, designed to handle tables with more than 1,000 columns and to improve performance. The ORC format was introduced in Hive 0.11 to use and retain the type information from the table definition.

Tip
Read ORC Files document to learn about the ORC file format.
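A minimal sketch (data.orc is hypothetical):

  // Reading ORC files works like any other built-in format
  val df1 = spark.read.orc("data.orc")
  val df2 = spark.read.format("orc").load("data.orc")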

text method

text loads text files and returns an untyped DataFrame with a single string column named value.

Example
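A minimal sketch (README.md as the hypothetical input):

  val lines = spark.read.text("README.md")

  lines.printSchema
  // root
  //  |-- value: string (nullable = true)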

Loading Table to DataFrame — table Method

table loads the content of the tableName table into an untyped DataFrame.

Note
table simply passes the call to SparkSession.table after making sure that a user-defined schema has not been specified.
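A sketch using a temporary view:

  import spark.implicits._

  // Register a temporary view and load it back as a DataFrame
  Seq((0, "hello"), (1, "world")).toDF("id", "text").createOrReplaceTempView("t")
  val t = spark.read.table("t")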

Loading Data From External Table using JDBC Data Source — jdbc Method

jdbc loads data from an external table using the JDBC data source.

Internally, jdbc creates a JDBCOptions from the input url, table and extraOptions with connectionProperties.

jdbc then creates one JDBCPartition per predicate.

In the end, jdbc requests the SparkSession to create a DataFrame for a JDBCRelation (with JDBCPartitions and JDBCOptions created earlier).

Note

jdbc does not support a custom schema and throws an AnalysisException if one is defined.

Note
jdbc method uses java.util.Properties (and appears overly Java-centric). Use format("jdbc") instead.
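A sketch of the option-based variant the note recommends (the URL, table name and credentials are hypothetical):

  val people = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost/demo")
    .option("dbtable", "people")
    .option("user", "demo")
    .option("password", "demo")
    .load()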

Loading Datasets From Text Files — textFile Method

textFile loads one or many text files into a typed Dataset[String].

Note
textFile methods are similar to the text family of methods in that they both read text files, but the text methods return an untyped DataFrame while the textFile methods return a typed Dataset[String].

Internally, textFile passes calls on to text method and selects the only value column before it applies Encoders.STRING encoder.
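In other words, textFile is roughly equivalent to the following sketch (README.md is hypothetical):

  import org.apache.spark.sql.{Dataset, Encoders}

  val typed: Dataset[String] = spark.read.textFile("README.md")

  // ...roughly the same as:
  val equivalent: Dataset[String] =
    spark.read.text("README.md").select("value").as(Encoders.STRING)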

Creating DataFrameReader Instance

DataFrameReader takes a single SparkSession when created.

Loading Dataset (Data Source API V1) — loadV1Source Internal Method

loadV1Source creates a DataSource and requests it to resolve the underlying relation (as a BaseRelation).

In the end, loadV1Source requests SparkSession to create a DataFrame from the BaseRelation.

Note
loadV1Source is used when DataFrameReader is requested to load (and the data source is not of DataSourceV2 type, or a DataSourceReader could not be created for it).

Loading Dataset from Data Source — load Method

load loads a dataset from a data source (with optional support for multiple paths) as an untyped DataFrame.

Internally, load looks up the data source (using lookupDataSource) for the given source. load then branches off per its type (i.e. whether it is of the DataSourceV2 marker type or not).

For a “Data Source V2” data source, load…​FIXME

Otherwise, if the source is not a “Data Source V2” data source, load simply delegates to loadV1Source.

load throws an AnalysisException when the source format is hive.
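For example:

  // Throws org.apache.spark.sql.AnalysisException: the hive format
  // can only be used with tables, not loaded directly as files
  spark.read.format("hive").load()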

assertNoSpecifiedSchema Internal Method

assertNoSpecifiedSchema throws an AnalysisException if the userSpecifiedSchema is defined.

Note
assertNoSpecifiedSchema is used when DataFrameReader is requested to load data using jdbc, table and textFile.

verifyColumnNameOfCorruptRecord Internal Method

verifyColumnNameOfCorruptRecord…​FIXME

Note
verifyColumnNameOfCorruptRecord is used when DataFrameReader is requested to load data using json and csv.