Avro Data Source
Spark SQL supports structured queries over Avro files as well as over Avro-encoded binary data in columns of a DataFrame.
Note: Apache Avro is a data serialization format.
The Avro data source is provided by the spark-avro external module. Include it as a dependency in your Spark application, e.g. via spark-submit --packages or in build.sbt.
The following shows how to include the spark-avro
module in a spark-shell
session.
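A sketch of such a session, using the spark-avro artifact published for Spark 2.4 with Scala 2.11 (adjust the version and Scala suffix to match your Spark installation):

```
./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
```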
Name | Description
---|---
from_avro | Parses an Avro-encoded binary column and converts it to a Catalyst value per the given JSON-encoded Avro schema
to_avro | Converts a column to an Avro-encoded binary column
After the module is loaded, you should import the org.apache.spark.sql.avro
package to have the from_avro and to_avro functions available.
Converting Column to Avro-Encoded Binary Column — to_avro Method

to_avro creates a Column with the CatalystDataToAvro unary expression (over the Catalyst expression of the given data column).
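A minimal usage sketch, assuming a local Spark 2.4 session with the spark-avro module on the classpath (in later Spark versions the function moved to the org.apache.spark.sql.avro.functions object):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.to_avro

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("to_avro demo")
  .getOrCreate()
import spark.implicits._

val df = Seq((1L, "one"), (2L, "two")).toDF("id", "name")

// Serialize each column into an Avro-encoded binary column (BinaryType)
val avroDF = df.select(
  to_avro($"id") as "avro_id",
  to_avro($"name") as "avro_name")
avroDF.printSchema()
```

Each resulting column is of BinaryType, holding the Avro binary encoding of the input column's values.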
Converting Avro-Encoded Column to Catalyst Value — from_avro Method

from_avro creates a Column with the AvroDataToCatalyst unary expression (over the Catalyst expression of the given data column and the jsonFormatSchema JSON-encoded schema).
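A round-trip sketch, again assuming a local Spark 2.4 session with the spark-avro module on the classpath; the jsonFormatSchema here is a hand-written Avro schema for a long value:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.{from_avro, to_avro}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("from_avro demo")
  .getOrCreate()
import spark.implicits._

// JSON-encoded Avro schema matching a non-nullable long column
val jsonFormatSchema = """{"type": "long"}"""

val df = Seq(1L, 2L, 3L).toDF("id")

// Encode to an Avro-encoded binary column, then decode back to a Catalyst value
val roundTrip = df.select(from_avro(to_avro($"id"), jsonFormatSchema) as "id")
roundTrip.show()
```

The decoded column has the Catalyst type implied by the schema (LongType here), so the round trip reproduces the original column.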