Avro Data Source

Spark SQL supports structured queries over Avro files as well as over Avro-encoded binary data in columns of a DataFrame.

Note

Apache Avro is a data serialization format and provides the following features:

  • Language-independent (with language bindings for popular programming languages, e.g. Java, Python)

  • Rich data structures

  • A compact, fast, binary data format (encoding)

  • A container file for sequences of Avro data (aka Avro data files)

  • Remote procedure call (RPC)

  • Optional code generation (as an optimization) to read or write data files and to implement RPC protocols

The Avro data source is provided by the external spark-avro module. Include it as a dependency in your Spark application (e.g. via spark-submit --packages or in build.sbt).
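For example, with build.sbt (the version shown is an assumption; use the version that matches your Spark distribution):

  // build.sbt: add the external spark-avro module as a dependency
  // (2.4.0 is an example version; match it to your Spark version)
  libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.0"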

The following shows how to include the spark-avro module in a spark-shell session.
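The artifact coordinates below are an example for Spark 2.4.0 built with Scala 2.12; adjust them to your installation.

  $ ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0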

Table 1. Functions for Avro

  Name        Description
  from_avro   Parses an Avro-encoded binary column and converts it to a
              Catalyst value per the given JSON-encoded Avro schema
  to_avro     Converts a column to an Avro-encoded binary column

After the module is loaded, you should import the org.apache.spark.sql.avro package to have the from_avro and to_avro functions available.
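In a spark-shell session:

  scala> import org.apache.spark.sql.avro._
  import org.apache.spark.sql.avro._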

Converting Column to Avro-Encoded Binary Column — to_avro Method

to_avro creates a Column with the CatalystDataToAvro unary expression (with the Catalyst expression of the given data column).
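A minimal sketch in spark-shell (assuming the spark-avro module is on the classpath as shown above; the column and value names are illustrative):

  // to_avro(data: Column): Column
  // spark.range gives a non-nullable long column named id, so the Avro
  // schema derived for encoding it is plain "long"
  val encoded = spark.range(3).select(to_avro($"id") as "avro_id")
  encoded.printSchema  // avro_id is a binary column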

Converting Avro-Encoded Column to Catalyst Value — from_avro Method

from_avro creates a Column with the AvroDataToCatalyst unary expression (with the Catalyst expression of the given data column and the jsonFormatSchema JSON-encoded schema).
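Continuing the sketch above, the binary column can be decoded back with the writer's schema passed as a JSON string:

  // from_avro(data: Column, jsonFormatSchema: String): Column
  // The JSON-encoded Avro schema has to match the schema the data was
  // encoded with; for the non-nullable long column above that is "long"
  val decoded = encoded.select(from_avro($"avro_id", """{"type": "long"}""") as "id")
  decoded.show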
