Encoder — Internal Row Converter

Encoder is the fundamental concept in the serialization and deserialization (SerDe) framework of Spark SQL 2.0. Spark SQL uses the SerDe framework for I/O to make it time- and space-efficient.

Tip
Spark borrowed the idea from the Hive SerDe library, so it may be worthwhile to get a little familiar with Hive, too.

Encoders are modelled in Spark SQL 2.0 as the Encoder[T] trait.
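
For reference, here is the public surface of the trait (paraphrased from the Spark 2.0 sources; the comments are mine):

import org.apache.spark.sql.types.StructType
import scala.reflect.ClassTag

trait Encoder[T] extends Serializable {
  def schema: StructType   // schema of the records of type T
  def clsTag: ClassTag[T]  // ClassTag used to create arrays of T at runtime
}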

The type T stands for the type of records an Encoder[T] can deal with. An encoder of type T, i.e. Encoder[T], is used to convert (encode and decode) any JVM object or primitive of type T (that could be your domain object) to and from Spark SQL's InternalRow, the internal binary row format representation (built using Catalyst expressions and code generation).

Note
Encoder is also called “a container of serde expressions in Dataset”.
Note
The one and only implementation of the Encoder trait in Spark SQL 2 is ExpressionEncoder.
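
A minimal sketch in spark-shell (the toRow and fromRow methods shown here belong to ExpressionEncoder in Spark 2.x):

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(id: Long, name: String)

// Derive an ExpressionEncoder for the Person case class.
val personEncoder = ExpressionEncoder[Person]()

// Serialize a JVM object to the internal binary row format...
val row = personEncoder.toRow(Person(0, "Jacek"))

// ...and deserialize it back (the encoder must be resolved and bound first).
val person = personEncoder.resolveAndBind().fromRow(row)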

Encoders are an integral (and internal) part of any Dataset[T] (of records of type T): the Encoder[T] is used to serialize and deserialize the records of the dataset.

Note
The Dataset[T] type is a Scala type constructor with the type parameter T, and so is Encoder[T], which handles serialization and deserialization of T to and from the internal representation.

Encoders know the schema of the records. This is how they can offer significantly faster serialization and deserialization (compared to the default Java or Kryo serializers).
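
For example, you can ask an encoder for the schema it works with:

import org.apache.spark.sql.Encoders

case class Person(id: Long, name: String)

// The encoder carries the schema of the records it handles.
Encoders.product[Person].schema
// StructType(StructField(id,LongType,false), StructField(name,StringType,true))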

You can create custom encoders using the static methods of the Encoders object. Note, however, that encoders for common Scala types and their product types are already available in the implicits object.

Tip
The default encoders are already imported in spark-shell.
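
A short sketch of how the implicit encoders are used (in a standalone application you import them from your SparkSession instance, here assumed to be called spark; in spark-shell the import is done for you):

import spark.implicits._

// toDS resolves the implicit Encoder[Int] behind the scenes.
val ds = Seq(1, 2, 3).toDS

// You can also summon the encoder explicitly.
val intEncoder = implicitly[org.apache.spark.sql.Encoder[Int]]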

Encoders map the columns of your dataset to the fields of your JVM object by name. It is through encoders that you can bridge JVM objects to data sources (CSV, JDBC, Parquet, Avro, JSON, Cassandra, Elasticsearch, MemSQL) and vice versa.
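
For example, reading a JSON data source into a Dataset of a domain type (people.json is a hypothetical input file whose records carry id and name fields):

case class Person(id: Long, name: String)

import spark.implicits._

// The columns of the JSON records are matched to Person's fields by name.
val people = spark.read.json("people.json").as[Person]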

Note
In Spark SQL 2.0, the DataFrame type is a mere type alias for Dataset[Row], with RowEncoder being the encoder.
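
A sketch of building an encoder for Rows (in Spark 2.x, RowEncoder lives in org.apache.spark.sql.catalyst.encoders and turns a schema into an ExpressionEncoder[Row]):

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val rowEncoder = RowEncoder(schema)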

Creating Custom Encoders (Encoders object)

The Encoders factory object defines methods to create Encoder instances.

Import the org.apache.spark.sql package to have access to the Encoders factory object.

You can find methods to create encoders for Java's object types, e.g. Boolean, Integer, Long, Double, String, java.sql.Timestamp or byte arrays, which can be composed to create more advanced encoders for Java bean classes (using the bean method).
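
A few of the factory methods (the bean example assumes a hypothetical Java bean class MyBean):

import org.apache.spark.sql.Encoders

Encoders.BOOLEAN    // Encoder[java.lang.Boolean]
Encoders.STRING     // Encoder[String]
Encoders.TIMESTAMP  // Encoder[java.sql.Timestamp]
Encoders.BINARY     // Encoder[Array[Byte]]

// For a Java bean class (hypothetical):
// Encoders.bean(classOf[MyBean])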

You can also create encoders based on Kryo or Java serializers.
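
These generic encoders serialize the whole object into a single binary column, e.g.:

import org.apache.spark.sql.Encoders

case class Person(id: Long, name: String)

Encoders.kryo[Person]              // Kryo-based Encoder[Person]
Encoders.javaSerialization[Person] // Java-serialization-based Encoder[Person]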

You can create encoders for Scala’s tuples and case classes, Int, Long, Double, etc.
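
For example:

import org.apache.spark.sql.Encoders

Encoders.scalaInt                // Encoder[Int]
Encoders.scalaDouble             // Encoder[Double]
Encoders.product[(String, Int)]  // encoder for a tuple (tuples are Products)

// Tuple encoders can also be composed from element encoders.
Encoders.tuple(Encoders.STRING, Encoders.scalaLong)  // Encoder[(String, Long)]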

