Schema — Structure of Data
A schema is the description of the structure of your data (and, together with the data itself, makes up a Dataset in Spark SQL). It can be implicit (and inferred at runtime) or explicit (and known at compile time).
A schema is described using StructType, which is a collection of StructField objects (each of which is in turn a tuple of a name, a type, and a nullability flag).
StructType and StructField belong to the org.apache.spark.sql.types package.
import org.apache.spark.sql.types.StructType

val schemaUntyped = new StructType()
  .add("a", "int")
  .add("b", "string")

// alternatively, using the Schema DSL
// (the $"..." notation requires import spark.implicits._)
val schemaUntyped_2 = new StructType()
  .add($"a".int)
  .add($"b".string)
You can use the canonical string representation of SQL types to describe the types in a schema (which is inherently untyped at compile time) or use type-safe types from the org.apache.spark.sql.types package.
// equivalent to the above expressions
import org.apache.spark.sql.types.{IntegerType, StringType}

val schemaTyped = new StructType()
  .add("a", IntegerType)
  .add("b", StringType)
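Both flavors produce the same schema: the string names resolve to the same Catalyst types, and StructType compares its fields structurally. A quick check against the values defined above:

// "int" and "string" resolve to IntegerType and StringType,
// so the string-based and type-safe schemas are equal
schemaUntyped == schemaTyped // true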
Tip: Read up on CatalystSqlParser, which is responsible for parsing data types.
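For illustration (a minimal sketch; CatalystSqlParser is part of the internal Catalyst API, so its package location may change between releases), the parser turns canonical string representations into DataType instances:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

CatalystSqlParser.parseDataType("int")                // IntegerType
CatalystSqlParser.parseDataType("map<bigint,string>") // MapType(LongType,StringType,true)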
It is, however, recommended to use the singleton DataTypes class and its static factory methods to create schema types.
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.types.DataTypes._

val schemaWithMap = StructType(
  StructField("map", createMapType(LongType, StringType), false) :: Nil)
StructType offers printTreeString, which presents a schema in a user-friendly tree format.
scala> schemaTyped.printTreeString
root
 |-- a: integer (nullable = true)
 |-- b: string (nullable = true)

scala> schemaWithMap.printTreeString
root
 |-- map: map (nullable = false)
 |    |-- key: long
 |    |-- value: string (valueContainsNull = true)

// You can use prettyJson method on any DataType
scala> println(schemaTyped.prettyJson)
{
  "type" : "struct",
  "fields" : [ {
    "name" : "a",
    "type" : "integer",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "b",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
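The JSON representation also round-trips: DataType.fromJson parses a schema back from its json output. A small sketch using the schemaTyped value defined above:

import org.apache.spark.sql.types.DataType

// json gives the compact form; fromJson parses it back into a DataType
val restored = DataType.fromJson(schemaTyped.json)
assert(restored == schemaTyped)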
As of Spark 2.0, you can describe the schema of your strongly-typed datasets using encoders.
import org.apache.spark.sql.Encoders

scala> Encoders.INT.schema.printTreeString
root
 |-- value: integer (nullable = true)

scala> Encoders.product[(String, java.sql.Timestamp)].schema.printTreeString
root
 |-- _1: string (nullable = true)
 |-- _2: timestamp (nullable = true)

case class Person(id: Long, name: String)
scala> Encoders.product[Person].schema.printTreeString
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
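An explicit schema is most useful when loading external data, as it lets Spark skip schema inference entirely. A minimal sketch, assuming a hypothetical people.csv file with matching columns:

import org.apache.spark.sql.types.{LongType, StringType, StructType}

val peopleSchema = new StructType()
  .add("id", LongType, nullable = false)
  .add("name", StringType)

// pass the schema explicitly so Spark does not infer it from the data
val people = spark.read
  .schema(peopleSchema)
  .option("header", "true")
  .csv("people.csv") // hypothetical file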
Implicit Schema
val df = Seq((0, s"""hello\tworld"""), (1, "two  spaces inside")).toDF("label", "sentence")

scala> df.printSchema
root
 |-- label: integer (nullable = false)
 |-- sentence: string (nullable = true)

scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(label,IntegerType,false), StructField(sentence,StringType,true))

scala> df.schema("label").dataType
res1: org.apache.spark.sql.types.DataType = IntegerType
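Under the covers, the implicit schema is derived from the element type through Scala reflection. A peek using the internal ScalaReflection API (illustrative only, since internal APIs may change between releases):

import org.apache.spark.sql.catalyst.ScalaReflection

// schemaFor derives the Catalyst schema for a given Scala type
ScalaReflection.schemaFor[(Int, String)].dataType
// => StructType(StructField(_1,IntegerType,false), StructField(_2,StringType,true))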