
Aggregator — Contract for User-Defined Typed Aggregate Functions (UDAFs)

Aggregator is the contract for user-defined typed aggregate functions (aka user-defined typed aggregations or UDAFs in short).

After you create a custom Aggregator, you should use toColumn method to convert it to a TypedColumn that can be used with Dataset.select and KeyValueGroupedDataset.agg typed operators.
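For example, a minimal custom Aggregator that sums the lengths of input strings could look as follows (a sketch: the SumOfLengths name is made up, and a SparkSession named spark is assumed to be in scope):

  import org.apache.spark.sql.{Encoder, Encoders}
  import org.apache.spark.sql.expressions.Aggregator

  // IN = String, BUF = Long, OUT = Long
  object SumOfLengths extends Aggregator[String, Long, Long] {
    def zero: Long = 0L                                        // initial buffer value
    def reduce(buffer: Long, input: String): Long = buffer + input.length
    def merge(b1: Long, b2: Long): Long = b1 + b2              // combine partial buffers
    def finish(reduction: Long): Long = reduction              // final output value
    def bufferEncoder: Encoder[Long] = Encoders.scalaLong
    def outputEncoder: Encoder[Long] = Encoders.scalaLong
  }

  import spark.implicits._
  val ds = Seq("hello", "spark").toDS
  // toColumn turns the Aggregator into a TypedColumn for Dataset.select
  ds.select(SumOfLengths.toColumn.name("total_length")).show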

Note

Use org.apache.spark.sql.expressions.scalalang.typed object to access the type-safe aggregate functions, i.e. avg, count, sum and sumLong.

Note

Aggregator is an Experimental and Evolving contract, i.e. it is evolving towards becoming a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Aggregator is used when…​FIXME

Table 1. Aggregator Contract
Method Description

bufferEncoder

Encoder of the intermediate aggregation buffer (of type BUF)

finish

Transforms the final aggregation buffer into the output value (of type OUT)

merge

Merges two intermediate aggregation buffers into one

outputEncoder

Encoder of the output value (of type OUT)

reduce

Aggregates (reduces) an input value (of type IN) into the current aggregation buffer

zero

The initial ("zero") value of the aggregation buffer

Table 2. Aggregators
Aggregator Description

ParameterizedTypeSum

ReduceAggregator

TopByKeyAggregator

Used exclusively in Spark MLlib

TypedAverage

TypedCount

TypedSumDouble

TypedSumLong

Converting Aggregator to TypedColumn — toColumn Method

toColumn…​FIXME

Note
toColumn is used when…​FIXME

UserDefinedAggregateFunction — Contract for User-Defined Untyped Aggregate Functions (UDAFs)

UserDefinedAggregateFunction is the contract to define user-defined aggregate functions (UDAFs).

UserDefinedAggregateFunction is created using apply or distinct factory methods.

The lifecycle of UserDefinedAggregateFunction is entirely managed using ScalaUDAF expression container.

Figure 1. UserDefinedAggregateFunction and ScalaUDAF Expression Container
Note

Use UDFRegistration to register a (temporary) UserDefinedAggregateFunction and use it in SQL mode.
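The following is a minimal sketch of an untyped UDAF that counts rows, registered under a made-up name mycount (the MyCount class and the nums view are assumptions of this example):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types._

  class MyCount extends UserDefinedAggregateFunction {
    def inputSchema: StructType = new StructType().add("value", LongType)
    def bufferSchema: StructType = new StructType().add("count", LongType)
    def dataType: DataType = LongType
    def deterministic: Boolean = true
    def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
    def update(buffer: MutableAggregationBuffer, input: Row): Unit =
      buffer(0) = buffer.getLong(0) + 1
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
      buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    def evaluate(buffer: Row): Any = buffer.getLong(0)
  }

  // Register the UDAF (temporarily) and use it in SQL mode
  spark.udf.register("mycount", new MyCount)
  spark.range(5).createOrReplaceTempView("nums")
  spark.sql("SELECT mycount(id) AS cnt FROM nums").show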

UserDefinedAggregateFunction Contract

Table 1. (Subset of) UserDefinedAggregateFunction Contract
Method Description

bufferSchema

Schema (StructType) of the aggregation buffer

dataType

DataType of the final result

deterministic

Flag that says whether the function always returns the same result for the same input

evaluate

Computes the final result from the aggregation buffer

initialize

Initializes the aggregation buffer

inputSchema

Schema (StructType) of the input arguments

merge

Merges two aggregation buffers

update

Updates the aggregation buffer with a new input row

Creating Column for UDAF — apply Method

apply creates a Column with ScalaUDAF (inside AggregateExpression).

Note
AggregateExpression uses Complete mode and isDistinct flag is disabled.

Creating Column for UDAF with Distinct Values — distinct Method

distinct creates a Column with ScalaUDAF (inside AggregateExpression).

Note
AggregateExpression uses Complete mode and isDistinct flag is enabled.
Note
distinct is like apply but has isDistinct flag enabled.
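A minimal usage sketch, assuming the MyCount UDAF from the example above:

  import org.apache.spark.sql.functions.col

  val myCount = new MyCount
  val withDuplicates = spark.range(5).union(spark.range(5))

  withDuplicates.select(
    myCount(col("id")) as "count_all",               // apply: isDistinct disabled
    myCount.distinct(col("id")) as "count_distinct"  // distinct: isDistinct enabled
  ).show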

Dataset Checkpointing

Dataset checkpointing is a feature of Spark SQL that truncates the logical query plan of a query, which is particularly useful for highly iterative data algorithms (e.g. Spark MLlib, which uses Spark SQL’s Dataset API for data manipulation).

Note

Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming – the now-obsolete Spark module for stream processing based on RDD API.

Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.

Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.

Checkpointing can be eager or lazy per eager flag of checkpoint operator. Eager checkpointing is the default checkpointing and happens immediately when requested. Lazy checkpointing does not and will only happen when an action is executed.

Using Dataset checkpointing requires that you specify the checkpoint directory. The directory stores the checkpoint files for RDDs to be checkpointed. Use SparkContext.setCheckpointDir to set the path to a checkpoint directory.

Checkpointing can be local or reliable which defines how reliable the checkpoint directory is. Local checkpointing uses executor storage to write checkpoint files to and due to the executor lifecycle is considered unreliable. Reliable checkpointing uses a reliable data storage like Hadoop HDFS.

Table 1. Dataset Checkpointing Types

              Eager                      Lazy
Reliable      checkpoint                 checkpoint(eager = false)
Local         localCheckpoint            localCheckpoint(eager = false)
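A short sketch of the four combinations (the checkpoint directory path is an example):

  // The checkpoint directory has to be set before (reliable) checkpointing
  spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

  val nums = spark.range(5)
  nums.checkpoint()                      // reliable and eager (default)
  nums.checkpoint(eager = false)         // reliable and lazy
  nums.localCheckpoint()                 // local and eager
  nums.localCheckpoint(eager = false)    // local and lazy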

An RDD can be recovered from checkpoint files using SparkContext.checkpointFile. You can use the SparkSession.internalCreateDataFrame method to (re)create the DataFrame from the RDD of internal binary rows.

Tip

Enable INFO logging level for org.apache.spark.rdd.ReliableRDDCheckpointData logger to see what happens while an RDD is checkpointed.

Add the following line to conf/log4j.properties:
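  log4j.logger.org.apache.spark.rdd.ReliableRDDCheckpointData=INFO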

Refer to Logging.

Specifying Checkpoint Directory — SparkContext.setCheckpointDir Method

setCheckpointDir sets the checkpoint directory.

Internally, setCheckpointDir…​FIXME

Recovering RDD From Checkpoint Files — SparkContext.checkpointFile Method

checkpointFile reads (recovers) an RDD from a checkpoint directory.

Note
SparkContext.checkpointFile is a protected[spark] method so the code to access it has to be in org.apache.spark package.

Internally, checkpointFile creates a ReliableCheckpointRDD in a scope.
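A sketch of how the access restriction can be worked around (RecoverCheckpoint is a hypothetical helper):

  package org.apache.spark

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  object RecoverCheckpoint {
    // Has to live in org.apache.spark because checkpointFile is protected[spark]
    def recover[T: ClassTag](sc: SparkContext, checkpointPath: String): RDD[T] =
      sc.checkpointFile[T](checkpointPath)
  }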

User-Friendly Names Of Cached Queries in web UI’s Storage Tab

As you may have noticed, web UI’s Storage tab displays some cached queries with user-friendly RDD names (e.g. “In-memory table [name]”) while others are not user-friendly (e.g. “Scan JDBCRelation…​”).

Figure 1. Cached Queries in web UI (Storage Tab)

“In-memory table [name]” RDD names are the result of SQL’s CACHE TABLE or when Catalog is requested to cache a table.

The other RDD names are due to caching a Dataset.

Dataset Caching and Persistence

Table 1. Caching Operators (Basic Actions)
Operator Description

cache

Basic action to cache a Dataset

persist

Basic action to persist a Dataset

unpersist

Basic action to unpersist a cached Dataset

Note

You can also use SQL’s CACHE TABLE [tableName] to cache tableName table in memory. Unlike the cache and persist operators, CACHE TABLE is eager and caches the table as soon as the statement is executed.

You can however use the LAZY keyword to make the caching lazy.

Use SQL’s REFRESH TABLE [tableName] to refresh a cached table.

Use SQL’s UNCACHE TABLE (IF EXISTS)? [tableName] to remove a table from cache.

Use SQL’s CLEAR CACHE to remove all tables from cache.
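A sketch of the statements above issued through spark.sql (t1 is an example table name):

  spark.sql("CACHE TABLE t1")             // eager caching
  spark.sql("CACHE LAZY TABLE t1")        // lazy caching
  spark.sql("REFRESH TABLE t1")
  spark.sql("UNCACHE TABLE IF EXISTS t1")
  spark.sql("CLEAR CACHE")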

Note

Be careful what you cache, i.e. which Dataset is cached, as different Datasets give different cached query plans.

Tip

You can check whether a Dataset was cached or not using the following code:
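A sketch using the session’s internal CacheManager to look up the Dataset’s logical plan (q is an example query):

  val q = spark.range(5)
  val cache = spark.sharedState.cacheManager
  cache.lookupCachedData(q.queryExecution.logical).isDefined  // false until q is cached

  q.cache
  cache.lookupCachedData(q.queryExecution.logical).isDefined  // true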

SQL’s CACHE TABLE

SQL’s CACHE TABLE corresponds to requesting the session-specific Catalog to cache the table.

Internally, CACHE TABLE becomes CacheTableCommand runnable command that…​FIXME

Multi-Dimensional Aggregation

Multi-dimensional aggregate operators are enhanced variants of groupBy operator that allow you to create queries for subtotals, grand totals and superset of subtotals in one go.

Multi-dimensional aggregate operators are semantically equivalent to using the union operator (or SQL’s UNION ALL) to combine single grouping queries.

Note

It is assumed that using one of the operators is usually more efficient (than union and groupBy) as it gives more freedom for query optimization.

Table 1. Multi-dimensional Aggregate Operators
Operator Return Type Description

cube

RelationalGroupedDataset

Calculates subtotals and a grand total for every permutation of the columns specified.

rollup

RelationalGroupedDataset

Calculates subtotals and a grand total over (ordered) combination of groups.

Besides the cube and rollup multi-dimensional aggregate operators, Spark SQL supports the GROUPING SETS clause (in SQL mode only).

Note
SQL’s GROUPING SETS is the most general aggregate “operator” and can generate the same dataset as using a simple groupBy, cube and rollup operators.

Tip
Review the examples per operator in the following sections.
Note
Support for multi-dimensional aggregate operators was added in [SPARK-6356] Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext.

rollup Operator

rollup multi-dimensional aggregate operator is an extension of the groupBy operator that calculates subtotals and a grand total across n + 1 levels of grouping (with n being the number of columns given as col1 and cols, plus 1 for the grand total where all grouping column values become null, i.e. undefined).

Note

rollup operator is commonly used for analysis over hierarchical data; e.g. total salary by department, division, and company-wide total.

Note
rollup operator is equivalent to GROUP BY ... WITH ROLLUP in SQL (which in turn is equivalent to GROUP BY ... GROUPING SETS ((a,b,c),(a,b),(a),()) when used with 3 columns: a, b, and c).

The ROLLUP, CUBE, and GROUPING SETS operators are extensions of the GROUP BY clause. The ROLLUP, CUBE, or GROUPING SETS operators can generate the same result set as when you use UNION ALL to combine single grouping queries; however, using one of the GROUP BY operators is usually more efficient.

References to the grouping columns or expressions are replaced by null values in result rows for grouping sets in which those columns do not appear.

From Summarizing Data Using ROLLUP in Microsoft’s TechNet:

The ROLLUP operator is useful in generating reports that contain subtotals and totals. (…​)
ROLLUP generates a result set that shows aggregates for a hierarchy of values in the selected columns.

From Hive’s Cubes and Rollups:

WITH ROLLUP is used with the GROUP BY only. ROLLUP clause is used with GROUP BY to compute the aggregate at the hierarchy levels of a dimension.

GROUP BY a, b, c with ROLLUP assumes that the hierarchy is “a” drilling down to “b” drilling down to “c”.

GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

Note
Read up on ROLLUP in Hive’s LanguageManual in Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function.

The individual elements of a CUBE or ROLLUP clause may be either individual expressions, or sublists of elements in parentheses. In the latter case, the sublists are treated as single units for the purposes of generating the individual grouping sets.

Internally, rollup converts the Dataset into a DataFrame (i.e. uses RowEncoder as the encoder) and then creates a RelationalGroupedDataset (with RollupType group type).

Note
Rollup expression represents GROUP BY ... WITH ROLLUP in SQL in Spark’s Catalyst Expression tree (after AstBuilder parses a structured query with aggregation).
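A sketch over a small made-up sales dataset:

  import org.apache.spark.sql.functions.sum
  import spark.implicits._

  val sales = Seq(
    ("Warsaw", 2016, 100),
    ("Warsaw", 2017, 200),
    ("Boston", 2017, 150)
  ).toDF("city", "year", "amount")

  // Subtotals per (city, year) and per city, plus a grand total (city and year are null)
  sales
    .rollup("city", "year")
    .agg(sum("amount") as "amount")
    .sort($"city".asc_nulls_last, $"year".asc_nulls_last)
    .show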

cube Operator

cube multi-dimensional aggregate operator is an extension of the groupBy operator that calculates subtotals and a grand total across all combinations of the specified group of columns (given as col1 and cols), with grouping column values becoming null, i.e. undefined, for the levels they are not part of.

cube returns RelationalGroupedDataset that you can use to execute aggregate function or operator.

Note
cube is more than rollup operator, i.e. cube does rollup with aggregation over all the missing combinations given the columns.
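Compared to rollup, a cube over the same (hypothetical) sales dataset from the rollup example also produces the year-only subtotals:

  sales
    .cube("city", "year")
    .agg(sum("amount") as "amount")
    .sort($"city".asc_nulls_last, $"year".asc_nulls_last)
    .show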

GROUPING SETS SQL Clause

GROUPING SETS clause generates a dataset that is equivalent to union operator of multiple groupBy operators.
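A sketch in SQL mode, using the sales dataset from the rollup example registered as a temporary view:

  sales.createOrReplaceTempView("sales")
  spark.sql("""
    SELECT city, year, sum(amount) AS amount
    FROM sales
    GROUP BY city, year
    GROUPING SETS ((city, year), (city), ())
    ORDER BY city NULLS LAST, year NULLS LAST
  """).show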

Internally, the GROUPING SETS clause is parsed in the withAggregation parsing handler (in AstBuilder) and becomes a GroupingSets logical operator.

Rollup GroupingSet with CodegenFallback Expression (for rollup Operator)

Rollup expression represents rollup operator in Spark’s Catalyst Expression tree (after AstBuilder parses a structured query with aggregation).

Note
GroupingSet is an Expression with CodegenFallback support.

Data Types

DataType abstract class is the base type of all built-in data types in Spark SQL, e.g. strings, longs.

DataType has two main type families:

  • Atomic Types as an internal type to represent types that are not null, UDTs, arrays, structs, and maps

  • Numeric Types with fractional and integral types

Table 1. Standard Data Types
Type Family / Data Types (Scala Type)

  • Atomic Types (except fractional and integral types): BinaryType, BooleanType, DateType, StringType, TimestampType (java.sql.Timestamp)

  • Fractional Types (concrete NumericType): DecimalType, DoubleType, FloatType

  • Integral Types (concrete NumericType): ByteType, IntegerType, LongType, ShortType

  • ArrayType, CalendarIntervalType, MapType, NullType, ObjectType, StructType, UserDefinedType

  • AnyDataType: matches any concrete data type

Caution
FIXME What about AbstractDataType?

You can extend the type system and create your own user-defined types (UDTs).

The DataType Contract defines methods to build SQL, JSON and string representations.

Note
DataType (and the concrete Spark SQL types) live in org.apache.spark.sql.types package.

You should use DataTypes object in your code to create complex Spark SQL types, i.e. arrays or maps.

DataType has support for Scala’s pattern matching using unapply method.

DataType Contract

Any type in Spark SQL follows the DataType contract which means that the types define the following methods:

  • json and prettyJson to build JSON representations of a data type

  • defaultSize to know the default size of values of a type

  • simpleString and catalogString to build user-friendly string representations (with the latter for external catalogs)

  • sql to build SQL representation
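For example (a quick sketch in the Scala REPL):

  import org.apache.spark.sql.types._

  IntegerType.json           // "integer"
  IntegerType.prettyJson     // "integer"
  IntegerType.defaultSize    // 4
  IntegerType.simpleString   // int
  IntegerType.catalogString  // int
  IntegerType.sql            // INT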

DataTypes — Factory Methods for Data Types

DataTypes is a Java class with methods to access simple or create complex DataType types in Spark SQL, i.e. arrays and maps.

Tip
It is recommended to use DataTypes class to define DataType types in a schema.

DataTypes lives in org.apache.spark.sql.types package.

Note

Simple DataType types themselves, i.e. StringType or CalendarIntervalType, come with their own Scala’s case objects alongside their definitions.

You may also import the types package and have access to the types.
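A short sketch of both ways:

  // Using the DataTypes factory (simple types as constants, complex types via create* methods)
  import org.apache.spark.sql.types.DataTypes
  val arrayOfStrings = DataTypes.createArrayType(DataTypes.StringType)
  val mapOfStringToLong = DataTypes.createMapType(DataTypes.StringType, DataTypes.LongType)

  // Or importing the types package and using the case objects and case classes directly
  import org.apache.spark.sql.types._
  val sameArray = ArrayType(StringType)
  val sameMap = MapType(StringType, LongType)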

UDTs — User-Defined Types

Caution
FIXME

StructField — Single Field in StructType

StructField describes a single field in a StructType with the following:

  • Name

  • DataType

  • nullable flag (enabled by default)

  • Metadata (empty by default)

A comment is part of metadata under comment key and is used to build a Hive column or when describing a table.
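A sketch of a StructField with a comment in its metadata (the field name and comment text are examples):

  import org.apache.spark.sql.types._

  val idField = StructField(
    name = "id",
    dataType = LongType,
    nullable = false,
    metadata = new MetadataBuilder().putString("comment", "unique identifier").build())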

As of Spark 2.4.0, StructField can be converted to DDL format using toDDL method.

Converting to DDL Format — toDDL Method

toDDL gives a text in the format:
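  `name` DATATYPE [COMMENT 'comment']

i.e. the quoted field name followed by the SQL representation of the data type and an optional comment, e.g. `id` BIGINT COMMENT 'unique identifier'.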

Note

toDDL is used when StructType is requested to convert itself to DDL format (see StructType’s toDDL).

StructType — Data Type for Schema Definition

StructType is a built-in data type that is a collection of StructFields.

StructType is used to define a schema or its part.

You can compare two StructType instances to see whether they are equal.

StructType presents itself as <struct> or STRUCT in query plans or SQL.

Note

StructType is a Seq[StructField] and therefore all things Seq apply equally here.

Read the official documentation of Scala’s scala.collection.Seq.

As of Spark 2.4.0, StructType can be converted to DDL format using toDDL method.

fromAttributes Method

fromAttributes…​FIXME

Note
fromAttributes is used when…​FIXME

toAttributes Method

toAttributes…​FIXME

Note
toAttributes is used when…​FIXME

Adding Fields to Schema — add Method

You can add a new StructField to your StructType. There are different variants of add method that all make for a new StructType with the field added.
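A quick sketch of a few add variants:

  import org.apache.spark.sql.types._

  // Every add call returns a new StructType with the field appended
  val schema = new StructType()
    .add("id", LongType, nullable = false)
    .add("name", StringType)
    .add(StructField("city", StringType))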

DataType Name Conversions

StructType as a custom DataType is used in query plans or SQL. It can present itself using simpleString, catalogString or sql (see DataType Contract).

Accessing StructField — apply Method

StructType defines its own apply method that gives you an easy access to a StructField by name.

Creating StructType from Existing StructType — apply Method

This variant of apply lets you create a StructType out of an existing StructType with the names only.

It will throw an IllegalArgumentException exception when a field could not be found.
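A sketch of both apply variants, using the schema built in the add example above:

  // Access a single StructField by name
  val nameField = schema("name")

  // Create a new StructType with the selected fields only
  val idAndName = schema(Set("id", "name"))

  // Unknown names throw an IllegalArgumentException
  // schema(Set("id", "unknown"))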

Displaying Schema As Tree — printTreeString Method

printTreeString prints out the schema to standard output.

Internally, it uses treeString method to build the tree and then println it.
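For the schema from the add example above, printTreeString gives:

  scala> schema.printTreeString
  root
   |-- id: long (nullable = false)
   |-- name: string (nullable = true)
   |-- city: string (nullable = true)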

Creating StructType For DDL-Formatted Text — fromDDL Object Method

fromDDL…​FIXME

Note
fromDDL is used when…​FIXME
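A sketch of creating a schema from DDL-formatted text:

  import org.apache.spark.sql.types.StructType

  val schemaFromDDL = StructType.fromDDL("id BIGINT, name STRING")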

Converting to DDL Format — toDDL Method

toDDL converts all the fields to DDL format and concatenates them using the comma (,).
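For example (a sketch; the exact output may differ slightly across Spark versions):

  scala> StructType.fromDDL("id BIGINT, name STRING").toDDL
  res0: String = `id` BIGINT,`name` STRING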

Schema — Structure of Data

A schema is the description of the structure of your data (which together create a Dataset in Spark SQL). It can be implicit (and inferred at runtime) or explicit (and known at compile time).

A schema is described using StructType which is a collection of StructField objects (that in turn are tuples of names, types, and nullability classifier).

StructType and StructField belong to the org.apache.spark.sql.types package.

You can use the canonical string representation of SQL types to describe the types in a schema (that is inherently untyped at compile time) or use type-safe types from the org.apache.spark.sql.types package.

Tip
Read up on CatalystSqlParser that is responsible for parsing data types.

It is however recommended to use the singleton DataTypes class with static methods to create schema types.

StructType offers printTreeString that makes presenting the schema more user-friendly.

As of Spark 2.0, you can describe the schema of your strongly-typed datasets using encoders.

Implicit Schema
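A minimal sketch (assuming a SparkSession named spark):

  import spark.implicits._

  // The schema is inferred implicitly from the Scala types of the values
  val df = Seq((0, "hello"), (1, "world")).toDF("id", "token")
  df.printSchema
  // root
  //  |-- id: integer (nullable = false)
  //  |-- token: string (nullable = true)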
