
Broadcast Joins (aka Map-Side Joins)

Spark SQL uses a broadcast join (aka broadcast hash join) instead of a shuffle-based join to optimize join queries when the size of the data on one side is below spark.sql.autoBroadcastJoinThreshold.

Broadcast join can be very efficient for joins between a large table (fact) and relatively small tables (dimensions), as used in a star-schema join. It avoids sending all the data of the large table over the network.

You can use the broadcast function or SQL's broadcast hints to mark a dataset to be broadcast when used in a join query.
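As a minimal sketch (assuming a spark-shell session where spark and its implicits are available; the events and lookup names are made up for illustration):

  import org.apache.spark.sql.functions.broadcast
  import spark.implicits._

  val events = Seq((1, "click"), (2, "view"), (1, "view")).toDF("id", "action")  // the "large" side
  val lookup = Seq((1, "alice"), (2, "bob")).toDF("id", "name")                  // the "small" side

  // Dataset API: explicitly mark the small side to be broadcast
  events.join(broadcast(lookup), Seq("id"))

  // SQL: the BROADCAST (aka BROADCASTJOIN or MAPJOIN) hint achieves the same
  events.createOrReplaceTempView("events")
  lookup.createOrReplaceTempView("lookup")
  spark.sql("SELECT /*+ BROADCAST(lookup) */ * FROM events JOIN lookup ON events.id = lookup.id")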

Note
According to the article Map-Side Join in Spark, broadcast join is also called a replicated join (in the distributed system community) or a map-side join (in the Hadoop community).

The CanBroadcast object matches a LogicalPlan whose output is small enough for a broadcast join.

Note
Currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan has been run.

JoinSelection execution planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10MB) to control the maximum size of a dataset that is broadcast to all worker nodes when performing a join.
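For reference, a minimal sketch of tuning the threshold at runtime (the value is in bytes; -1 disables automatic broadcast joins):

  // Raise the auto-broadcast threshold to 100 MB
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

  // Disable automatic broadcast joins altogether
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)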

Dataset Join Operators

From PostgreSQL’s 2.6. Joins Between Tables:

Queries can access multiple tables at once, or access the same table in such a way that multiple rows of the table are being processed at the same time. A query that accesses multiple rows of the same or different tables at one time is called a join query.

You can join two datasets using the join operators with an optional join condition.

Table 1. Join Operators
Operator | Return Type | Description
crossJoin | DataFrame | Untyped Row-based cross join
join | DataFrame | Untyped Row-based join
joinWith | Dataset | Used for a type-preserving join with two output columns for records for which a join condition holds

You can also use SQL mode to join datasets using good ol’ SQL.

You can specify a join condition (aka join expression) as part of join operators or using where or filter operators.

You can specify the join type as part of join operators (using joinType optional parameter).
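A small sketch of both styles (the left and right DataFrames are made up; assumes spark-shell):

  import spark.implicits._
  val left  = Seq((0, "zero"), (1, "one")).toDF("id", "leftValue")
  val right = Seq((0, "zero"), (2, "two")).toDF("id", "rightValue")

  // Join condition and join type given to the join operator itself
  left.join(right, left("id") === right("id"), "left_outer")

  // Join condition expressed with a follow-up where (pushed into the join by the optimizer)
  left.join(right).where(left("id") === right("id"))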

Table 2. Join Types
SQL | Name (joinType) | JoinType
CROSS | cross | Cross
INNER | inner | Inner
FULL OUTER | outer, full, fullouter | FullOuter
LEFT ANTI | leftanti | LeftAnti
LEFT OUTER | leftouter, left | LeftOuter
LEFT SEMI | leftsemi | LeftSemi
RIGHT OUTER | rightouter, right | RightOuter
NATURAL | (special case for Inner, LeftOuter, RightOuter, FullOuter) | NaturalJoin
USING | (special case for Inner, LeftOuter, LeftSemi, RightOuter, FullOuter, LeftAnti) | UsingJoin

ExistenceJoin is an artificial join type used to express an existential sub-query, which is often referred to as an existential join.

Note
LeftAnti and ExistenceJoin are special cases of LeftOuter.

You can also find that Spark SQL uses two families of joins internally: InnerLike (with Inner and Cross) and LeftExistence (with LeftSemi, LeftAnti and ExistenceJoin).

Tip
Names are case-insensitive and can use the underscore (_) at any position, i.e. left_anti and LEFT_ANTI are equivalent.
Note
Spark SQL offers different join strategies with Broadcast Joins (aka Map-Side Joins) among them that are supposed to optimize your join queries over large distributed datasets.

join Operators

The join operator comes in several variants:

  1. Condition-less inner join

  2. Inner join with a single column that exists on both sides

  3. Inner join with columns that exist on both sides

  4. Equi-join with explicit join type

  5. Inner join

  6. Join with explicit join type. Self-joins are acceptable.

join joins two Datasets.
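A sketch of a few of those variants (Person and City are hypothetical case classes; assumes spark-shell):

  import spark.implicits._
  case class Person(id: Long, name: String, cityId: Long)
  case class City(id: Long, name: String)

  val people = Seq(Person(0, "Agata", 0), Person(1, "Iweta", 1)).toDS
  val cities = Seq(City(0, "Warsaw"), City(1, "Washington")).toDS

  // Inner join with a single column that exists on both sides
  people.join(cities, "id")

  // Equi-join with an explicit join type
  people.join(cities, Seq("id"), "left_outer")

  // Join condition as a Column expression with an explicit join type
  people.join(cities, people("cityId") === cities("id"), "inner")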

Internally, join(right: Dataset[_]) creates a DataFrame with a condition-less Join logical operator (in the current SparkSession).

Note
join(right: Dataset[_]) creates a logical plan with a condition-less Join operator with two child logical plans, one for each side of the join.
Note
join(right: Dataset[_], usingColumns: Seq[String], joinType: String) creates a logical plan with a condition-less Join operator with UsingJoin join type.
Note

join(right: Dataset[_], joinExprs: Column, joinType: String) accepts self-joins where joinExprs compares a column of a Dataset with the very same column (e.g. df("id") === df("id")).

Such a condition is usually considered trivially true and would be refused.

With the spark.sql.selfJoinAutoResolveAmbiguity option enabled (which it is by default), join automatically resolves such ambiguous join conditions into ones that make sense.

crossJoin Method

crossJoin joins two Datasets using Cross join type with no condition.

Note
crossJoin creates an explicit cartesian join that can be very expensive without an extra filter (that can be pushed down).
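A deliberately tiny sketch (made-up data; assumes spark-shell):

  import spark.implicits._
  val letters = Seq("a", "b").toDF("letter")
  val numbers = Seq(1, 2).toDF("number")

  // Explicit cartesian product: 2 x 2 = 4 rows
  letters.crossJoin(numbers).show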

Type-Preserving Joins — joinWith Operators

joinWith comes in two variants: one that takes an explicit join type and one that defaults to an inner equi-join.

joinWith creates a Dataset with two columns _1 and _2 that each contain records for which the join condition holds.

Note
joinWith preserves type-safety with the original object types.
Note
joinWith creates a Dataset with Join logical plan.
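A minimal sketch (Order and Customer are made-up case classes; assumes spark-shell):

  import org.apache.spark.sql.Dataset
  import spark.implicits._
  case class Order(id: Long, customerId: Long)
  case class Customer(id: Long, name: String)

  val orders = Seq(Order(10, 1), Order(11, 2)).toDS
  val customers = Seq(Customer(1, "Ala"), Customer(2, "Ola")).toDS

  // The result is a Dataset[(Order, Customer)], not an untyped DataFrame
  val joined: Dataset[(Order, Customer)] =
    orders.joinWith(customers, orders("customerId") === customers("id"), "inner")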

KeyValueGroupedDataset — Typed Grouping

KeyValueGroupedDataset is an experimental interface to calculate aggregates over groups of objects in a typed Dataset.

Note
RelationalGroupedDataset is used for untyped Row-based aggregates.

KeyValueGroupedDataset is created using Dataset.groupByKey operator.
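A minimal sketch of creating and using a KeyValueGroupedDataset (made-up data; assumes spark-shell):

  import spark.implicits._
  val words = Seq("hello", "world", "hello", "spark").toDS

  // groupByKey gives a KeyValueGroupedDataset[String, String]
  val grouped = words.groupByKey(identity)

  // Typed aggregations over the groups
  grouped.count().show                                                // Dataset[(String, Long)]
  grouped.mapGroups { (word, occurrences) => (word, occurrences.size) }.show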

Table 1. KeyValueGroupedDataset’s Aggregate Operators (KeyValueGroupedDataset API)
Operator Description

agg

cogroup

count

flatMapGroups

flatMapGroupsWithState

keys

keyAs

mapGroups

mapGroupsWithState

mapValues

reduceGroups

KeyValueGroupedDataset holds the keys that were used to group the objects.

aggUntyped Internal Method

aggUntyped…​FIXME

Note
aggUntyped is used exclusively when KeyValueGroupedDataset.agg typed operator is used.

logicalPlan Internal Method

logicalPlan…​FIXME

Note
logicalPlan is used when…​FIXME

RelationalGroupedDataset — Untyped Row-based Grouping

RelationalGroupedDataset is an interface to calculate aggregates over groups of rows in a DataFrame.

Note
KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).

RelationalGroupedDataset is the result of executing the following grouping operators: groupBy, rollup, cube and pivot.

Table 1. RelationalGroupedDataset’s Aggregate Operators
Operator Description

agg

avg

count

max

mean

min

pivot

Pivots on a column, with new columns per distinct value in the pivot column (one of its variants is new in 2.4.0)

sum

Note

spark.sql.retainGroupColumns configuration property controls whether to retain the columns used for grouping in the result (in RelationalGroupedDataset operators).

spark.sql.retainGroupColumns is enabled by default.

Computing Aggregates Using Aggregate Column Expressions or Function Names — agg Operator

agg creates a DataFrame with the rows being the result of executing grouping expressions (specified using columns or names) over row groups.

Note
You can use untyped or typed column expressions.

Internally, agg creates a DataFrame with Aggregate or Pivot logical operators.
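A small sketch of both flavours (made-up sales data; assumes spark-shell):

  import org.apache.spark.sql.functions.{avg, max}
  import spark.implicits._

  val sales = Seq(("books", 10.0), ("books", 20.0), ("toys", 5.0)).toDF("category", "amount")

  // Aggregate column expressions...
  sales.groupBy("category").agg(max("amount"), avg("amount"))

  // ...or aggregation functions specified by name
  sales.groupBy("category").agg("amount" -> "max", "amount" -> "avg")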

Creating DataFrame from Aggregate Expressions — toDF Internal Method

Caution
FIXME

Internally, toDF branches off per group type.

Caution
FIXME

For PivotType, toDF creates a DataFrame with Pivot unary logical operator.

Note

toDF is used when the following RelationalGroupedDataset operators are used:

aggregateNumericColumns Internal Method

aggregateNumericColumns…​FIXME

Note
aggregateNumericColumns is used when the following RelationalGroupedDataset operators are used: mean, max, avg, min and sum.

Creating RelationalGroupedDataset Instance

RelationalGroupedDataset takes the following when created:

  • DataFrame

  • Grouping expressions

  • Group type (to indicate the “source” operator)

    • GroupByType for groupBy

    • CubeType

    • RollupType

    • PivotType

pivot Operator

  1. Selects distinct and sorted values on pivotColumn and calls the other pivot (that results in 3 extra “scanning” jobs)

  2. Preferred as more efficient because the unique values are already provided

  3. New in 2.4.0

pivot pivots on a pivotColumn column, i.e. adds new columns per distinct value in pivotColumn.

Note
pivot is only supported after groupBy operation.
Note
Only one pivot operation is supported on a RelationalGroupedDataset.

Important
Use pivot with a list of distinct values to pivot on so Spark does not have to compute the list itself (and run three extra “scanning” jobs).
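A minimal sketch of both approaches (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.sum
  import spark.implicits._

  val sales = Seq(("2018", "books", 10), ("2018", "toys", 5), ("2019", "books", 7))
    .toDF("year", "category", "amount")

  // Spark computes the distinct pivot values itself (extra "scanning" jobs)
  sales.groupBy("year").pivot("category").agg(sum("amount"))

  // Distinct pivot values provided up front (preferred)
  sales.groupBy("year").pivot("category", Seq("books", "toys")).agg(sum("amount"))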
Figure 1. pivot in web UI (Distinct Values Defined Explicitly)
Figure 2. pivot in web UI — Three Extra Scanning Jobs Due to Unspecified Distinct Values
Note
spark.sql.pivotMaxValues (default: 10000) controls the maximum number of (distinct) values that will be collected without error (when doing pivot without specifying the values for the pivot column).

Internally, pivot creates a RelationalGroupedDataset with PivotType group type and pivotColumn resolved using the DataFrame’s columns with values as Literal expressions.

Note

toDF internal method maps PivotType group type to a DataFrame with Pivot unary logical operator.

strToExpr Internal Method

strToExpr…​FIXME

Note
strToExpr is used exclusively when RelationalGroupedDataset is requested to agg with aggregation functions specified by name

alias Method

alias…​FIXME

Note
alias is used exclusively when RelationalGroupedDataset is requested to create a DataFrame from aggregate expressions.

Basic Aggregation — Typed and Untyped Grouping Operators

You can calculate aggregates over a group of rows in a Dataset using aggregate operators (possibly with aggregate functions).

Table 1. Aggregate Operators
Operator | Return Type | Description
agg | RelationalGroupedDataset | Aggregates with or without grouping (i.e. over an entire Dataset)
groupBy | RelationalGroupedDataset | Used for untyped aggregates using DataFrames; grouping is described using column expressions or column names
groupByKey | KeyValueGroupedDataset | Used for typed aggregates using Datasets with records grouped by a key-defining discriminator function

Note
Aggregate functions without aggregate operators return a single value. If you want to find the aggregate values for each unique value (in a column), you should groupBy first (over this column) to build the groups.
Note

You can also use SparkSession to execute good ol’ SQL with GROUP BY should you prefer.

SQL and Dataset API operators go through the same query planning and optimizations, and have the same performance characteristics in the end.

Aggregates Over Subset Of or Whole Dataset — agg Operator

agg applies an aggregate function on a subset or the entire Dataset (i.e. considering the entire data set as one group).

Note
agg on a Dataset is simply a shortcut for groupBy().agg(…​).

agg can compute aggregate expressions on all the records in a Dataset.
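A minimal sketch (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.{count, sum}
  import spark.implicits._

  val nums = Seq(1, 2, 3, 4).toDF("n")

  // Aggregate over the entire Dataset (a single group)
  nums.agg(sum("n"), count("n"))

  // Equivalent to grouping with no columns
  nums.groupBy().agg(sum("n"), count("n"))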

Untyped Grouping — groupBy Operator

groupBy operator groups the rows in a Dataset by columns (as Column expressions or names).

groupBy gives a RelationalGroupedDataset to execute aggregate functions or operators.

Internally, groupBy resolves column names (possibly quoted) and creates a RelationalGroupedDataset (with groupType being GroupByType).
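A small sketch of grouping by a column name and by a Column expression (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.avg
  import spark.implicits._

  val scores = Seq(("alice", 10), ("alice", 20), ("bob", 5)).toDF("name", "score")

  // Grouping by column name...
  scores.groupBy("name").avg("score")

  // ...or by a Column expression
  scores.groupBy($"name").agg(avg("score"))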

Note
The Test Setup section below describes a dataset you can use to experiment with groupBy and the other grouping operators.

Typed Grouping — groupByKey Operator

groupByKey groups records (of type T) by the input func and in the end returns a KeyValueGroupedDataset to apply aggregation to.

Note
groupByKey is Dataset‘s experimental API.

Test Setup

This is a setup for learning GroupedData. Paste it into Spark Shell using :paste.
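The original snippet is not reproduced here; a minimal sketch of an equivalent setup (the Token case class and its values are made up) could look like this:

  import spark.implicits._

  case class Token(name: String, productId: Int, score: Double)
  val data = Seq(
    Token("aaa", 100, 0.12),
    Token("aaa", 200, 0.29),
    Token("bbb", 200, 0.53),
    Token("bbb", 300, 0.42))
  val dataset = data.toDS.cache   // (1)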

  1. Cache the dataset so the following queries won’t load/recompute data over and over again.

TypedColumn

TypedColumn is a Column with the ExpressionEncoder for the types of the input and the output.

TypedColumn is created using as operator on a Column.
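A minimal sketch (Item is a made-up case class; assumes spark-shell):

  import org.apache.spark.sql.functions.sum
  import spark.implicits._

  case class Item(name: String, price: Long)
  val items = Seq(Item("pen", 2), Item("pen", 3), Item("book", 10)).toDS

  // as[Long] turns the untyped sum Column into a TypedColumn
  val typedSum = sum($"price").as[Long]

  // TypedColumn works with typed operators, e.g. KeyValueGroupedDataset.agg
  items.groupByKey(_.name).agg(typedSum).show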

name Operator

Note
name is part of Column Contract to…​FIXME.

name…​FIXME

Note
name is used when…​FIXME

Creating TypedColumn — withInputType Internal Method

withInputType…​FIXME

Note

withInputType is used when the following typed operators are used:

Creating TypedColumn Instance

TypedColumn takes the following when created: a Catalyst expression and an ExpressionEncoder for the type of the output value.

TypedColumn initializes the internal registries and counters.

Column API — Column Operators

Column API is a set of operators to work with values in a column (of a Dataset).

Table 1. Column Operators
Operator Description

asc

asc_nulls_first

asc_nulls_last

desc

desc_nulls_first

desc_nulls_last

isin

isInCollection

(New in 2.4.0) An expression operator that is true if the value of the column is in the given values collection

isInCollection is simply a synonym of isin operator.

isin Operator

Internally, isin creates a Column with In predicate expression.
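A small sketch (made-up data; assumes spark-shell):

  import spark.implicits._
  val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")

  // Keep rows whose id is one of the given values
  df.where($"id".isin(1, 3))

  // isInCollection (2.4.0+) takes a collection instead of varargs
  df.where($"id".isInCollection(Seq(1, 3)))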

Column

Column represents a column in a Dataset that holds a Catalyst Expression that produces a value per row.

Note
A Column is a value generator for every row in a Dataset.

A special column * references all columns in a Dataset.

With the implicit conversions imported, you can create “free” column references using Scala’s symbols.

Note
“Free” column references are Columns with no association to a Dataset.

You can also create free column references from $-prefixed strings.

Besides the implicit conversions, you can create columns using the col and column functions.

Finally, you can create a bound Column using the Dataset the column is supposed to be part of, with the Dataset.apply factory method or the Dataset.col operator.

Note
You can use bound Column references only with the Datasets they have been created from.

You can reference nested columns using . (dot).
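A minimal sketch of the different ways to create column references (made-up data; assumes spark-shell with the implicit conversions imported):

  import org.apache.spark.sql.functions.{col, column}
  import spark.implicits._

  val people = Seq(("Ala", 10)).toDF("name", "age")

  // "Free" column references: col/column functions, Scala symbols, $-prefixed strings
  people.select(col("name"), column("age"))
  people.select('name, $"age")

  // Bound column references created from the Dataset itself
  people.select(people("name"), people.col("age"))

  // Nested fields use a dot, e.g. $"address.city" for a struct column named address (hypothetical)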

Table 1. Column Operators
Operator Description

as

Specifying type hint about the expected return value of the column

name

Note

Column has a reference to Catalyst’s Expression it was created for using expr method.

Tip
Read about typed column references in TypedColumn Expressions.

Specifying Type Hint — as Operator

as creates a TypedColumn (that gives a type hint about the expected return value of the column).

name Operator

name…​FIXME

Note
name is used when…​FIXME

Adding Column to Dataset — withColumn Method

withColumn method returns a new DataFrame with a new column added, named colName and defined by the col expression.

Note
withColumn can replace an existing colName column.

You can add new columns to a Dataset using withColumn method.
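A minimal sketch (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.lit
  import spark.implicits._

  val people = Seq(("Ala", 10)).toDF("name", "age")

  // Add a new column...
  people.withColumn("country", lit("PL"))

  // ...or replace an existing one with the same name
  people.withColumn("age", $"age" + 1)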

Creating Column Instance For Catalyst Expression — apply Factory Method

like Operator

Caution
FIXME

Symbols As Column Names

Defining Windowing Column (Analytic Clause) — over Operator

over creates a windowing column (aka analytic clause) that allows you to execute an aggregate function over a window (i.e. a group of records that are in some relation to the current record).

Tip
Read up on windowed aggregation in Spark SQL in Window Aggregate Functions.
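A minimal sketch of a windowing column (made-up data; assumes spark-shell):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.rank
  import spark.implicits._

  val salaries = Seq(("eng", "Ala", 100), ("eng", "Ola", 90), ("hr", "Ula", 80))
    .toDF("dept", "name", "salary")

  // A window of rows sharing the same dept, ordered by salary (descending)
  val byDept = Window.partitionBy("dept").orderBy($"salary".desc)

  // rank().over(byDept) is a windowing column
  salaries.withColumn("rank", rank().over(byDept)).show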

cast Operator

cast method casts a column to a data type. It makes for type-safe maps with Row objects of the proper type (not Any).

cast uses CatalystSqlParser to parse the data type from its canonical string representation.

cast Example
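The original example is not included here; a minimal sketch (made-up data; assumes spark-shell) could be:

  import org.apache.spark.sql.types.IntegerType
  import spark.implicits._

  val df = Seq("1", "2", "three").toDF("id")

  // Cast using a DataType...
  df.select($"id".cast(IntegerType))

  // ...or its canonical string representation (unparseable values become null)
  df.select($"id".cast("int")).show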

generateAlias Method

generateAlias…​FIXME

Note

generateAlias is used when:

  • Column is requested to named

  • RelationalGroupedDataset is requested to alias

named Method

named…​FIXME

Note

named is used when the following operators are used:

DataFrameStatFunctions — Working With Statistic Functions

DataFrameStatFunctions is used to work with statistic functions in a structured query (a DataFrame).

Table 1. DataFrameStatFunctions API
Method Description

approxQuantile

bloomFilter

corr

countMinSketch

cov

crosstab

freqItems

sampleBy

DataFrameStatFunctions is available using stat untyped transformation.
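A small sketch of accessing DataFrameStatFunctions through stat (made-up data; assumes spark-shell):

  import spark.implicits._

  val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")

  // Approximate median (0.5 quantile) with 1% relative error
  val Array(median) = df.stat.approxQuantile("value", Array(0.5), 0.01)

  // Pearson correlation between two columns
  val df2 = df.withColumn("twice", $"value" * 2)
  df2.stat.corr("value", "twice")   // 1.0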

approxQuantile Method

approxQuantile…​FIXME

bloomFilter Method

bloomFilter…​FIXME

buildBloomFilter Internal Method

buildBloomFilter…​FIXME

Note
convertToDouble is used when…​FIXME

corr Method

corr…​FIXME

countMinSketch Method

countMinSketch…​FIXME

cov Method

cov…​FIXME

crosstab Method

crosstab…​FIXME

freqItems Method

freqItems…​FIXME

sampleBy Method

sampleBy…​FIXME
