spark技术分享 (Spark tech sharing) — digging into Spark source code and Spark best practices

RelationalGroupedDataset — Untyped Row-based Grouping


RelationalGroupedDataset is an interface to calculate aggregates over groups of rows in a DataFrame.

Note
KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).
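A minimal sketch of the distinction, assuming a spark-shell session (a `SparkSession` named `spark` with `spark.implicits._` imported); the column names are illustrative only:

```scala
// Assumes spark-shell: SparkSession `spark`, spark.implicits._ imported
import org.apache.spark.sql.RelationalGroupedDataset

val sales = Seq(("apple", 10), ("apple", 5), ("banana", 7)).toDF("fruit", "qty")

// groupBy gives back a RelationalGroupedDataset, not a DataFrame...
val grouped: RelationalGroupedDataset = sales.groupBy("fruit")

// ...a DataFrame materializes only once an aggregate operator is applied
grouped.sum("qty").show()
```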

RelationalGroupedDataset is a result of executing the following grouping operators: groupBy, rollup, cube (and pivot on an existing RelationalGroupedDataset).

Table 1. RelationalGroupedDataset’s Aggregate Operators

  • agg

  • avg

  • count

  • max

  • mean

  • min

  • pivot — pivots on a column (with new columns per distinct value); the Column-based variants are new in 2.4.0

  • sum

Note

spark.sql.retainGroupColumns configuration property controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators).

spark.sql.retainGroupColumns is enabled by default.
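A quick way to see the effect of the property, sketched for a spark-shell session (`spark` and implicits assumed; the sample data is made up):

```scala
// Assumes spark-shell: SparkSession `spark`, spark.implicits._ imported
spark.conf.get("spark.sql.retainGroupColumns")  // "true" by default

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
df.groupBy("k").sum("v").show()  // result columns: k, sum(v)

// With the property disabled, the grouping column is dropped from the result
spark.conf.set("spark.sql.retainGroupColumns", false)
df.groupBy("k").sum("v").show()  // result column: sum(v) only
spark.conf.set("spark.sql.retainGroupColumns", true)
```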

Computing Aggregates Using Aggregate Column Expressions or Function Names — agg Operator

agg creates a DataFrame with the rows being the result of executing aggregate expressions (specified using columns or function names) over row groups.

Note
You can use untyped or typed column expressions.

Internally, agg creates a DataFrame with Aggregate or Pivot logical operators.
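Both flavors can be sketched as follows, assuming a spark-shell session (sample data and column names are illustrative):

```scala
// Assumes spark-shell: SparkSession `spark`, spark.implicits._ imported
import org.apache.spark.sql.functions.{avg, max}

val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("key", "value")

// Aggregates as Column expressions...
df.groupBy("key").agg(avg("value"), max("value")).show()

// ...or as (columnName -> functionName) pairs, i.e. functions specified by name
df.groupBy("key").agg("value" -> "avg", "value" -> "max").show()
```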

Creating DataFrame from Aggregate Expressions — toDF Internal Method

Caution
FIXME

Internally, toDF branches off per group type.

Caution
FIXME

For PivotType, toDF creates a DataFrame with Pivot unary logical operator.

Note

toDF is used when the following RelationalGroupedDataset operators are used: agg, count and the numeric aggregate operators (via aggregateNumericColumns).

aggregateNumericColumns Internal Method

aggregateNumericColumns…​FIXME

Note
aggregateNumericColumns is used when the following RelationalGroupedDataset operators are used: mean, max, avg, min and sum.

Creating RelationalGroupedDataset Instance

RelationalGroupedDataset takes the following when created:

  • DataFrame

  • Grouping expressions

  • Group type (to indicate the “source” operator)

    • GroupByType for groupBy

    • CubeType for cube

    • RollupType for rollup

    • PivotType for pivot
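The mapping from operator to group type can be sketched in a spark-shell session (sample data is illustrative):

```scala
// Assumes spark-shell: SparkSession `spark`, spark.implicits._ imported
// Each grouping operator tags the resulting RelationalGroupedDataset
// with its group type
val df = Seq(("a", "x", 1), ("a", "y", 2)).toDF("c1", "c2", "v")

df.groupBy("c1", "c2").count().show()        // GroupByType
df.cube("c1", "c2").count().show()           // CubeType: all grouping-set combinations
df.rollup("c1", "c2").count().show()         // RollupType: hierarchical subtotals
df.groupBy("c1").pivot("c2").count().show()  // PivotType
```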

pivot Operator

pivot comes in several variants:

  1. pivot(pivotColumn: String) selects the distinct and sorted values of the pivot column itself and calls the other pivot (which results in 3 extra “scanning” jobs)

  2. pivot(pivotColumn: String, values: Seq[Any]) is preferred as more efficient because the unique values are already provided

  3. The variants that take a Column (rather than a String) are new in 2.4.0

pivot pivots on a pivotColumn column, i.e. adds a new column per distinct value in pivotColumn.

Note
pivot is only supported after groupBy operation.
Note
Only one pivot operation is supported on a RelationalGroupedDataset.

Important
Use pivot with a list of distinct values to pivot on so Spark does not have to compute the list itself (and run three extra “scanning” jobs).
[Figure 1. pivot in web UI (Distinct Values Defined Explicitly)]
[Figure 2. pivot in web UI — Three Extra Scanning Jobs Due to Unspecified Distinct Values]
Note
spark.sql.pivotMaxValues (default: 10000) controls the maximum number of (distinct) values that will be collected without error (when doing pivot without specifying the values for the pivot column).

Internally, pivot creates a RelationalGroupedDataset with PivotType group type and pivotColumn resolved using the DataFrame’s columns with values as Literal expressions.
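A sketch of both variants, assuming a spark-shell session (the revenue data is made up for illustration):

```scala
// Assumes spark-shell: SparkSession `spark`, spark.implicits._ imported
val quarterly = Seq(
  (2023, "Q1", 100), (2023, "Q2", 150),
  (2024, "Q1", 120), (2024, "Q2", 180)
).toDF("year", "quarter", "revenue")

// Preferred: distinct values given up front, so Spark skips the extra jobs
quarterly
  .groupBy("year")
  .pivot("quarter", Seq("Q1", "Q2"))
  .sum("revenue")
  .show()
// result columns: year, Q1, Q2

// Without values, Spark first runs jobs to collect the distinct quarters
quarterly.groupBy("year").pivot("quarter").sum("revenue").show()
```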

Note

toDF internal method maps PivotType group type to a DataFrame with Pivot unary logical operator.

strToExpr Internal Method

strToExpr…​FIXME

Note
strToExpr is used exclusively when RelationalGroupedDataset is requested to agg with aggregation functions specified by name.

alias Method

alias…​FIXME

Note
alias is used exclusively when RelationalGroupedDataset is requested to create a DataFrame from aggregate expressions.