关注 spark技术分享,
撸spark源码 玩spark最佳实践

Multi-Dimensional Aggregation

Multi-Dimensional Aggregation

Multi-dimensional aggregate operators are enhanced variants of groupBy operator that allow you to create queries for subtotals, grand totals and superset of subtotals in one go.

Multi-dimensional aggregate operators are semantically equivalent to union operator (or SQL’s UNION ALL) to combine single grouping queries.

Note

It is assumed that using one of the operators is usually more efficient (than union and groupBy) as it gives more freedom for query optimization.

Table 1. Multi-dimensional Aggregate Operators
Operator Return Type Description

cube

RelationalGroupedDataset

Calculates subtotals and a grand total for every permutation of the columns specified.

rollup

RelationalGroupedDataset

Calculates subtotals and a grand total over (ordered) combination of groups.

Beside cube and rollup multi-dimensional aggregate operators, Spark SQL supports GROUPING SETS clause in SQL mode only.

Note
SQL’s GROUPING SETS is the most general aggregate “operator” and can generate the same dataset as using a simple groupBy, cube and rollup operators.

Tip
Review the examples per operator in the following sections.
Note
Support for multi-dimensional aggregate operators was added in [SPARK-6356] Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext.

rollup Operator

rollup multi-dimensional aggregate operator is an extension of groupBy operator that calculates subtotals and a grand total across specified group of n + 1 dimensions (with n being the number of columns as cols and col1 and 1 for where values become null, i.e. undefined).

Note

rollup operator is commonly used for analysis over hierarchical data; e.g. total salary by department, division, and company-wide total.

Note
rollup operator is equivalent to GROUP BY ... WITH ROLLUP in SQL (which in turn is equivalent to GROUP BY ... GROUPING SETS ((a,b,c),(a,b),(a),()) when used with 3 columns: a, b, and c).

The ROLLUP, CUBE, and GROUPING SETS operators are extensions of the GROUP BY clause. The ROLLUP, CUBE, or GROUPING SETS operators can generate the same result set as when you use UNION ALL to combine single grouping queries; however, using one of the GROUP BY operators is usually more efficient.

References to the grouping columns or expressions are replaced by null values in result rows for grouping sets in which those columns do not appear.

From Summarizing Data Using ROLLUP in Microsoft’s TechNet:

The ROLLUP operator is useful in generating reports that contain subtotals and totals. (…​)
ROLLUP generates a result set that shows aggregates for a hierarchy of values in the selected columns.

From Hive’s Cubes and Rollups:

WITH ROLLUP is used with the GROUP BY only. ROLLUP clause is used with GROUP BY to compute the aggregate at the hierarchy levels of a dimension.

GROUP BY a, b, c with ROLLUP assumes that the hierarchy is “a” drilling down to “b” drilling down to “c”.

GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

Note
Read up on ROLLUP in Hive’s LanguageManual in Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function.

The individual elements of a CUBE or ROLLUP clause may be either individual expressions, or sublists of elements in parentheses. In the latter case, the sublists are treated as single units for the purposes of generating the individual grouping sets.

Internally, rollup converts the Dataset into a DataFrame (i.e. uses RowEncoder as the encoder) and then creates a RelationalGroupedDataset (with RollupType group type).

Note
Rollup expression represents GROUP BY ... WITH ROLLUP in SQL in Spark’s Catalyst Expression tree (after AstBuilder parses a structured query with aggregation).

cube Operator

cube multi-dimensional aggregate operator is an extension of groupBy operator that allows calculating subtotals and a grand total across all combinations of specified group of n + 1 dimensions (with n being the number of columns as cols and col1 and 1 for where values become null, i.e. undefined).

cube returns RelationalGroupedDataset that you can use to execute aggregate function or operator.

Note
cube is more than rollup operator, i.e. cube does rollup with aggregation over all the missing combinations given the columns.

GROUPING SETS SQL Clause

GROUPING SETS clause generates a dataset that is equivalent to union operator of multiple groupBy operators.

Internally, GROUPING SETS clause is parsed in withAggregation parsing handler (in AstBuilder) and becomes a GroupingSets logical operator internally.

Rollup GroupingSet with CodegenFallback Expression (for rollup Operator)

Rollup expression represents rollup operator in Spark’s Catalyst Expression tree (after AstBuilder parses a structured query with aggregation).

Note
GroupingSet is an Expression with CodegenFallback support.
赞(0) 打赏
未经允许不得转载:spark技术分享 » Multi-Dimensional Aggregation
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏