
Basic Aggregation — Typed and Untyped Grouping Operators


You can calculate aggregates over a group of rows in a Dataset using aggregate operators (possibly with aggregate functions).

Table 1. Aggregate Operators

| Operator | Return Type | Description |
|---|---|---|
| agg | RelationalGroupedDataset | Aggregates with or without grouping (i.e. over an entire Dataset) |
| groupBy | RelationalGroupedDataset | Untyped aggregates over DataFrames; grouping is described using column expressions or column names |
| groupByKey | KeyValueGroupedDataset | Typed aggregates over Datasets, with records grouped by a key-defining discriminator function |

Note
An aggregate function applied without a grouping operator returns a single value for the whole Dataset. To compute aggregates per unique value in a column, groupBy over that column first to build the groups.
Note

You can also use SparkSession to execute good ol' SQL with GROUP BY should you prefer.

SQL and the Dataset API's operators go through the same query planning and optimization, and end up with the same performance characteristics.
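To illustrate the equivalence, here is a sketch that runs the same aggregation both ways. It assumes a running spark-shell session (so `spark` and `spark.implicits._` are in scope); the view name and data are illustrative, not from the original text.

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("key", "value")
sales.createOrReplaceTempView("sales")

// SQL with GROUP BY...
spark.sql("SELECT key, sum(value) AS total FROM sales GROUP BY key").show

// ...and the equivalent Dataset API call go through the same query planning
sales.groupBy("key").agg(sum("value") as "total").show
```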

Aggregates Over Subset Of or Whole Dataset — agg Operator

agg applies an aggregate function over a subset of, or the whole of, a Dataset (i.e. treating the entire Dataset as a single group).

Note
agg on a Dataset is simply a shortcut for groupBy().agg(…​).

agg can compute aggregate expressions on all the records in a Dataset.
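A minimal sketch of agg over an entire Dataset, assuming a spark-shell session (`spark` and `spark.implicits._` in scope) and an illustrative dataset:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("key", "value")

// agg over the whole Dataset -- one group, one row in the result
sales.agg(sum("value") as "total").show

// ...which is simply a shortcut for
sales.groupBy().agg(sum("value") as "total").show
```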

Untyped Grouping — groupBy Operator

The groupBy operator groups the rows in a Dataset by columns (specified as Column expressions or column names).

groupBy returns a RelationalGroupedDataset on which you can then apply aggregate functions or operators.

Internally, groupBy resolves column names (possibly quoted) and creates a RelationalGroupedDataset (with groupType being GroupByType).

Note
The following uses the data setup as described in Test Setup section below.
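A sketch of groupBy in action, assuming a spark-shell session; a small inline dataset is used here so the snippet stands alone (names are illustrative):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("key", "value")

// Grouping by a column name...
sales.groupBy("key").sum("value").show

// ...or by a Column expression
sales.groupBy($"key").agg(sum($"value") as "total").show
```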

Typed Grouping — groupByKey Operator

groupByKey groups records (of type T) by the key returned by the given func and returns a KeyValueGroupedDataset to apply aggregations to.

Note
groupByKey is part of Dataset's experimental API.
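A sketch of groupByKey with a typed Dataset, assuming a spark-shell session; the Token case class and values are illustrative and are redefined here so the snippet stands alone:

```scala
import spark.implicits._

case class Token(name: String, productId: Int, score: Double)
val tokens = Seq(
  Token("aaa", 100, 0.12),
  Token("aaa", 200, 0.29),
  Token("bbb", 200, 0.53)).toDS

// The discriminator function defines the key (here: the record's name)
tokens.groupByKey(_.name).count.show
```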

Test Setup

This is the setup for learning RelationalGroupedDataset. Paste it into the Spark Shell using :paste.

  1. Cache the dataset so the following queries won’t load/recompute data over and over again.
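The setup code block itself did not survive extraction; below is a minimal sketch consistent with the numbered callout above (the Token schema and values are illustrative assumptions):

```scala
// Paste into spark-shell with :paste
import spark.implicits._

case class Token(name: String, productId: Int, score: Double)
val data = Seq(
  Token("aaa", 100, 0.12),
  Token("aaa", 200, 0.29),
  Token("bbb", 200, 0.53),
  Token("bbb", 300, 0.42))
val tokens = data.toDS.cache  // (1) cache so later queries don't load/recompute the data
```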
