
UserDefinedFunction



UserDefinedFunction represents a user-defined function.

UserDefinedFunction is created when the udf function is executed or when UDFRegistration is requested to register a Scala function as a user-defined function.

UserDefinedFunction can also have a name.

UserDefinedFunction is nullable by default, but can be marked as non-nullable (using asNonNullable).

Executing UserDefinedFunction (Creating Column with ScalaUDF Expression) — apply Method

apply creates a Column with ScalaUDF expression.

Note
apply is used when…​FIXME

Marking UserDefinedFunction as NonNullable — asNonNullable Method

asNonNullable…​FIXME

Note
asNonNullable is used when…​FIXME

Naming UserDefinedFunction — withName Method

withName…​FIXME

Note
withName is used when…​FIXME

Creating UserDefinedFunction Instance

UserDefinedFunction takes the following when created: the underlying Scala function (f), the output data type, and the input data types (if available).

UserDefinedFunction initializes the internal registries and counters.

User-Defined Functions (UDFs)


UDFs — User-Defined Functions

User-defined functions (aka UDFs) are a feature of Spark SQL for defining new Column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets.

Important

Use the higher-level standard Column-based functions (with Dataset operators) whenever possible before resorting to user-defined functions, since UDFs are a black box for Spark SQL and it cannot (and does not even try to) optimize them.

As Reynold Xin from the Apache Spark project has once said on Spark’s dev mailing list:

There are simple cases in which we can analyze the UDFs byte code and infer what it is doing, but it is pretty difficult to do in general.

You define a new UDF by passing a Scala function to the udf function. It accepts Scala functions of up to 10 input parameters.

You can register UDFs to use in SQL-based query expressions via UDFRegistration (that is available through SparkSession.udf attribute).
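
For illustration, here is a minimal sketch you could run in spark-shell (so a SparkSession is available as spark); the sample data and the toUpper name are made up:

  import org.apache.spark.sql.functions.udf
  import spark.implicits._

  val df = Seq("hello", "spark").toDF("word")

  // Column-based UDF created with the udf function
  val toUpper = udf { s: String => s.toUpperCase }
  df.select(toUpper($"word") as "upper").show

  // Register the same logic for SQL-based queries via SparkSession.udf (UDFRegistration)
  spark.udf.register("toUpper", (s: String) => s.toUpperCase)
  df.createOrReplaceTempView("words")
  spark.sql("SELECT word, toUpper(word) AS upper FROM words").show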

You can query for available standard and user-defined functions using the Catalog interface (that is available through SparkSession.catalog attribute).

Note
UDFs play a vital role in Spark MLlib to define new Transformers that are function objects that transform DataFrames into DataFrames by introducing new columns.

udf Functions (in functions object)

org.apache.spark.sql.functions object comes with udf function to let you define a UDF for a Scala function f.

Tip
Define custom UDFs based on “standalone” Scala functions (e.g. toUpperUDF) so you can test the Scala functions the Scala way (without Spark SQL’s “noise”) and, once they are defined, reuse the UDFs in UnaryTransformers.
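
A sketch of the pattern the tip describes (the toUpperFn and toUpperUDF names are illustrative only):

  import org.apache.spark.sql.functions.udf

  // "Standalone" Scala function that can be unit-tested without Spark SQL
  val toUpperFn: String => String = _.toUpperCase

  // Plain Scala test, no Spark involved
  assert(toUpperFn("hello") == "HELLO")

  // Reuse the very same function as a UDF
  val toUpperUDF = udf(toUpperFn)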

Window Aggregation Functions


Standard Functions for Window Aggregation (Window Functions)

Window aggregate functions (aka window functions or windowed aggregates) are functions that perform a calculation over a group of records called a window that are in some relation to the current record (i.e. can be in the same partition or frame as the current row).

In other words, when executed, a window function computes a value for each and every row in a window (per window specification).

Note
Window functions are also called over functions due to how they are applied using the over operator.

Spark SQL supports three kinds of window functions:

  • ranking functions
  • analytic functions
  • aggregate functions
Table 1. Window Aggregate Functions in Spark SQL
Kind                  Functions
Ranking functions     rank, dense_rank, percent_rank, ntile, row_number
Analytic functions    cume_dist, lag, lead

For aggregate functions, you can use the existing aggregate functions as window functions, e.g. sum, avg, min, max and count.

You describe a window using the convenient factory methods in Window object that create a window specification that you can further refine with partitioning, ordering, and frame boundaries.

After you describe a window you can apply window aggregate functions like ranking functions (e.g. RANK), analytic functions (e.g. LAG), and the regular aggregate functions, e.g. sum, avg, max.
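
A minimal sketch of that flow, runnable in spark-shell (the sales data is made up):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{avg, rank}
  import spark.implicits._

  val sales = Seq(
    ("cell phone", "Thin", 6000),
    ("cell phone", "Ultra thin", 5000),
    ("tablet", "Mini", 5500),
    ("tablet", "Big", 2500)).toDF("category", "product", "revenue")

  // Window specification: rows of the same category, highest revenue first
  val byCategory = Window.partitionBy("category").orderBy($"revenue".desc)

  sales
    .withColumn("rank", rank() over byCategory)                                       // ranking function
    .withColumn("category_avg", avg("revenue") over Window.partitionBy("category"))  // regular aggregate as a window function
    .show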

Note
Window functions are supported in structured queries using SQL and Column-based expressions.

Although similar to aggregate functions, a window function does not group rows into a single output row and retains their separate identities. A window function can access rows that are linked to the current row.

Note
The main difference between window aggregate functions and aggregate functions with grouping operators is that the former calculate a value for every row in a window while the latter give you at most the number of input rows, one value per group.
Tip
See Examples section in this document.

You mark a function as a window function with an OVER clause after the function in SQL, e.g. avg(revenue) OVER (…​), or with the over method on a function in the Dataset API, e.g. rank().over(…​).

Note
Window functions belong to Window functions group in Spark’s Scala API.
Note
Window-based framework is available as an experimental feature since Spark 1.4.0.

Window object

Window object provides functions to define windows (as WindowSpec instances).

Window object lives in org.apache.spark.sql.expressions package. Import it to use Window functions.

There are two families of functions in the Window object that create a WindowSpec for one or more Column instances: partitionBy and orderBy.

Partitioning Records — partitionBy Methods

partitionBy creates an instance of WindowSpec with partition expression(s) defined for one or more columns.

Ordering in Windows — orderBy Methods

orderBy allows you to control the order of records in a window.

rangeBetween Method

rangeBetween creates a WindowSpec with the frame boundaries from start (inclusive) to end (inclusive).

Note
It is recommended to use Window.unboundedPreceding, Window.unboundedFollowing and Window.currentRow to describe the frame boundaries when a frame is unbounded preceding, unbounded following and at current row, respectively.

Internally, rangeBetween creates a WindowSpec with SpecifiedWindowFrame and RangeFrame type.

Frame

At its core, a window function calculates a return value for every input row of a table based on a group of rows, called the frame. Every input row can have a unique frame associated with it.

When you define a frame you have to specify three components of a frame specification – the start and end boundaries, and the type.

Types of boundaries (three positions and two offsets):

  • UNBOUNDED PRECEDING – the first row of the partition
  • UNBOUNDED FOLLOWING – the last row of the partition
  • CURRENT ROW
  • <value> PRECEDING
  • <value> FOLLOWING

Offsets specify the offset from the current input row.

Types of frames:

  • ROW – based on physical offsets from the position of the current input row
  • RANGE – based on logical offsets from the position of the current input row

In the current implementation of WindowSpec you can use two methods to define a frame:

  • rowsBetween
  • rangeBetween

See WindowSpec for their coverage.

Window Operators in SQL Queries

The grammar of window operators in SQL accepts the following:

  1. CLUSTER BY or PARTITION BY or DISTRIBUTE BY for partitions,
  2. ORDER BY or SORT BY for sorting order,
  3. RANGE, ROWS, RANGE BETWEEN, and ROWS BETWEEN for window frame types,
  4. UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, CURRENT ROW for frame bounds.
Tip
Consult withWindows helper in AstBuilder.

Examples

Top N per Group

Top N per Group is useful when you need to compute the first and second best-sellers in a category.

Note
This example is borrowed from an excellent article Introducing Window Functions in Spark SQL.
Table 2. Table PRODUCT_REVENUE
product       category      revenue
Thin          cell phone    6000
Normal        tablet        1500
Mini          tablet        5500
Ultra thin    cell phone    5000
Very thin     cell phone    6000
Big           tablet        2500
Bendable      cell phone    3000
Foldable      cell phone    3000
Pro           tablet        4500
Pro2          tablet        6500

Question: What are the best-selling and the second best-selling products in every category?

The question boils down to ranking products in a category based on their revenue, and picking the best-selling and the second best-selling products based on the ranking.
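
One possible solution, sketched with dense_rank over a per-category window (assumes spark-shell; the data is the PRODUCT_REVENUE table above):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.dense_rank
  import spark.implicits._

  val productRevenue = Seq(
    ("Thin", "cell phone", 6000), ("Normal", "tablet", 1500),
    ("Mini", "tablet", 5500), ("Ultra thin", "cell phone", 5000),
    ("Very thin", "cell phone", 6000), ("Big", "tablet", 2500),
    ("Bendable", "cell phone", 3000), ("Foldable", "cell phone", 3000),
    ("Pro", "tablet", 4500), ("Pro2", "tablet", 6500))
    .toDF("product", "category", "revenue")

  // Rank products by revenue within each category and keep the two best-selling ones
  val byCategoryRevenue = Window.partitionBy("category").orderBy($"revenue".desc)

  productRevenue
    .withColumn("rank", dense_rank() over byCategoryRevenue)
    .where($"rank" <= 2)
    .show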

Revenue Difference per Category

Note
This example is the 2nd example from an excellent article Introducing Window Functions in Spark SQL.

Difference on Column

Compute a difference between values in rows in a column.

The key here is to remember that DataFrames are RDDs under the covers, so an aggregation such as grouping by a key in a DataFrame boils down to RDD’s groupBy (or worse, reduceByKey or aggregateByKey transformations).

Running Total

The running total is the sum of all previous rows including the current one.
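
A minimal sketch of a running total (made-up orders data; assumes spark-shell):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.sum
  import spark.implicits._

  val orders = Seq((1, 100), (2, 40), (3, 60), (4, 10)).toDF("id", "amount")

  // Frame: from the first row of the partition up to and including the current row
  val runningTotalWindow = Window
    .orderBy("id")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

  orders.withColumn("running_total", sum("amount") over runningTotalWindow).show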

Calculate rank of row

See “Explaining” Query Plans of Windows for an elaborate example.

Interval data type for Date and Timestamp types

With the Interval data type, you could use intervals as values specified in <value> PRECEDING and <value> FOLLOWING for RANGE frame. It is specifically suited for time-series analysis with window functions.

Accessing values of earlier rows

FIXME What’s the value of rows before current one?

Moving Average

Cumulative Aggregates

Eg. cumulative sum

User-defined aggregate functions

With the window function support, you could use user-defined aggregate functions as window functions.

“Explaining” Query Plans of Windows

lag Window Function

lag returns the value in the e / columnName column that is offset records before the current record, or defaultValue if there are fewer than offset records before the current record in the window partition (null when defaultValue is not specified).

Caution
FIXME It looks like lag with a default value has a bug — the default value’s not used at all.
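
A small lag sketch without a default value (made-up data; assumes spark-shell):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.lag
  import spark.implicits._

  val buckets = spark.range(9).withColumn("bucket", $"id" % 3)
  val byBucket = Window.partitionBy("bucket").orderBy("id")

  // id of the previous record in the same bucket; null for the first record in a partition
  buckets.withColumn("prev_id", lag("id", 1) over byBucket).show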

lead Window Function

lead returns the value that is offset records after the current record, or defaultValue if there are fewer than offset records after the current record in the window partition (null when defaultValue is not specified).

Caution
FIXME It looks like lead with a default value has a bug — the default value’s not used at all.

Cumulative Distribution of Records Across Window Partitions — cume_dist Window Function

cume_dist computes the cumulative distribution of the records in window partitions. This is equivalent to SQL’s CUME_DIST function.

Sequential numbering per window partition — row_number Window Function

row_number returns a sequential number starting at 1 within a window partition.

ntile Window Function

ntile computes the ntile group id (from 1 to n inclusive) in an ordered window partition.

Caution
FIXME How is ntile different from rank? What about performance?

Ranking Records per Window Partition — rank Window Function

rank functions assign the sequential rank of each distinct value per window partition. They are equivalent to RANK, DENSE_RANK and PERCENT_RANK functions in the good ol’ SQL.

rank function assigns the same rank for duplicate rows with a gap in the sequence (similarly to Olympic medal places). dense_rank is like rank for duplicate rows but compacts the ranks and removes the gaps.
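
The following sketch contrasts rank, dense_rank and row_number on tied values (made-up scores; assumes spark-shell):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{dense_rank, rank, row_number}
  import spark.implicits._

  val scores = Seq(("A", 10), ("B", 10), ("C", 8), ("D", 6)).toDF("name", "score")
  val byScore = Window.orderBy($"score".desc)

  // rank leaves a gap after the tie (1, 1, 3, 4) while dense_rank compacts it (1, 1, 2, 3)
  scores
    .withColumn("rank", rank() over byScore)
    .withColumn("dense_rank", dense_rank() over byScore)
    .withColumn("row_number", row_number() over byScore)
    .show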

currentRow Window Function

currentRow…​FIXME

unboundedFollowing Window Function

unboundedFollowing…​FIXME

unboundedPreceding Window Function

unboundedPreceding…​FIXME

Regular Functions (Non-Aggregate Functions)



Table 1. (Subset of) Regular Functions
Name Description

array

broadcast

coalesce

Gives the first non-null value among the given columns or null.

col and column

Creating Columns

expr

lit

map

monotonically_increasing_id

struct

typedLit

when

broadcast Function

broadcast function marks the input Dataset as small enough to be used in broadcast join.

Note
broadcast standard function is a special case of Dataset.hint operator that allows for attaching any hint to a logical plan.
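
A minimal sketch (made-up datasets; assumes spark-shell):

  import org.apache.spark.sql.functions.broadcast
  import spark.implicits._

  val large = Seq((0, "zero"), (1, "one"), (2, "two")).toDF("id", "token")
  val small = Seq((0, "a"), (1, "b")).toDF("id", "label")

  // Mark the small side so the planner prefers a broadcast hash join
  val joined = large.join(broadcast(small), "id")
  joined.explain   // the physical plan should contain BroadcastHashJoin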

coalesce Function

coalesce gives the first non-null value among the given columns or null.

coalesce requires at least one column and all columns have to be of the same or compatible types.

Internally, coalesce creates a Column with a Coalesce expression (with the children being the expressions of the input Column).

Example: coalesce Function
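
A minimal sketch with nullable columns (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.coalesce
  import spark.implicits._

  val df = Seq(
    (Some(1), None: Option[Int]),
    (None: Option[Int], Some(2)),
    (None: Option[Int], None: Option[Int])).toDF("a", "b")

  // First non-null value among the given columns; null when both are null
  df.withColumn("first_non_null", coalesce($"a", $"b")).show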

Creating Columns — col and column Functions

col and column methods create a Column that you can later use to reference a column in a dataset.

expr Function

expr function parses the input SQL expression string into the Column it represents.

Internally, expr uses the active session’s sqlParser or creates a new SparkSqlParser to call parseExpression method.
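
A quick sketch (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.expr
  import spark.implicits._

  val df = Seq((1, "hello"), (2, "world")).toDF("id", "text")

  // expr turns a SQL expression string into the Column it represents
  df.select($"id", expr("upper(text) AS upper_text"), expr("id % 2 = 0 AS is_even")).show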

lit Function

lit function…​FIXME

struct Functions

struct family of functions allows you to create a new struct column based on a collection of Column or their names.

Note
The difference between struct and the similar array function is that the types of the columns can be different (in struct).

typedLit Function

typedLit…​FIXME

array Function

array…​FIXME

map Function

map…​FIXME

when Function

when…​FIXME

monotonically_increasing_id Function

monotonically_increasing_id returns monotonically increasing 64-bit integers. The generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive (unless all rows are in the same single partition which you rarely want due to the amount of the data).

The current implementation uses the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. That assumes that the data set has less than 1 billion partitions, and each partition has less than 8 billion records.

Internally, monotonically_increasing_id creates a Column with a MonotonicallyIncreasingID non-deterministic leaf expression.
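
A small sketch that makes the partition-boundary gaps visible (assumes spark-shell):

  import org.apache.spark.sql.functions.monotonically_increasing_id

  // With 2 partitions the IDs jump at the partition boundary: unique and increasing, but not consecutive
  spark.range(0, 8, 1, numPartitions = 2)
    .withColumn("uid", monotonically_increasing_id())
    .show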

Date and Time Functions



Table 1. (Subset of) Standard Functions for Date and Time
Name Description

current_date

Gives current date as a date column

current_timestamp

date_format

to_date

Converts column to date type (with an optional date format)

to_timestamp

Converts column to timestamp type (with an optional timestamp format)

unix_timestamp

Converts current or specified time to Unix timestamp (in seconds)

window

Generates time windows (i.e. tumbling, sliding and delayed windows)

Current Date As Date Column — current_date Function

current_date function gives the current date as a date column.

Internally, current_date creates a Column with CurrentDate Catalyst leaf expression.

date_format Function

Internally, date_format creates a Column with DateFormatClass binary expression. DateFormatClass takes the expression from dateExpr column and format.

current_timestamp Function

Caution
FIXME
Note
current_timestamp is also now function in SQL.

Converting Current or Specified Time to Unix Timestamp — unix_timestamp Function

  1. With no arguments, gives the current timestamp (in seconds)

  2. With a column argument, converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds)

unix_timestamp converts the current or specified time in the specified format to a Unix timestamp (in seconds).

unix_timestamp supports a column of type Date, Timestamp or String.

unix_timestamp returns null if conversion fails.

Note

unix_timestamp is also supported in SQL mode.

Internally, unix_timestamp creates a Column with UnixTimestamp binary expression (possibly with CurrentTimestamp).
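
A minimal sketch of the variants (made-up time strings; assumes spark-shell):

  import org.apache.spark.sql.functions.unix_timestamp
  import spark.implicits._

  val times = Seq("2019-07-01 12:00:00", "2019-07-01 12:00:30").toDF("time")

  times.select(
    $"time",
    unix_timestamp($"time") as "seconds",                      // default format yyyy-MM-dd HH:mm:ss
    unix_timestamp($"time", "yyyy-MM-dd HH:mm:ss") as "same",  // explicit format
    unix_timestamp() as "now"                                  // current timestamp (in seconds)
  ).show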

Generating Time Windows — window Function

  1. window(timeColumn, windowDuration) creates a tumbling time window (slideDuration equal to windowDuration and 0 seconds for startTime)

  2. window(timeColumn, windowDuration, slideDuration) creates a sliding time window (0 seconds for startTime)

  3. window(timeColumn, windowDuration, slideDuration, startTime) creates a delayed time window

window generates tumbling, sliding or delayed time windows of windowDuration duration, given a timeColumn column with timestamps.

Note

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

Note

Tumbling windows group elements of a stream into finite sets where each set corresponds to an interval.

Tumbling windows discretize a stream into non-overlapping windows.

timeColumn should be of TimestampType, i.e. with java.sql.Timestamp values.

Tip
Use java.sql.Timestamp.from or java.sql.Timestamp.valueOf factory methods to create Timestamp instances.

windowDuration and slideDuration are strings specifying the width of the window and the sliding interval, respectively.

Tip
Use CalendarInterval for valid window identifiers.
Note
window is available as of Spark 2.0.0.

Internally, window creates a Column (with TimeWindow expression) available as window alias.

Example — Traffic Sensor

Note
The example is borrowed from Introducing Stream Windows in Apache Flink.

The example shows how to use window function to model a traffic sensor that counts every 15 seconds the number of vehicles passing a certain location.
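
The original listing is not reproduced here, so the following is only a sketch of the idea (made-up readings; assumes spark-shell):

  import java.sql.Timestamp
  import org.apache.spark.sql.functions.{count, window}
  import spark.implicits._

  // Hypothetical sensor readings: one timestamp per passing vehicle
  val readings = Seq(
    Timestamp.valueOf("2019-07-01 12:00:03"),
    Timestamp.valueOf("2019-07-01 12:00:11"),
    Timestamp.valueOf("2019-07-01 12:00:17"),
    Timestamp.valueOf("2019-07-01 12:00:29")).toDF("time")

  // Count vehicles per 15-second tumbling window
  readings
    .groupBy(window($"time", "15 seconds"))
    .agg(count("*") as "vehicles")
    .orderBy("window")
    .show(truncate = false)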

Converting Column To DateType — to_date Function

to_date converts the column into DateType (by casting to DateType).

Note
fmt follows the formatting styles.

Internally, to_date creates a Column with ParseToDate expression (and Literal expression for fmt).

Tip
Use ParseToDate expression to use a column for the values of fmt.

Converting Column To TimestampType — to_timestamp Function

to_timestamp converts the column into TimestampType (by casting to TimestampType).

Note
fmt follows the formatting styles.

Internally, to_timestamp creates a Column with ParseToTimestamp expression (and Literal expression for fmt).

Tip
Use ParseToTimestamp expression to use a column for the values of fmt.
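
A minimal sketch of to_date and to_timestamp with an explicit fmt (made-up input; assumes spark-shell):

  import org.apache.spark.sql.functions.{to_date, to_timestamp}
  import spark.implicits._

  val df = Seq("2019/07/01 12:30:00").toDF("s")

  df.select(
    to_date($"s", "yyyy/MM/dd HH:mm:ss") as "date",           // casts down to DateType
    to_timestamp($"s", "yyyy/MM/dd HH:mm:ss") as "timestamp"  // keeps the time part
  ).show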

Collection Functions


Standard Functions for Collections (Collection Functions)

Table 1. (Subset of) Standard Functions for Handling Collections
Name Description

array_contains

explode

explode_outer

Creates a new row for each element in the given array or map column.

If the array/map is null or empty then null is produced.

from_json

Extract data from arbitrary JSON-encoded values into a StructType or ArrayType of StructType elements with the specified schema

map_keys

map_values

posexplode

posexplode_outer

reverse

Returns a reversed string or an array with reverse order of elements

Note
Support for reversing arrays is new in 2.4.0.

size

Returns the size of the given array or map. Returns -1 if null.

reverse Collection Function

reverse…​FIXME

size Collection Function

size returns the size of the given array or map. Returns -1 if null.

Internally, size creates a Column with Size unary expression.
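
A quick sketch (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.size
  import spark.implicits._

  val df = Seq((1, Seq(1, 2, 3)), (2, Seq.empty[Int]), (3, null)).toDF("id", "xs")

  // Number of elements per row; -1 for the null array
  df.select($"id", size($"xs") as "size").show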

posexplode Collection Function

posexplode…​FIXME

posexplode_outer Collection Function

posexplode_outer…​FIXME

explode Collection Function

Caution
FIXME

Note
explode function is an equivalent of flatMap operator for Dataset.

explode_outer Collection Function

explode_outer generates a new row for each element in the e array or map column.

Note
Unlike explode, explode_outer generates null when the array or map is null or empty.

Internally, explode_outer creates a Column with GeneratorOuter and Explode Catalyst expressions.
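
A sketch that contrasts explode and explode_outer (made-up data; assumes spark-shell):

  import org.apache.spark.sql.functions.{explode, explode_outer}
  import spark.implicits._

  val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String]), (3, null)).toDF("id", "letters")

  // explode drops the rows whose array is null or empty...
  df.select($"id", explode($"letters") as "letter").show
  // ...while explode_outer keeps them and produces null
  df.select($"id", explode_outer($"letters") as "letter").show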

Extracting Data from Arbitrary JSON-Encoded Values — from_json Collection Function

from_json comes in several overloaded variants: the schema can be given as a StructType, as a DataType, or as a string in the JSON or DDL format, and each variant exists with and without parsing options (the variants without options simply relay to the others with empty options).

from_json parses a column with a JSON-encoded value into a StructType or ArrayType of StructType elements with the specified schema.

Note

A schema can be one of the following:

  1. DataType as a Scala object or in the JSON format

  2. StructType in the DDL format

Note
options controls how a JSON is parsed and contains the same options as the json format.

Internally, from_json creates a Column with JsonToStructs unary expression.

Note
from_json (creates a JsonToStructs that) uses a JSON parser in FAILFAST parsing mode that simply fails early when a corrupted/malformed record is found (and hence does not support columnNameOfCorruptRecord JSON option).

Note
from_json corresponds to SQL’s from_json.
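
A minimal sketch with a StructType schema (made-up JSON strings; assumes spark-shell):

  import org.apache.spark.sql.functions.from_json
  import org.apache.spark.sql.types.{LongType, StringType, StructType}
  import spark.implicits._

  val jsons = Seq("""{"id": 0, "name": "zero"}""", """{"id": 1, "name": "one"}""").toDF("json")

  val schema = new StructType()
    .add("id", LongType)
    .add("name", StringType)

  // Parse the JSON-encoded string into a struct and pull out its fields
  jsons
    .select(from_json($"json", schema) as "parsed")
    .select($"parsed.id", $"parsed.name")
    .show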

array_contains Collection Function

array_contains creates a Column that checks whether the given array column contains a value; the value has to be of the same type as the elements of the array.

Internally, array_contains creates a Column with a ArrayContains expression.

array_contains corresponds to SQL’s array_contains.

Tip
Use SQL’s array_contains to use values from columns for the column and value arguments.

map_keys Collection Function

map_keys…​FIXME

map_values Collection Function

map_values…​FIXME

Standard Functions — functions Object



org.apache.spark.sql.functions object defines built-in standard functions to work with (values produced by) columns.

You can access the standard functions using the following import statement in your Scala application:

Table 1. (Subset of) Standard Functions in Spark SQL
Name Description

Aggregate functions

approx_count_distinct

avg

collect_list

collect_set

corr

count

countDistinct

covar_pop

covar_samp

first

Returns the first value in a group. Returns the first non-null value when the ignoreNulls flag is on. If all values are null, then returns null.

grouping

Indicates whether a given column is aggregated or not

grouping_id

Computes the level of grouping

kurtosis

last

max

mean

min

skewness

stddev

stddev_pop

stddev_samp

sum

sumDistinct

variance

var_pop

var_samp

Collection functions

array_contains

array_distinct

(New in 2.4.0)

array_except

(New in 2.4.0)

array_intersect

(New in 2.4.0)

array_join

(New in 2.4.0)

array_max

(New in 2.4.0)

array_min

(New in 2.4.0)

array_position

(New in 2.4.0)

array_remove

(New in 2.4.0)

array_repeat

(New in 2.4.0)

array_sort

(New in 2.4.0)

array_union

(New in 2.4.0)

arrays_zip

(New in 2.4.0)

arrays_overlap

(New in 2.4.0)

element_at

(New in 2.4.0)

explode

explode_outer

Creates a new row for each element in the given array or map column. If the array/map is null or empty then null is produced.

flatten

(New in 2.4.0)

from_json

(New variant in 2.4.0)

Parses a column with a JSON string into a StructType or ArrayType of StructType elements with the specified schema.

map_concat

(New in 2.4.0)

map_from_entries

(New in 2.4.0)

map_keys

map_values

posexplode

posexplode_outer

reverse

Returns a reversed string or an array with reverse order of elements

Note
Support for reversing arrays is new in 2.4.0.

schema_of_json

(New in 2.4.0)

sequence

(New in 2.4.0)

shuffle

(New in 2.4.0)

size

Returns the size of the given array or map. Returns -1 if null.

slice

(New in 2.4.0)

Date and time functions

current_date

current_timestamp

from_utc_timestamp

(New variant in 2.4.0)

months_between

(New variant in 2.4.0)

to_date

to_timestamp

to_utc_timestamp

(New variant in 2.4.0)

unix_timestamp

Converts current or specified time to Unix timestamp (in seconds)

window

Generates tumbling time windows

Math functions

bin

Converts the value of a long column to binary format

Regular functions (Non-aggregate functions)

array

broadcast

coalesce

Gives the first non-null value among the given columns or null

col and column

Creating Columns

expr

lit

map

monotonically_increasing_id

Returns monotonically increasing 64-bit integers that are guaranteed to be monotonically increasing and unique, but not consecutive.

struct

typedLit

when

String functions

split

upper

UDF functions

udf

Creating UDFs

callUDF

Executes a UDF by name with a variable-length list of columns

Window functions

cume_dist

Computes the cumulative distribution of records across window partitions

currentRow

dense_rank

Computes the rank of records per window partition

lag

lead

ntile

Computes the ntile group

percent_rank

Computes the rank of records per window partition

rank

Computes the rank of records per window partition

row_number

Computes the sequential numbering per window partition

unboundedFollowing

unboundedPreceding

Tip
The page gives only a brief overview of the many functions available in the functions object, so you should read the official documentation of the functions object.

Executing UDF by Name and Variable-Length Column List — callUDF Function

callUDF executes a UDF by udfName with a variable-length list of columns.

Defining UDFs — udf Function

The udf family of functions allows you to create user-defined functions (UDFs) based on a user-defined function in Scala. It accepts a function f of 0 to 10 arguments, and the input and output types are automatically inferred (from the types of the input parameters and the result type of the function f).

Since Spark 2.0.0, there is another variant of udf function:

udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure for the function argument (as f) and explicitly declaring the output data type (as dataType).
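
A minimal sketch of both variants (the strLen names are illustrative; assumes spark-shell):

  import org.apache.spark.sql.functions.udf
  import org.apache.spark.sql.types.IntegerType
  import spark.implicits._

  // Input and output types inferred from the Scala function's signature
  val strLen = udf { s: String => s.length }

  // Scala closure with the output data type declared explicitly (the Spark 2.0.0 variant)
  val strLenExplicit = udf((s: String) => s.length, IntegerType)

  val df = Seq("hello", "spark").toDF("word")
  df.select($"word", strLen($"word") as "len", strLenExplicit($"word") as "len2").show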

split Function

split function splits the str column using the pattern regular expression. It returns a new Column.

Note
.$|()[{^?*+\ are RegEx’s meta characters and are considered special.
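
A small sketch where the pattern contains a metacharacter (made-up input; assumes spark-shell):

  import org.apache.spark.sql.functions.split
  import spark.implicits._

  val df = Seq("hello|world|again").toDF("s")

  // The pattern is a regular expression, so the | metacharacter has to be escaped
  df.select(split($"s", "\\|") as "words").show(truncate = false)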

upper Function

upper function converts a string column to upper case. It returns a new Column.

Note
The following example uses two functions that accept a Column and return another to showcase how to chain them.
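
The original listing is not reproduced here, so here is a minimal sketch of such chaining (trim and upper are used purely for illustration; assumes spark-shell):

  import org.apache.spark.sql.functions.{trim, upper}
  import spark.implicits._

  val df = Seq("  Hello  ", " Spark ").toDF("s")

  // trim and upper both accept a Column and return a Column, so they chain naturally
  df.select(upper(trim($"s")) as "cleaned").show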

Converting Long to Binary Format (in String Representation) — bin Function

  1. Calls the first bin with columnName as a Column

bin converts the long value in a column to its binary format (i.e. as an unsigned integer in base 2) with no extra leading 0s.

Internally, bin creates a Column with Bin unary expression.

Note
Bin unary expression uses java.lang.Long.toBinaryString for the conversion.
Note

Bin expression supports code generation (aka CodeGen).

Window Utility Object — Defining Window Specification



Window utility object is a set of static methods to define a window specification.

Table 1. Window API
Method Description

currentRow

Value representing the current row that is used to define frame boundaries.

orderBy

Creates a WindowSpec with the ordering defined.

partitionBy

Creates a WindowSpec with the partitioning defined.

rangeBetween

Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are relative to the current row based on the actual value of the ORDER BY expression(s).

rowsBetween

Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are positions relative to the current row based on the position of the row within the partition.

unboundedFollowing

Value representing the last row in a partition (equivalent to “UNBOUNDED FOLLOWING” in SQL) that is used to define frame boundaries.

unboundedPreceding

Value representing the first row in a partition (equivalent to “UNBOUNDED PRECEDING” in SQL) that is used to define frame boundaries.

Creating “Empty” WindowSpec — spec Internal Method

spec creates an “empty” WindowSpec, i.e. with empty partition and ordering specifications, and a UnspecifiedFrame.

Note

spec is used when the Window object’s partitionBy, orderBy, rowsBetween and rangeBetween methods create the initial WindowSpec.

WindowSpec — Window Specification



WindowSpec is a window specification that defines which rows are included in a window (frame), i.e. the set of rows that are associated with the current row by some relation.

WindowSpec takes the following when created:

  • Partition specification (Seq[Expression]) which defines which records are in the same partition. With no partition defined, all records belong to a single partition

  • Ordering Specification (Seq[SortOrder]) which defines how records in a partition are ordered that in turn defines the position of a record in a partition. The ordering could be ascending (ASC in SQL or asc in Scala) or descending (DESC or desc).

  • Frame Specification (WindowFrame) which defines the rows to be included in the frame for the current row, based on their relative position to the current row. For example, “the three rows preceding the current row to the current row” describes a frame including the current input row and three rows appearing before the current row.

You use Window object to create a WindowSpec.

Once the initial version of a WindowSpec is created, you use the methods to further configure the window specification.

Table 1. WindowSpec API
Method Description

orderBy

partitionBy

rangeBetween

rowsBetween

With a window specification fully defined, you use Column.over operator that associates the WindowSpec with an aggregate or window function.
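
A sketch that builds a WindowSpec with all three parts and applies it with over (made-up data; assumes spark-shell):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.sum
  import spark.implicits._

  val sales = Seq(
    ("cell phone", 6000), ("cell phone", 5000),
    ("tablet", 5500), ("tablet", 2500)).toDF("category", "revenue")

  // Build the window specification step by step...
  val spec = Window
    .partitionBy("category")                                    // partition specification
    .orderBy($"revenue".desc)                                   // ordering specification
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)  // frame specification

  // ...and associate it with an aggregate function using Column.over
  sales.withColumn("running_category_total", sum("revenue") over spec).show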

withAggregate Internal Method

withAggregate…​FIXME

Note
withAggregate is used exclusively when Column.over operator is used.
