
Standard Functions — functions Object

The org.apache.spark.sql.functions object defines the built-in standard functions to work with (values produced by) columns.

You can access the standard functions using the following import statement in your Scala application:
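The import in question is the wildcard import of the functions object:

```scala
import org.apache.spark.sql.functions._
```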

Table 1. (Subset of) Standard Functions in Spark SQL

Aggregate functions

approx_count_distinct

avg

collect_list

collect_set

corr

count

countDistinct

covar_pop

covar_samp

first

Returns the first value in a group. Returns the first non-null value when the ignoreNulls flag is on. If all values are null, returns null.

grouping

Indicates whether a given column is aggregated or not

grouping_id

Computes the level of grouping

kurtosis

last

max

mean

min

skewness

stddev

stddev_pop

stddev_samp

sum

sumDistinct

variance

var_pop

var_samp

Collection functions

array_contains

array_distinct

(New in 2.4.0)

array_except

(New in 2.4.0)

array_intersect

(New in 2.4.0)

array_join

(New in 2.4.0)

array_max

(New in 2.4.0)

array_min

(New in 2.4.0)

array_position

(New in 2.4.0)

array_remove

(New in 2.4.0)

array_repeat

(New in 2.4.0)

array_sort

(New in 2.4.0)

array_union

(New in 2.4.0)

arrays_zip

(New in 2.4.0)

arrays_overlap

(New in 2.4.0)

element_at

(New in 2.4.0)

explode

explode_outer

Creates a new row for each element in the given array or map column. With explode_outer, a single row with null is produced when the array/map is null or empty (explode produces no rows in that case).

flatten

(New in 2.4.0)

from_json

(Some variants are new in 2.4.0)

Parses a column with a JSON string into a StructType or ArrayType of StructType elements with the specified schema.

map_concat

(New in 2.4.0)

map_from_entries

(New in 2.4.0)

map_keys

map_values

posexplode

posexplode_outer

reverse

Returns a reversed string or an array with its elements in reverse order

Note
Support for reversing arrays is new in 2.4.0.

schema_of_json

(New in 2.4.0)

sequence

(New in 2.4.0)

shuffle

(New in 2.4.0)

size

Returns the size of the given array or map. Returns -1 if the value is null.

slice

(New in 2.4.0)

Date and time functions

current_date

current_timestamp

from_utc_timestamp

(Some variants are new in 2.4.0)

months_between

(Some variants are new in 2.4.0)

to_date

to_timestamp

to_utc_timestamp

(Some variants are new in 2.4.0)

unix_timestamp

Converts current or specified time to Unix timestamp (in seconds)

window

Generates tumbling time windows

Math functions

bin

Converts the value of a long column to binary format

Regular functions (Non-aggregate functions)

array

broadcast

coalesce

Gives the first non-null value among the given columns, or null if all are null

col and column

Creating Columns

expr

lit

map

monotonically_increasing_id

Returns 64-bit integers that are guaranteed to be monotonically increasing and unique, but not consecutive.

struct

typedLit

when

String functions

split

upper

UDF functions

udf

Creating UDFs

callUDF

Executes a UDF by name with a variable-length list of columns

Window functions

cume_dist

Computes the cumulative distribution of records across window partitions

currentRow

dense_rank

Computes the rank of records per window partition, without gaps in the ranking

lag

lead

ntile

Computes the ntile group

percent_rank

Computes the relative (percent) rank of records per window partition

rank

Computes the rank of records per window partition

row_number

Computes the sequential numbering per window partition

unboundedFollowing

unboundedPreceding

Tip
The page gives only a brief overview of the many functions available in the functions object, so you should read the official documentation of the functions object.

Executing a UDF by Name with a Variable-Length Column List — callUDF Function

callUDF executes a UDF by its registered name (udfName) with a variable-length list of columns.
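A minimal sketch of registering a function under a name and then executing it via callUDF (the session setup, the name toUpper, and the sample data are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.callUDF

// Illustrative local session
val spark = SparkSession.builder().master("local[1]").appName("callUDF-demo").getOrCreate()
import spark.implicits._

// Register a Scala function under a name known to Spark SQL
spark.udf.register("toUpper", (s: String) => s.toUpperCase)

val df = Seq("hello", "spark").toDF("word")

// Execute the UDF by its registered name, passing columns as varargs
df.select(callUDF("toUpper", $"word").as("up")).show
```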

Defining UDFs — udf Function

The udf family of functions lets you create user-defined functions (UDFs) from a Scala function. It accepts a function f of 0 to 10 arguments, and the input and output types are automatically inferred from the type of f.

Since Spark 2.0.0, there is another variant of the udf function:

udf(f: AnyRef, dataType: DataType) lets you pass a Scala closure as the function (f) while explicitly declaring the output data type (dataType).
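A minimal sketch of both variants (the names strLen and strLenUntyped are illustrative):

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Typed variant: input and output types are inferred from the Scala function
val strLen = udf((s: String) => s.length)

// Untyped variant (since 2.0.0): a Scala closure plus an explicit output DataType
val strLenUntyped = udf((s: String) => s.length, IntegerType)

// Both are applied the same way, e.g. df.select(strLen($"text"))
```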

split Function

The split function splits the str column around matches of the pattern regular expression. It returns a new Column.

Note
The characters .$|()[{^?*+\ are regular-expression metacharacters and are considered special; escape them to split on the literal character.
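A short sketch showing the escaping in practice (the session setup and sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq("hello|world").toDF("line")

// '|' is a regex metacharacter, so escape it to split on the literal character
df.select(split($"line", "\\|").as("parts")).show(false)
```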

upper Function

The upper function converts a string column to all-uppercase letters. It returns a new Column.

Note
The following example uses two functions that accept a Column and return another Column to showcase how to chain them.
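The original example did not survive extraction; the following is a reconstruction of the idea with illustrative data, chaining lower and upper (each takes a Column and returns a Column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lower, upper}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq((0, "Hello"), (1, "World")).toDF("id", "text")

// lower returns a Column, which upper accepts, so the calls chain naturally
df.select(upper(lower($"text")).as("up")).show
```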

Converting Long to Binary Format (in String Representation) — bin Function

bin is also available as bin(columnName: String), which converts columnName to a Column and calls the Column-based bin.

bin converts the long value in a column to its binary format (i.e. as an unsigned integer in base 2) with no extra leading 0s.

Internally, bin creates a Column with Bin unary expression.

Note
Bin unary expression uses java.lang.Long.toBinaryString for the conversion.
Note

Bin expression supports code generation (aka CodeGen).
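A short sketch of bin in action (the session setup and sample values are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.bin

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq(5L, 12L).toDF("n")

// bin renders each long as a base-2 string with no extra leading zeros,
// e.g. 5 -> "101" and 12 -> "1100"
df.select($"n", bin($"n").as("binary")).show
```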
