Regular Functions (Non-Aggregate Functions)

Table 1. (Subset of) Regular Functions

Name                          Description
array
broadcast                     Marks a Dataset as small enough to be used in a broadcast join
coalesce                      Gives the first non-null value among the given columns or null
col and column                Creating Columns
expr
lit
map
monotonically_increasing_id   Returns monotonically increasing and unique (but not consecutive) 64-bit integers
struct
typedLit
when

broadcast Function

The broadcast function marks the input Dataset as small enough to be used in a broadcast join.

Note
The broadcast standard function is a special case of the Dataset.hint operator, which allows for attaching any hint to a logical plan.
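
A minimal sketch of broadcast in a join (the sample DataFrames and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val large = Seq((0, "zero"), (1, "one"), (2, "two")).toDF("id", "name")
val small = Seq((0, "a"), (1, "b")).toDF("id", "tag")

// Marking `small` as broadcastable nudges the planner towards a broadcast hash join
val joined = large.join(broadcast(small), "id")
joined.explain()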

coalesce Function

coalesce gives the first non-null value among the given columns or null.

coalesce requires at least one column and all columns have to be of the same or compatible types.

Internally, coalesce creates a Column with a Coalesce expression (with the children being the expressions of the input Columns).

Example: coalesce Function
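
The following is a minimal sketch (the sample data and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.{coalesce, lit}
import spark.implicits._

val df = Seq[(Option[Int], Option[Int])](
  (Some(1), None),
  (None, Some(2)),
  (None, None)
).toDF("a", "b")

// The first non-null value per row, with the literal 0 as the last resort
df.select(coalesce($"a", $"b", lit(0)) as "value").show()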

Creating Columns — col and column Functions

col and column methods create a Column that you can later use to reference a column in a dataset.
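
A quick sketch (a DataFrame named people with columns id and name is assumed for illustration):

import org.apache.spark.sql.functions.{col, column}

val idCol   = col("id")        // Column that references the "id" column
val nameCol = column("name")   // column behaves the same as col

people.select(idCol, nameCol).show()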

expr Function

The expr function parses the input SQL expression (given as a string) into the Column it represents.

Internally, expr uses the active session’s sqlParser or creates a new SparkSqlParser and calls its parseExpression method.
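
A minimal sketch (the sample data and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = Seq((1, "hello"), (2, "world")).toDF("id", "text")

// SQL expression strings parsed into Columns
df.select(expr("id + 1") as "next_id", expr("upper(text)") as "upper_text").show()
df.filter(expr("id > 1")).show()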

lit Function

lit function…​FIXME
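
A minimal sketch of lit creating literal Columns from constant values (the DataFrame df is an assumption made for illustration):

import org.apache.spark.sql.functions.lit

// lit wraps a constant value in a literal Column
df.select(lit(1) as "one", lit("spark") as "name").show()
df.withColumn("active", lit(true)).show()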

struct Functions

The struct family of functions allows you to create a new struct column based on a collection of Columns or their names.

Note
The difference between struct and the similar array function is that the types of the columns can be different in struct.
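
A minimal sketch (the sample data and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.struct
import spark.implicits._

val df = Seq((1, "one"), (2, "two")).toDF("id", "name")

// struct accepts Columns or column names and yields a single struct column
df.select(struct($"id", $"name") as "record").show()
df.select(struct("id", "name") as "record").printSchema()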

typedLit Function

typedLit…​FIXME
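
A sketch of typedLit creating literal Columns from Scala collection values (the DataFrame df is an assumption made for illustration):

import org.apache.spark.sql.functions.typedLit

// typedLit can create literal Columns for Scala collection types
val nums   = typedLit(Seq(1, 2, 3))
val lookup = typedLit(Map("a" -> 1, "b" -> 2))

df.select(nums as "nums", lookup as "lookup").show(false)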

array Function

array…​FIXME
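
A minimal sketch of array in use (the sample data and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.array
import spark.implicits._

val df = Seq((1, 2, 3)).toDF("a", "b", "c")

// All input columns should share a compatible type
df.select(array($"a", $"b", $"c") as "values").show()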

map Function

map…​FIXME
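
A sketch of map building a map column from interleaved key and value Columns (the sample data and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.{lit, map}
import spark.implicits._

val df = Seq(("en", "one"), ("pl", "jeden")).toDF("lang", "word")

// map takes Columns grouped as key-value pairs: key1, value1, key2, value2, ...
df.select(map(lit("lang"), $"lang", lit("word"), $"word") as "m").show(false)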

when Function

when…​FIXME
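
A sketch of when chained with otherwise (the sample data and a SparkSession named spark are assumptions made for illustration):

import org.apache.spark.sql.functions.when
import spark.implicits._

val df = Seq(-1, 0, 5).toDF("n")

// Conditions are checked in order; otherwise supplies the default value
df.select(
  $"n",
  when($"n" < 0, "negative").when($"n" === 0, "zero").otherwise("positive") as "sign"
).show()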

monotonically_increasing_id Function

monotonically_increasing_id returns monotonically increasing 64-bit integers. The generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive (unless all rows are in a single partition, which you rarely want given the amount of data).

The current implementation uses the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. That assumes that the data set has less than 1 billion partitions, and each partition has less than 8 billion records.

Internally, monotonically_increasing_id creates a Column with a MonotonicallyIncreasingID non-deterministic leaf expression.
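
A minimal sketch (a SparkSession named spark is assumed; the exact IDs depend on how the rows are partitioned):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Two partitions make the gap between generated IDs visible
val df = spark.range(0, 4, 1, 2).toDF("id")

df.withColumn("row_id", monotonically_increasing_id()).show()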
