Standard Functions — functions Object
The org.apache.spark.sql.functions object defines built-in standard functions to work with (values produced by) columns.

You can access the standard functions using the following import statement in your Scala application:
import org.apache.spark.sql.functions._
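With the import in place, the standard functions are available directly on columns. A minimal sketch (assuming a spark-shell session, so that toDF and $ are available):

// assumes spark-shell: SparkSession available as spark, implicits imported
val names = Seq("hello", "world").toDF("name")

// apply standard functions to the "name" column
names.select(upper($"name"), length($"name")).show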
Name | Description
---|---
Aggregate functions |
first | Returns the first value in a group. Returns the first non-null value when the ignoreNulls flag is enabled.
grouping | Indicates whether a given column is aggregated or not
grouping_id | Computes the level of grouping
Collection functions |
explode | Creates a new row for each element in the given array or map column. If the array/map is null or empty, no rows are produced.
from_json | Parses a column with a JSON string into a StructType or ArrayType of StructTypes with the specified schema
reverse | Returns a reversed string or an array with reverse order of elements (reversing arrays is new in 2.4.0)
size | Returns the size of the given array or map. Returns -1 if the column is null.
Math functions |
bin | Converts the value of a long column to binary format
Regular functions (non-aggregate functions) |
coalesce | Gives the first non-null value among the given columns, or null
col, column | Creating Columns
monotonically_increasing_id | Returns monotonically increasing 64-bit integers that are guaranteed to be unique, but not consecutive
String functions |
split | Splits a string column around matches of a regex pattern
upper | Converts a string column to upper case
UDF functions |
udf | Creating UDFs
callUDF | Executing a UDF by name with a variable-length list of columns
Window functions |
cume_dist | Computes the cumulative distribution of records across window partitions
dense_rank | Computes the rank of records per window partition, with no gaps in ranking
ntile | Computes the ntile group
percent_rank | Computes the relative rank of records per window partition
rank | Computes the rank of records per window partition
row_number | Computes the sequential numbering per window partition

Spark 2.4.0 also added a large number of collection functions that are not listed here.
Tip
|
This page gives only a brief overview of the many functions available in the functions object, so you should read the official documentation of the functions object.
|
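To illustrate the window functions listed above, here is a minimal sketch (the salaries dataset is made up for illustration; it assumes a spark-shell session so that toDF and $ are available):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// hypothetical dataset: salaries per department
val salaries = Seq(
  ("sales", 100), ("sales", 120),
  ("hr", 90), ("hr", 95)
).toDF("dept", "salary")

// windows partition the dataset and order rows within each partition
val byDeptSalary = Window.partitionBy($"dept").orderBy($"salary".desc)

salaries.
  withColumn("rank", rank().over(byDeptSalary)).
  withColumn("row_number", row_number().over(byDeptSalary)).
  show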
Executing a UDF by Name with Variable-Length Column List — callUDF Function

callUDF(udfName: String, cols: Column*): Column

callUDF executes a UDF registered under udfName with a variable-length list of columns.
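A minimal sketch of callUDF (the myUpper name and the dataset are made up for illustration; it assumes a spark-shell session and that the UDF has first been registered by name):

import org.apache.spark.sql.functions._

// register a Scala function as a UDF under a name
spark.udf.register("myUpper", (s: String) => s.toUpperCase)

val words = Seq("hello", "world").toDF("word")

// execute the registered UDF by its name
words.select($"word", callUDF("myUpper", $"word") as "upper").show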
Defining UDFs — udf Function

udf(f: FunctionN[...]): UserDefinedFunction

The udf family of functions lets you create user-defined functions (UDFs) from a Scala function. It accepts a function f of 0 to 10 arguments, and the input and output types are inferred automatically from the type of f.
import org.apache.spark.sql.functions._
val _length: String => Int = _.length
val _lengthUDF = udf(_length)

// define a dataframe
val df = sc.parallelize(0 to 3).toDF("num")

// apply the user-defined function to "num" column
scala> df.withColumn("len", _lengthUDF($"num")).show
+---+---+
|num|len|
+---+---+
|  0|  1|
|  1|  1|
|  2|  1|
|  3|  1|
+---+---+
Since Spark 2.0.0, there is another variant of the udf function:

udf(f: AnyRef, dataType: DataType): UserDefinedFunction

udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure for the function argument (as f) and to declare the output data type explicitly (as dataType).
// given the dataframe above
import org.apache.spark.sql.types.IntegerType
val byTwo = udf((n: Int) => n * 2, IntegerType)

scala> df.withColumn("len", byTwo($"num")).show
+---+---+
|num|len|
+---+---+
|  0|  0|
|  1|  2|
|  2|  4|
|  3|  6|
+---+---+
split Function

split(str: Column, pattern: String): Column

The split function splits the str column around matches of the given pattern. It returns a new Column.
Note
|
The split function uses the java.lang.String.split(String regex, int limit) method under the covers.
|
val df = Seq((0, "hello|world"), (1, "witaj|swiecie")).toDF("num", "input")
val withSplit = df.withColumn("split", split($"input", "[|]"))

scala> withSplit.show
+---+-------------+----------------+
|num|        input|           split|
+---+-------------+----------------+
|  0|  hello|world|  [hello, world]|
|  1|witaj|swiecie|[witaj, swiecie]|
+---+-------------+----------------+
Note
|
The characters .$|()[{^?*+\ are regex meta characters and are considered special.
|
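Because pattern is a regular expression, a meta character has to be escaped (or wrapped in a character class, as [|] above) to be matched literally. A minimal sketch with a made-up dataset:

val dotted = Seq((0, "one.two.three")).toDF("num", "input")

// "." alone matches any character, so escape it to split on literal dots
scala> dotted.withColumn("split", split($"input", "\\.")).show
+---+-------------+-----------------+
|num|        input|            split|
+---+-------------+-----------------+
|  0|one.two.three|[one, two, three]|
+---+-------------+-----------------+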
upper Function

upper(e: Column): Column

The upper function converts a string column into one with all letters in upper case. It returns a new Column.
Note
|
The following example uses two functions that accept a Column and return another to showcase how to chain them.
|
val df = Seq((0, 1, "hello"), (2, 3, "world"), (2, 4, "ala")).toDF("id", "val", "name")
val withUpperReversed = df.withColumn("upper", reverse(upper($"name")))

scala> withUpperReversed.show
+---+---+-----+-----+
| id|val| name|upper|
+---+---+-----+-----+
|  0|  1|hello|OLLEH|
|  2|  3|world|DLROW|
|  2|  4|  ala|  ALA|
+---+---+-----+-----+
Converting Long to Binary Format (in String Representation) — bin Function

bin(e: Column): Column
bin(columnName: String): Column  (1)

1. Calls the first bin with columnName as a Column

bin converts the long value in a column to its binary format (i.e. as an unsigned integer in base 2) with no extra leading 0s.
scala> spark.range(5).withColumn("binary", bin('id)).show
+---+------+
| id|binary|
+---+------+
|  0|     0|
|  1|     1|
|  2|    10|
|  3|    11|
|  4|   100|
+---+------+

val withBin = spark.range(5).withColumn("binary", bin('id))

scala> withBin.printSchema
root
 |-- id: long (nullable = false)
 |-- binary: string (nullable = false)
Internally, bin creates a Column with the Bin unary expression.
scala> withBin.queryExecution.logical
res2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [*, bin('id) AS binary#14]
+- Range (0, 5, step=1, splits=Some(8))
Note
|
Bin unary expression uses java.lang.Long.toBinaryString for the conversion.
|
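A quick check in plain Scala (no Spark required) confirms the correspondence:

// the same per-row conversion that Bin performs
java.lang.Long.toBinaryString(4)  // "100"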