Dataset API — Untyped Transformations
Untyped transformations are the part of the Dataset API for transforming a Dataset into a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions — non-typed results, and hence *untyped*.
Note

Untyped transformations are the methods in the Dataset Scala class that are grouped under the untypedrel group name, i.e. @group untypedrel.
| Transformation | Description |
|---|---|
| agg | Aggregates over the entire Dataset |
| apply | Selects a column based on the column name (i.e. maps a Dataset onto a Column) |
| col | Selects a column based on the column name (i.e. maps a Dataset onto a Column) |
| colRegex | Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column) |
| cube | Creates a RelationalGroupedDataset for multi-dimensional aggregation |
| drop | Drops one or more columns |
| groupBy | Groups the rows by the specified columns for aggregation |
| join | Joins with another Dataset |
| na | Creates a DataFrameNaFunctions to work with missing data |
| rollup | Creates a RelationalGroupedDataset for hierarchical (rollup) aggregation |
| select | Projects a set of columns |
| selectExpr | Projects columns using SQL statements |
| stat | Creates a DataFrameStatFunctions to work with statistic functions |
agg Untyped Transformation
```scala
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
```
agg aggregates over the entire Dataset without groups (a shorthand for groupBy().agg).
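A minimal sketch of the agg variants, assuming a local SparkSession (the data and column values are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

// In spark-shell, `spark` and the implicits are already in scope
val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
import spark.implicits._

val ds = Seq(1, 2, 3, 4, 5).toDS  // Dataset[Int] with a single `value` column

// Column-based variant: aggregates over the entire Dataset (no grouping columns)
val row = ds.agg(max($"value"), min($"value")).head

// Map-based variant: column name -> aggregate function name
val total = ds.agg(Map("value" -> "sum")).head.getLong(0)
```

The result of each variant is a single-row DataFrame, which is why the result type is DataFrame rather than a typed Dataset.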
apply Untyped Transformation
```scala
apply(colName: String): Column
```
apply selects a column based on the column name (i.e. maps a Dataset onto a Column).
col Untyped Transformation
```scala
col(colName: String): Column
```
col selects a column based on the column name (i.e. maps a Dataset onto a Column).
Internally, col branches off based on the input column name.

If the column name is * (a star), col simply creates a Column with a ResolvedStar expression (with the output schema attributes of the analyzed logical plan of the QueryExecution).

Otherwise, when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled, col uses the colRegex untyped transformation.

When the column name is not * and spark.sql.parser.quotedRegexColumnNames is disabled, col creates a Column with the column name resolved (as a NamedExpression).
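The star and regular-name branches can be sketched as follows (a hedged example with illustrative column names, assuming a local SparkSession):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("col-demo").getOrCreate()
import spark.implicits._

val df = Seq((0, "zero"), (1, "one")).toDF("id", "name")

val idCol = df.col("id")  // regular name -> Column with the name resolved (NamedExpression)
val star  = df.col("*")   // star -> Column with ResolvedStar over the analyzed plan's output

// Selecting the star column yields all columns of the Dataset
val out = df.select(star).columns.toSeq
```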
colRegex Untyped Transformation
```scala
colRegex(colName: String): Column
```
colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).
Note

colRegex is used in col when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).
Internally, colRegex matches the input column name against the following regular expressions (in this order):

- For column names with quotes and without a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)
- For column names with quotes and with a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with a table specified)
- For other column names, colRegex (behaves like col and) creates a Column with the column name resolved (as a NamedExpression)
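A sketch of the quoted-regex and plain-name branches (column names are illustrative; note that the regex behavior requires the spark.sql.parser.quotedRexColumnNames-style back-quoting and the configuration property to be enabled):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("colRegex-demo").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", true)

val df = Seq((0, "zero", true)).toDF("id", "name", "flag")

// Back-quoted name (no qualifier) -> UnresolvedRegex: matches id and name, not flag
val matched = df.select(df.colRegex("`(id|name)`")).columns.toSeq

// A plain (unquoted) name falls through to regular resolution, like col
val plain = df.select(df.colRegex("flag")).columns.toSeq
```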
cube Untyped Transformation
```scala
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset
```
cube creates a RelationalGroupedDataset to compute aggregates for every combination (all possible grouping sets) of the given columns.
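A hedged sketch of cube, assuming a local SparkSession and illustrative data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cube-demo").getOrCreate()
import spark.implicits._

val sales = Seq(("US", "web", 10), ("US", "shop", 20), ("PL", "web", 5))
  .toDF("country", "channel", "amount")

// cube produces one aggregate row per combination of the grouping columns,
// including per-column subtotals and the grand total (marked by nulls):
// 3 (country, channel) pairs + 2 country subtotals + 2 channel subtotals + 1 grand total
val cubed = sales.cube("country", "channel").sum("amount")
```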
Dropping One or More Columns — drop Untyped Transformation
```scala
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
```
drop creates a new DataFrame without the specified column or columns.
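A sketch of the three drop variants, assuming a local SparkSession (the column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-demo").getOrCreate()
import spark.implicits._

val df = Seq((0, "zero", true)).toDF("id", "name", "flag")

val afterOne  = df.drop("name").columns.toSeq          // by a single name
val afterMany = df.drop("name", "flag").columns.toSeq  // by multiple names
val afterCol  = df.drop(df("id")).columns.toSeq        // by a Column reference

// Dropping an unknown column is a no-op rather than an error
val noop = df.drop("bogus").columns.toSeq
```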
groupBy Untyped Transformation
```scala
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
```
groupBy groups the rows by the specified columns and creates a RelationalGroupedDataset to apply aggregate functions over the groups.
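A minimal sketch, assuming a local SparkSession (data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("groupBy-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// groupBy returns a RelationalGroupedDataset; an aggregation turns it back
// into a DataFrame (hence untyped)
val sums = df.groupBy("key").sum("value")
  .collect()
  .map(r => r.getString(0) -> r.getLong(1))
  .toMap
```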
join Untyped Transformation
```scala
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
```
join joins this Dataset with another Dataset, using either a sequence of shared (USING) columns or an explicit join expression, with an optional join type (inner by default).
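A sketch of the USING-column and expression variants, assuming a local SparkSession (data and names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

val left  = Seq((0, "zero"), (1, "one")).toDF("id", "l")
val right = Seq((0, "naught"), (2, "two")).toDF("id", "r")

// USING-column join: inner by default, single shared `id` column in the result
val usingJoin = left.join(right, "id")

// Expression join: keeps both `id` columns; here with an explicit join type
val exprJoin = left.join(right, left("id") === right("id"), "left_outer")
```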
na Untyped Transformation
```scala
na: DataFrameNaFunctions
```
na simply creates a DataFrameNaFunctions to work with missing data.
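A small sketch of working with missing data through na, assuming a local SparkSession (data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-demo").getOrCreate()
import spark.implicits._

val df = Seq((Some(1), "a"), (None, "b")).toDF("n", "s")

// na.fill replaces nulls in numeric columns with the given value
val filled = df.na.fill(0).collect().map(_.getInt(0)).toSeq

// na.drop removes rows containing any nulls
val remaining = df.na.drop().count()
```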
rollup Untyped Transformation
```scala
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset
```
rollup creates a RelationalGroupedDataset to compute hierarchical subtotals across the given columns, from the leftmost column up to a grand total.
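A hedged sketch contrasting rollup with cube, assuming a local SparkSession and illustrative data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rollup-demo").getOrCreate()
import spark.implicits._

val sales = Seq(("US", "web", 10), ("US", "shop", 20), ("PL", "web", 5))
  .toDF("country", "channel", "amount")

// rollup produces hierarchical (left-to-right) subtotals only:
// 3 (country, channel) pairs + 2 country subtotals + 1 grand total,
// but no channel-only subtotals (which cube would also produce)
val rolled = sales.rollup("country", "channel").sum("amount")
```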
select Untyped Transformation
```scala
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
```
select projects a set of columns (given by name or as Column expressions) onto a new DataFrame.
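A minimal sketch of the two select variants, assuming a local SparkSession (column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq((0, "zero"), (1, "one")).toDF("id", "name")

// by column name
val byName = df.select("id").columns.toSeq

// by Column expressions; either way the result is a DataFrame
val byExpr = df.select($"id", upper($"name") as "big").columns.toSeq
```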
Projecting Columns using SQL Statements — selectExpr Untyped Transformation
```scala
selectExpr(exprs: String*): DataFrame
```
selectExpr is like select, but accepts SQL statements.
```
val ds = spark.range(5)

scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
|             random|
+-------------------+
|  0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+
```
Internally, selectExpr executes select with every expression in exprs mapped to a Column (using SparkSqlParser.parseExpression).
```
scala> ds.select(expr("rand() as random")).show
+------------------+
|            random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+
```
stat Untyped Transformation
```scala
stat: DataFrameStatFunctions
```
stat simply creates a DataFrameStatFunctions to work with statistic functions.
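A small sketch using one of the statistic functions, approxQuantile, assuming a local SparkSession (the column name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("stat-demo").getOrCreate()
import spark.implicits._

val df = spark.range(0, 100).toDF("n")

// approxQuantile(column, probabilities, relativeError);
// relativeError = 0.0 requests an exact computation
val Array(median) = df.stat.approxQuantile("n", Array(0.5), 0.0)
```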