Dataset API — Dataset Operators

Dataset API is a set of operators with typed and untyped transformations, and actions to work with a structured query (as a Dataset) as a whole.

Operator Description

agg



agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame

An untyped transformation

alias



alias(alias: String): Dataset[T]
alias(alias: Symbol): Dataset[T]

A typed transformation that is a mere synonym of as.

apply



apply(colName: String): Column

An untyped transformation to select a column based on the column name (i.e. maps a Dataset onto a Column)

as



as(alias: String): Dataset[T]
as(alias: Symbol): Dataset[T]

A typed transformation

as



as[U : Encoder]: Dataset[U]

A typed transformation to enforce a type, i.e. marking the records in the Dataset as of a given data type (data type conversion). as simply changes the view of the data that is passed into typed operations (e.g. map) and does not eagerly project away any columns that are not present in the specified class.
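
The remark about columns not being projected away can be checked with a minimal sketch. The Person case class, the column names and the local SparkSession are assumptions made up for this illustration only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, introduced only for this illustration
case class Person(id: Long, name: String)

object AsDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("as-demo")
    .getOrCreate()
  import spark.implicits._

  // An untyped DataFrame with one column more than Person declares
  val df = Seq((1L, "Ann", "x"), (2L, "Bob", "y")).toDF("id", "name", "extra")

  // as[Person] only changes the typed view; the extra column is still part of the schema
  val people: Dataset[Person] = df.as[Person]
  val columnsAfterAs: Seq[String] = people.columns.toSeq

  // A typed operation (map) works on Person objects, so only Person's fields remain
  val columnsAfterMap: Seq[String] = people.map(p => p).columns.toSeq

  def main(args: Array[String]): Unit = {
    println(columnsAfterAs)
    println(columnsAfterMap)
    spark.stop()
  }
}
```

If the extra column has to go right away, an explicit select before as[Person] is one way to drop it.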

cache



cache(): this.type

A basic action that is a mere synonym of persist.

checkpoint



checkpoint(): Dataset[T]
checkpoint(eager: Boolean): Dataset[T]

A basic action to checkpoint the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)
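
A minimal sketch of checkpointing follows. It assumes a local SparkSession and uses a local temporary directory in place of a reliable file system, which is good enough for experimenting but not for production use.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("checkpoint-demo")
    .getOrCreate()

  // Assumption: a local temp directory stands in for HDFS or S3
  val checkpointDir: String = Files.createTempDirectory("spark-checkpoint").toString
  // A checkpoint directory has to be set before checkpoint is called
  spark.sparkContext.setCheckpointDir(checkpointDir)

  val numbers = spark.range(0, 100)

  // eager = true writes the checkpoint files immediately
  val checkpointed = numbers.checkpoint(eager = true)
  val rowCount: Long = checkpointed.count()

  def main(args: Array[String]): Unit = {
    println(s"Rows after checkpoint: $rowCount")
    spark.stop()
  }
}
```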

coalesce



coalesce(numPartitions: Int): Dataset[T]

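A typed transformation to reduce the number of partitions of a Dataset to numPartitions (without a shuffle). Use repartition to increase the number of partitions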
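
The following sketch shows how coalesce behaves when asked for fewer and for more partitions than the Dataset has. The partition counts (8, 2 and 16) and the local SparkSession are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("coalesce-demo")
    .getOrCreate()

  // Start from a known number of partitions (8 is an arbitrary choice)
  val wide = spark.range(0, 100).repartition(8)

  // coalesce merges existing partitions down to the requested number
  val narrowed: Int = wide.coalesce(2).rdd.getNumPartitions

  // Asking for more partitions than there are leaves the count unchanged
  val notGrown: Int = wide.coalesce(16).rdd.getNumPartitions

  def main(args: Array[String]): Unit = {
    println(s"coalesce(2): $narrowed partitions, coalesce(16): $notGrown partitions")
    spark.stop()
  }
}
```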

col



col(colName: String): Column

An untyped transformation to create a column (reference) based on the column name

collect



collect(): Array[T]

An action

colRegex



colRegex(colName: String): Column

An untyped transformation to create a column (reference) based on the column name specified as a regex

columns



columns: Array[String]

A basic action

count



count(): Long

An action to count the number of rows

createGlobalTempView



createGlobalTempView(viewName: String): Unit

A basic action

createOrReplaceGlobalTempView



createOrReplaceGlobalTempView(viewName: String): Unit

A basic action

createOrReplaceTempView



createOrReplaceTempView(viewName: String): Unit

A basic action

createTempView



createTempView(viewName: String): Unit

A basic action

crossJoin



crossJoin(right: Dataset[_]): DataFrame

An untyped transformation

cube



cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation

describe



describe(cols: String*): DataFrame

An action

distinct



distinct(): Dataset[T]

A typed transformation that is a mere synonym of dropDuplicates (with all the columns of the Dataset)

drop



drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame

An untyped transformation

dropDuplicates



dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

A typed transformation
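
A small sketch of how distinct relates to dropDuplicates with and without column names. The sample rows and the column names key and value are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object DropDuplicatesDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("drop-duplicates-demo")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data with one fully duplicated row
  val events = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

  // distinct and dropDuplicates() both compare all the columns
  val distinctCount: Long = events.distinct().count()
  val allColumnsCount: Long = events.dropDuplicates().count()

  // With column names, only those columns decide what a duplicate is
  val byKeyCount: Long = events.dropDuplicates("key").count()

  def main(args: Array[String]): Unit = {
    println(s"distinct=$distinctCount, dropDuplicates=$allColumnsCount, by key=$byKeyCount")
    spark.stop()
  }
}
```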

dtypes



dtypes: Array[(String, String)]

A basic action

except



except(other: Dataset[T]): Dataset[T]

A typed transformation

exceptAll



exceptAll(other: Dataset[T]): Dataset[T]

(New in 2.4.0) A typed transformation

explain



explain(): Unit
explain(extended: Boolean): Unit

A basic action to display the logical and physical plans of the Dataset (with optional cost and codegen summaries) on the standard output. With no arguments (or with extended disabled), only the physical plan is printed

filter



filter(condition: Column): Dataset[T]
filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]

A typed transformation

first



first(): T

An action that is a mere synonym of head

flatMap



flatMap[U : Encoder](func: T => TraversableOnce[U]): Dataset[U]

A typed transformation

foreach



foreach(f: T => Unit): Unit

An action

foreachPartition



foreachPartition(f: Iterator[T] => Unit): Unit

An action

groupBy



groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation

groupByKey



groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

A typed transformation

head



head(): T
head(n: Int): Array[T]

An action to return the first record (head(), which uses 1 for n) or the first n records (head(n)) of the Dataset

hint



hint(name: String, parameters: Any*): Dataset[T]

A basic action to specify a hint (and optional parameters)

inputFiles



inputFiles: Array[String]

A basic action

intersect



intersect(other: Dataset[T]): Dataset[T]

A typed transformation

intersectAll



intersectAll(other: Dataset[T]): Dataset[T]

(New in 2.4.0) A typed transformation

isEmpty



isEmpty: Boolean

(New in 2.4.0) A basic action

isLocal



isLocal: Boolean

A basic action

isStreaming



isStreaming: Boolean

Indicates whether the Dataset contains one or more streaming sources

join



join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

An untyped transformation

joinWith



joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

A typed transformation

limit



limit(n: Int): Dataset[T]

A typed transformation

localCheckpoint



localCheckpoint(): Dataset[T]
localCheckpoint(eager: Boolean): Dataset[T]

A basic action to checkpoint the Dataset locally on executors (and therefore unreliably)

map



map[U: Encoder](func: T => U): Dataset[U]

A typed transformation

mapPartitions



mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]

A typed transformation

na



na: DataFrameNaFunctions

An untyped transformation

orderBy



orderBy(sortExprs: Column*): Dataset[T]
orderBy(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation

persist



persist(): this.type
persist(newLevel: StorageLevel): this.type

A basic action to persist the Dataset

printSchema



printSchema(): Unit

A basic action

randomSplit



randomSplit(weights: Array[Double]): Array[Dataset[T]]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

A typed transformation to split a Dataset randomly into multiple Datasets, one per weight (the weights are normalized if they do not sum up to 1)
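
A minimal sketch of randomSplit with three weights follows. The weights, the seed and the input range are arbitrary values picked for this example, and the sizes of the individual splits are only approximately proportional to the weights.

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("random-split-demo")
    .getOrCreate()

  val data = spark.range(0, 1000)

  // One Dataset comes back for every weight in the array
  val splits = data.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)
  val numSplits: Int = splits.length

  // Every input row lands in exactly one split
  val totalRows: Long = splits.map(_.count()).sum

  def main(args: Array[String]): Unit = {
    println(s"$numSplits splits with $totalRows rows in total")
    spark.stop()
  }
}
```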

rdd



rdd: RDD[T]

A basic action

reduce



reduce(func: (T, T) => T): T

An action to reduce the records of the Dataset using the specified binary function.
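
As a short illustration, the sketch below sums the records of a small Dataset with reduce. The input values and the local SparkSession are assumptions for the example. Since records from different partitions can be combined in any order, the function should be commutative and associative.

```scala
import org.apache.spark.sql.SparkSession

object ReduceDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("reduce-demo")
    .getOrCreate()
  import spark.implicits._

  val numbers = Seq(1, 2, 3, 4).toDS()

  // Addition is commutative and associative, so the combine order does not matter
  val total: Int = numbers.reduce((a, b) => a + b)

  def main(args: Array[String]): Unit = {
    println(s"Sum: $total")
    spark.stop()
  }
}
```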

repartition



repartition(partitionExprs: Column*): Dataset[T]
repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

A typed transformation to repartition a Dataset

repartitionByRange



repartitionByRange(partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]

A typed transformation

rollup



rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation

sample



sample(withReplacement: Boolean, fraction: Double): Dataset[T]
sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]

A typed transformation

schema



schema: StructType

A basic action

select



select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame

select[U1](c1: TypedColumn[T, U1]): Dataset[U1]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
select[U1, U2, U3](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3],
  c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](
  c1: TypedColumn[T, U1],
  c2: TypedColumn[T, U2],
  c3: TypedColumn[T, U3],
  c4: TypedColumn[T, U4],
  c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

An (untyped and typed) transformation

selectExpr



selectExpr(exprs: String*): DataFrame

An untyped transformation

show



show(): Unit
show(truncate: Boolean): Unit
show(numRows: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Int, vertical: Boolean): Unit

An action

sort



sort(sortExprs: Column*): Dataset[T]
sort(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements globally (across partitions). Use sortWithinPartitions transformation for partition-local sort

sortWithinPartitions



sortWithinPartitions(sortExprs: Column*): Dataset[T]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

A typed transformation to sort elements within partitions (aka local sort). Use sort transformation for global sort (across partitions)
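
The difference between the global sort and the partition-local sort can be sketched as follows. The input values and the two partitions are arbitrary choices, and glom (an RDD method) is used here only to inspect the contents of each partition.

```scala
import org.apache.spark.sql.SparkSession

object SortDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("sort-demo")
    .getOrCreate()
  import spark.implicits._

  // A Dataset[Int] has a single column called "value"
  val numbers = Seq(5, 3, 8, 1, 9, 2).toDS().repartition(2)

  // sort orders the records across all the partitions
  val globallySorted: Seq[Int] = numbers.sort("value").collect().toSeq

  // sortWithinPartitions orders the records of each partition independently
  val perPartition: Array[Array[Int]] =
    numbers.sortWithinPartitions("value").rdd.glom().collect()
  val everyPartitionSorted: Boolean =
    perPartition.forall(p => p.sameElements(p.sorted))

  def main(args: Array[String]): Unit = {
    println(globallySorted.mkString(", "))
    perPartition.foreach(p => println(p.mkString("[", ", ", "]")))
    spark.stop()
  }
}
```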

stat



stat: DataFrameStatFunctions

An untyped transformation

storageLevel



storageLevel: StorageLevel

A basic action

summary



summary(statistics: String*): DataFrame

An action to calculate statistics (e.g. count, mean, stddev, min, max and 25%, 50%, 75% percentiles)

take



take(n: Int): Array[T]

An action to take the first n records of a Dataset

toDF



toDF(): DataFrame
toDF(colNames: String*): DataFrame

A basic action to convert a Dataset to a DataFrame

toJSON



toJSON: Dataset[String]

A typed transformation

toLocalIterator



toLocalIterator(): java.util.Iterator[T]

An action that returns an iterator with all rows in the Dataset. The iterator will consume as much memory as the largest partition in the Dataset.
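
A minimal sketch of consuming a Dataset on the driver through toLocalIterator, as an alternative to collecting everything at once. The input values and the three partitions are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

object LocalIteratorDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("local-iterator-demo")
    .getOrCreate()
  import spark.implicits._

  val numbers = (1 to 10).toDS().repartition(3)

  // Walks the rows on the driver through the returned java.util.Iterator
  def sumOnDriver(): Int = {
    val it = numbers.toLocalIterator()
    var sum = 0
    while (it.hasNext) sum += it.next()
    sum
  }

  def main(args: Array[String]): Unit = {
    println(s"Sum computed on the driver: ${sumOnDriver()}")
    spark.stop()
  }
}
```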

transform



transform[U](t: Dataset[T] => Dataset[U]): Dataset[U]

A typed transformation for chaining custom transformations
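
Chaining custom transformations could look like the sketch below. The helper functions withDoubled and onlyLarge, the column names and the threshold are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TransformDemo {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("transform-demo")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical custom transformations: plain functions from DataFrame to DataFrame
  def withDoubled(df: DataFrame): DataFrame =
    df.withColumn("doubled", col("value") * 2)

  def onlyLarge(df: DataFrame): DataFrame =
    df.filter(col("doubled") > 4)

  // transform lets the functions be chained in reading order
  val result: DataFrame = Seq(1, 2, 3).toDF("value")
    .transform(withDoubled)
    .transform(onlyLarge)

  val keptValues: Seq[Int] = result.select("value").collect().map(_.getInt(0)).toSeq

  def main(args: Array[String]): Unit = {
    result.show()
    spark.stop()
  }
}
```

Without transform the same pipeline would read inside out, as onlyLarge(withDoubled(df)).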

union



union(other: Dataset[T]): Dataset[T]

A typed transformation

unionByName



unionByName(other: Dataset[T]): Dataset[T]

A typed transformation

unpersist



unpersist(): this.type
unpersist(blocking: Boolean): this.type

A basic action to unpersist the Dataset. unpersist() uses unpersist with blocking disabled (false)

where



where(condition: Column): Dataset[T]
where(conditionExpr: String): Dataset[T]

A typed transformation

withColumn



withColumn(colName: String, col: Column): DataFrame

An untyped transformation

withColumnRenamed



withColumnRenamed(existingName: String, newName: String): DataFrame

An untyped transformation

write



write: DataFrameWriter[T]

A basic action that returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage
