Dataset API — Dataset Operators

The Dataset API is a set of operators (typed transformations, untyped transformations, and actions) for working with a structured query (as a Dataset) as a whole.
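The three kinds of operators can be told apart by what they return. The following is a minimal sketch, assuming a local SparkSession and a hypothetical `Person` case class (neither is part of the original page): typed transformations keep a `Dataset` of a JVM type, untyped transformations give a `DataFrame` (or a `Column`), and actions trigger execution and return a value to the driver.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// Hypothetical record type, used only for illustration
case class Person(name: String, age: Long)

object DatasetOperatorKinds {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the sketch
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-operator-kinds")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()

    // Typed transformations: the result is still a Dataset of a JVM type
    val adults: Dataset[Person] = people.filter(_.age >= 18)
    val names: Dataset[String] = adults.map(_.name)

    // Untyped transformations: the result is a DataFrame (Dataset[Row])
    val ages: DataFrame = people.select("age")
    val averageAge: DataFrame = people.agg(avg($"age"))

    // Actions: trigger execution and return a value to the driver
    val adultCount: Long = adults.count()
    val firstName: String = names.head()
    println(s"adults=$adultCount first=$firstName")
    averageAge.show()

    // Basic action: inspects the Dataset without running the query
    ages.printSchema()

    spark.stop()
  }
}
```

Nothing is computed until `count`, `head` or `show` is called; the transformations only build up the structured query.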

Table 1. Dataset Operators (Transformations and Actions)
Operator Description

agg

An untyped transformation

alias

A typed transformation that is a mere synonym of as.

apply

An untyped transformation to select a column based on the column name (i.e. maps a Dataset onto a Column)

as

A typed transformation to assign an alias to the Dataset

as

A typed transformation to enforce a type, i.e. marking the records in the Dataset as of a given data type (data type conversion). as simply changes the view of the data that is passed into typed operations (e.g. map) and does not eagerly project away any columns that are not present in the specified class.

cache

A basic action that is a mere synonym of persist.

checkpoint

A basic action to checkpoint the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)

coalesce

A typed transformation to reduce the number of partitions of a Dataset (without a shuffle)

col

An untyped transformation to create a column (reference) based on the column name

collect

An action

colRegex

An untyped transformation to create a column (reference) based on the column name specified as a regex

columns

A basic action

count

An action to count the number of rows

createGlobalTempView

A basic action

createOrReplaceGlobalTempView

A basic action

createOrReplaceTempView

A basic action

createTempView

A basic action

crossJoin

An untyped transformation

cube

An untyped transformation

describe

An action

distinct

A typed transformation that is a mere synonym of dropDuplicates (with all the columns of the Dataset)

drop

An untyped transformation

dropDuplicates

A typed transformation

dtypes

A basic action

except

A typed transformation

exceptAll

(New in 2.4.0) A typed transformation

explain

A basic action to display the logical and physical plans of the Dataset (with optional cost and codegen summaries) to the standard output

filter

A typed transformation

first

An action that is a mere synonym of head

flatMap

A typed transformation

foreach

An action

foreachPartition

An action

groupBy

An untyped transformation

groupByKey

A typed transformation

head

An action (the no-argument variant uses 1 for n)

hint

A basic action to specify a hint (and optional parameters)

inputFiles

A basic action

intersect

A typed transformation

intersectAll

(New in 2.4.0) A typed transformation

isEmpty

(New in 2.4.0) A basic action

isLocal

A basic action

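isStreaming

Returns true when the Dataset contains one or more streaming sources (that continuously return data as it arrives)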

join

An untyped transformation

joinWith

A typed transformation

limit

A typed transformation

localCheckpoint

A basic action to checkpoint the Dataset locally on executors (and therefore unreliably)

map

A typed transformation

mapPartitions

A typed transformation

na

An untyped transformation

orderBy

A typed transformation

persist

A basic action to persist the Dataset

printSchema

A basic action

randomSplit

A typed transformation to split a Dataset randomly into multiple Datasets (one per given weight)

rdd

A basic action

reduce

An action to reduce the records of the Dataset using the specified binary function.

repartition

A typed transformation to repartition a Dataset

repartitionByRange

A typed transformation

rollup

An untyped transformation

sample

A typed transformation

schema

A basic action

select

An (untyped and typed) transformation

selectExpr

An untyped transformation

show

An action

sort

A typed transformation to sort elements globally (across partitions). Use sortWithinPartitions transformation for partition-local sort

sortWithinPartitions

A typed transformation to sort elements within partitions (aka local sort). Use sort transformation for global sort (across partitions)

stat

An untyped transformation

storageLevel

A basic action

summary

An action to calculate statistics (e.g. count, mean, stddev, min, max and 25%, 50%, 75% percentiles)

take

An action to take the first records of a Dataset

toDF

A basic action to convert a Dataset to a DataFrame

toJSON

A typed transformation

toLocalIterator

An action that returns an iterator with all rows in the Dataset. The iterator will consume as much memory as the largest partition in the Dataset.

transform

A typed transformation for chaining custom transformations

union

A typed transformation

unionByName

A typed transformation

unpersist

A basic action to unpersist the Dataset (the no-argument variant uses unpersist with blocking disabled, i.e. false)

where

A typed transformation

withColumn

An untyped transformation

withColumnRenamed

An untyped transformation

write

A basic action that returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage
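A few of the descriptions in the table can be checked directly. The following is a minimal sketch, assuming a local SparkSession and a hypothetical `Event` case class (both introduced here for illustration only): `as[U]` keeps columns that the target class does not declare, `distinct` behaves like `dropDuplicates` over all columns, `first` agrees with `head`, and `sort` differs from `sortWithinPartitions` in scope.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for illustration
case class Event(id: Long, kind: String)

object DatasetOperatorNotes {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the sketch
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dataset-operator-notes")
      .getOrCreate()
    import spark.implicits._

    // A DataFrame with one more column ("source") than Event declares
    val df = Seq((2L, "view", "app"), (1L, "click", "web"), (1L, "click", "web"))
      .toDF("id", "kind", "source")

    // as[Event] changes only the typed view; "source" stays in the schema
    val events: Dataset[Event] = df.as[Event]
    assert(events.columns.contains("source"))

    // distinct is dropDuplicates with all the columns of the Dataset
    assert(events.distinct().count() == events.dropDuplicates().count())

    // sort orders rows globally; first and head then return the same record
    val sorted = events.sort($"id")
    assert(sorted.first() == sorted.head())

    // sortWithinPartitions orders rows only inside each partition
    val locallySorted = events.repartition(2).sortWithinPartitions($"id")
    locallySorted.explain()

    // coalesce lowers the number of partitions of an existing Dataset
    val narrowed = events.repartition(4).coalesce(1)
    println(s"partitions after coalesce: ${narrowed.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Comparing the output of `explain` for `sorted` and `locallySorted` is one way to see that only the global sort requires a range-partitioning exchange.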
