Typed Transformations

Dataset API — Typed Transformations

Typed transformations are part of the Dataset API for transforming a Dataset with an Encoder (except the RowEncoder).

Note
Typed transformations are the methods in the Dataset Scala class that are grouped under the typedrel group name, i.e. @group typedrel.

Table 1. Dataset API’s Typed Transformations
Transformation Description

alias

as

as

coalesce

Repartitions a Dataset

distinct

dropDuplicates

except

filter

flatMap

groupByKey

intersect

joinWith

limit

map

mapPartitions

orderBy

randomSplit

repartition

repartitionByRange

sample

select

sort

sortWithinPartitions

toJSON

transform

union

unionByName

where

as Typed Transformation

as…​FIXME

Enforcing Type — as Typed Transformation

as[T] allows for converting from a weakly-typed Dataset of Rows to Dataset[T] with T being a domain class (that can enforce a stronger schema).
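A minimal sketch of as[T], assuming a local SparkSession and a hypothetical Person domain class (neither is part of the original text). It can be pasted into spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class, used only for this illustration
case class Person(id: Long, name: String)

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("as-sketch").getOrCreate()
import spark.implicits._

// A weakly-typed Dataset of Rows (a DataFrame)
val df: DataFrame = Seq((0L, "Agata"), (1L, "Jacek")).toDF("id", "name")

// as[Person] matches columns to fields by name and attaches the Person encoder
val people: Dataset[Person] = df.as[Person]

// Fields can now be accessed with their compile-time types
val names: Array[String] = people.map(_.name).collect()
```

A column name or type that does not match the fields of Person is reported when the Dataset is analyzed, not when the records are read.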

Repartitioning Dataset with Shuffle Disabled — coalesce Typed Transformation

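coalesce operator repartitions the Dataset to at most numPartitions partitions. It can only reduce the number of partitions: requesting more partitions than the Dataset currently has leaves the number unchanged.

Internally, coalesce creates a Repartition logical operator with shuffle disabled (shown as false beside Repartition in the output of explain).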
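The following sketch, assuming a local SparkSession, shows coalesce reducing the number of partitions and prints the extended plan so the Repartition operator and its shuffle flag can be inspected. The value names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("coalesce-sketch").getOrCreate()

// Start from 8 partitions (repartition shuffles to exactly 8)
val eight = spark.range(100).repartition(8)

// Shrinking to 2 partitions does not require a shuffle
val two = eight.coalesce(2)
val twoCount = two.rdd.getNumPartitions

// Asking for more partitions than exist does not grow the Dataset
val stillEight = eight.coalesce(16).rdd.getNumPartitions

// The extended plan prints the Repartition operator with its shuffle flag
two.explain(extended = true)
```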

dropDuplicates Typed Transformation

dropDuplicates…​FIXME

except Typed Transformation

except…​FIXME

exceptAll Typed Transformation

exceptAll…​FIXME

filter Typed Transformation

filter…​FIXME

Creating Zero or More Records — flatMap Typed Transformation

flatMap returns a new Dataset (of type U) with all records (of type T) mapped over using the function func and then flattening the results.

Note
flatMap can create new records. It supersedes the deprecated explode operator.

Internally, flatMap calls mapPartitions and applies func to every record of a partition with the iterator's own flatMap.
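A small sketch of flatMap producing zero or more records per input record, next to the equivalent mapPartitions call. It assumes a local SparkSession; toWords is a hypothetical helper introduced for the illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("flatMap-sketch").getOrCreate()
import spark.implicits._

val lines = Seq("hello world", "", "typed transformations").toDS()

// Hypothetical helper: one line becomes zero or more words
val toWords: String => Seq[String] = _.split("\\s+").filter(_.nonEmpty).toSeq

// The empty line contributes no records at all
val words = lines.flatMap(toWords).collect()

// The same result, written the way flatMap is described to work internally
val viaPartitions = lines.mapPartitions((it: Iterator[String]) => it.flatMap(toWords)).collect()
```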

intersect Typed Transformation

intersect…​FIXME

intersectAll Typed Transformation

intersectAll…​FIXME

joinWith Typed Transformation

joinWith…​FIXME

limit Typed Transformation

limit…​FIXME

map Typed Transformation

map…​FIXME

mapPartitions Typed Transformation

mapPartitions…​FIXME

Randomly Split Dataset Into Two or More Datasets Per Weight — randomSplit Typed Transformation

randomSplit randomly splits the Dataset per weights.

weights doubles should sum up to 1 and will be normalized if they do not.

You can define seed and if you don’t, a random seed will be used.

Note
randomSplit is commonly used in Spark MLlib to split an input Dataset into two datasets for training and validation.
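A sketch of a training and validation split with randomSplit, assuming a local SparkSession. The 0.8 and 0.2 weights and the seed value are arbitrary choices for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("randomSplit-sketch").getOrCreate()

val records = spark.range(1000)

// Roughly 80% for training and 20% for validation; a fixed seed makes the split repeatable
val Array(training, validation) = records.randomSplit(Array(0.8, 0.2), seed = 42L)
val total = training.count() + validation.count()

// Weights that do not sum up to 1 are normalized, so 4:1 means the same proportions
val Array(bigger, smaller) = records.randomSplit(Array(4.0, 1.0), seed = 42L)
val totalNormalized = bigger.count() + smaller.count()
```

The sizes of the individual splits are only approximately proportional to the weights, since every record is assigned at random.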

Repartitioning Dataset (Shuffle Enabled) — repartition Typed Transformation

repartition operators repartition the Dataset to exactly numPartitions partitions or using partitionExprs expressions.

Internally, repartition creates a Repartition or RepartitionByExpression logical operator with shuffle enabled (shown as true beside Repartition in the output of explain).

Note
repartition methods correspond to SQL’s DISTRIBUTE BY or CLUSTER BY clauses.
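The two flavours of repartition can be sketched as follows, assuming a local SparkSession. The partition count of 4 and the id column come from the example Dataset, not from the original text.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("repartition-sketch").getOrCreate()
import spark.implicits._

val numbers = spark.range(100)

// By number of partitions only: a Repartition operator with shuffle enabled
val byNumber = numbers.repartition(4)
val byNumberCount = byNumber.rdd.getNumPartitions

// By number of partitions and expressions: a RepartitionByExpression operator
val byExpression = numbers.repartition(4, $"id")
val byExpressionCount = byExpression.rdd.getNumPartitions

// The extended plans show which logical operator was created
byNumber.explain(extended = true)
byExpression.explain(extended = true)
```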

repartitionByRange Typed Transformation

When called without numPartitions, repartitionByRange uses the spark.sql.shuffle.partitions configuration property for the number of partitions to use.

repartitionByRange simply creates a Dataset with a RepartitionByExpression logical operator.

repartitionByRange uses a SortOrder with the Ascending sort order, i.e. ascending nulls first, when no explicit sort order is specified.

repartitionByRange throws an IllegalArgumentException when no partitionExprs partition-by expression is specified.
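A sketch of repartitionByRange, including the failure when no partition-by expression is given. It assumes a local SparkSession; Try is used only to capture the exception for inspection.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("repartitionByRange-sketch").getOrCreate()
import spark.implicits._

val numbers = spark.range(100)

// Range-partition by id; no sort order is given, so ascending is used
val ranged = numbers.repartitionByRange(4, $"id")
val rangedCount = ranged.rdd.getNumPartitions

// The extended plan shows the RepartitionByExpression operator
ranged.explain(extended = true)

// No partition-by expression at all is rejected when the operator is called
val noExpressions = Try(numbers.repartitionByRange(4))
```

Range boundaries are estimated by sampling, so the resulting number of partitions can be lower than requested.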

sample Typed Transformation

sample…​FIXME

select Typed Transformation

select…​FIXME

sort Typed Transformation

sort…​FIXME

sortWithinPartitions Typed Transformation

sortWithinPartitions simply calls the internal sortInternal method with the global flag disabled (false).
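The effect of the disabled global flag can be sketched as below, assuming a local SparkSession: every partition is sorted on its own, with no ordering guaranteed across partitions. The input values are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("sortWithinPartitions-sketch").getOrCreate()
import spark.implicits._

// Two partitions with unsorted numbers
val unsorted = Seq(3, 1, 2, 6, 5, 4).toDS().repartition(2)

// Sort every partition independently (no global ordering)
val sortedLocally = unsorted.sortWithinPartitions($"value")

// glom exposes the content of every partition as an array
val perPartition: Array[Array[Int]] = sortedLocally.rdd.glom().collect()
val everyPartitionSorted = perPartition.forall(p => p.sameElements(p.sorted))
```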

toJSON Typed Transformation

toJSON maps the content of Dataset to a Dataset of strings in JSON format.

Internally, toJSON grabs the RDD[InternalRow] (of the QueryExecution of the Dataset) and maps the records (per RDD partition) into JSON.

Note
toJSON uses the Jackson JSON library (jackson-module-scala) to generate JSON.
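A short sketch of toJSON on a one-record Dataset, assuming a local SparkSession. The id and name columns are illustrative.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("toJSON-sketch").getOrCreate()
import spark.implicits._

val record = Seq((1, "one")).toDF("id", "name")

// Every record becomes one JSON document with the column names as field names
val asJson: Dataset[String] = record.toJSON
val firstJson: String = asJson.head()
```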

Transforming Datasets — transform Typed Transformation

transform applies the t function to the source Dataset[T] to produce a result Dataset[U]. It is meant for chaining custom transformations.

Internally, transform executes the t function on the current Dataset[T].
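Chaining custom transformations with transform might look like the sketch below. It assumes a local SparkSession; evens and asLabels are hypothetical transformations written for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("transform-sketch").getOrCreate()
import spark.implicits._

// Hypothetical custom transformations, each a Dataset[T] => Dataset[U] function
def evens(ds: Dataset[Long]): Dataset[Long] = ds.filter((n: Long) => n % 2 == 0)
def asLabels(ds: Dataset[Long]): Dataset[String] = ds.map((n: Long) => s"n=$n")

// transform keeps the pipeline readable from left to right
val labels = Seq(1L, 2L, 3L, 4L).toDS()
  .transform(evens)
  .transform(asLabels)
  .collect()
```

Without transform the same pipeline would be written inside out, as asLabels(evens(ds)).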

union Typed Transformation

union…​FIXME

unionByName Typed Transformation

unionByName creates a new Dataset that is a union of the rows in this and the other Dataset, with columns matched by name rather than by position, i.e. the order of columns in the Datasets does not matter as long as their names and number match.

Internally, unionByName creates a Union logical operator for this Dataset and Project logical operator with the other Dataset.

In the end, unionByName applies the CombineUnions logical optimization to the Union logical operator and requests the result LogicalPlan to wrap the child operators with AnalysisBarriers.

unionByName throws an AnalysisException if there are duplicate columns in either Dataset.

unionByName throws an AnalysisException if this Dataset has a column that is not available in the other Dataset.
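A sketch of unionByName with columns in a different order, followed by the failure for a missing column. It assumes a local SparkSession; the Datasets are made up, and collect is called inside Try so the error surfaces regardless of when a given Spark version checks the columns.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("unionByName-sketch").getOrCreate()
import spark.implicits._

val left = Seq((1, "a")).toDF("id", "name")
// Same column names, different order
val right = Seq(("b", 2)).toDF("name", "id")

// Columns of the other Dataset are re-ordered to match this Dataset
val byName = left.unionByName(right)
val resultColumns = byName.columns
val resultRows = byName.collect().map(r => (r.getInt(0), r.getString(1))).toSet

// The other Dataset lacks the id column
val missingColumn = Try(left.unionByName(Seq("c").toDF("name")).collect())
```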

where Typed Transformation

where is simply a synonym of the filter operator, i.e. passes the input parameters along to filter.
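Since where only passes its arguments along, it gives the same result as filter, as in this sketch that assumes a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("where-sketch").getOrCreate()
import spark.implicits._

val ids = spark.range(10)

// Column-based condition, through where and through filter
val viaWhere = ids.where($"id" > 5).count()
val viaFilter = ids.filter($"id" > 5).count()

// where also accepts a SQL expression as text
val viaSqlText = ids.where("id > 5").count()
```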

Creating Streaming Dataset with EventTimeWatermark Logical Operator — withWatermark Streaming Typed Transformation

Internally, withWatermark creates a Dataset with EventTimeWatermark logical plan for streaming Datasets.

Note
withWatermark uses EliminateEventTimeWatermark logical rule to eliminate EventTimeWatermark logical plan for non-streaming batch Datasets.

Note

delayThreshold is parsed using CalendarInterval.fromString with interval formatted as described in TimeWindow unary expression.

Note
delayThreshold must not be negative (both milliseconds and months must be greater than or equal to 0).

Note
withWatermark is used when…​FIXME
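A sketch of withWatermark on a streaming Dataset, assuming a local SparkSession and the built-in rate source, which is chosen here only because it needs no external system. No streaming query is started; the sketch inspects the logical plan only.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("withWatermark-sketch").getOrCreate()

// The rate source gives a streaming Dataset with timestamp and value columns
val events = spark.readStream.format("rate").load()

// Events arriving more than 10 seconds late may be dropped by stateful operators
val watermarked = events.withWatermark("timestamp", "10 seconds")

// The logical plan now includes the EventTimeWatermark operator
val logicalPlan = watermarked.queryExecution.logical.toString

// A negative delay threshold is rejected
val negativeDelay = Try(events.withWatermark("timestamp", "-10 seconds"))
```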