
Dataset Join Operators


From PostgreSQL’s 2.6. Joins Between Tables:

Queries can access multiple tables at once, or access the same table in such a way that multiple rows of the table are being processed at the same time. A query that accesses multiple rows of the same or different tables at one time is called a join query.

You can join two datasets using the join operators with an optional join condition.

Table 1. Join Operators

| Operator | Return Type | Description |
|---|---|---|
| crossJoin | DataFrame | Untyped Row-based cross join |
| join | DataFrame | Untyped Row-based join |
| joinWith | Dataset | Used for a type-preserving join with two output columns for records for which a join condition holds |

You can also use SQL mode to join datasets using good ol’ SQL.

You can specify a join condition (aka join expression) as part of join operators or using where or filter operators.
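As a sketch (assuming a local SparkSession and two small, made-up DataFrames), the same inner join can be expressed with a join condition inside the join operator, with a where operator afterwards, or in plain SQL:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("join-condition").getOrCreate()
import spark.implicits._

val left  = Seq((0, "zero"), (1, "one")).toDF("id", "l")
val right = Seq((0, "zero"), (2, "two")).toDF("id", "r")

// 1. Join condition as part of the join operator
val q1 = left.join(right, left("id") === right("id"))

// 2. Condition-less join narrowed down with a where (or filter) operator;
//    the optimizer pushes the predicate into the join condition
val q2 = left.join(right).where(left("id") === right("id"))

// 3. Good ol' SQL
left.createOrReplaceTempView("l")
right.createOrReplaceTempView("r")
val q3 = spark.sql("SELECT * FROM l JOIN r ON l.id = r.id")
```

All three queries return the single matching row (id = 0).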

You can specify the join type as part of join operators (using joinType optional parameter).

Table 2. Join Types

| SQL | Name (joinType) | JoinType |
|---|---|---|
| CROSS | cross | Cross |
| INNER | inner | Inner |
| FULL OUTER | outer, full, fullouter | FullOuter |
| LEFT ANTI | leftanti | LeftAnti |
| LEFT OUTER | leftouter, left | LeftOuter |
| LEFT SEMI | leftsemi | LeftSemi |
| RIGHT OUTER | rightouter, right | RightOuter |
| NATURAL | Special case for Inner, LeftOuter, RightOuter, FullOuter | NaturalJoin |
| USING | Special case for Inner, LeftOuter, LeftSemi, RightOuter, FullOuter, LeftAnti | UsingJoin |

ExistenceJoin is an artificial join type used to express an existential sub-query, often referred to as an existential join.

Note
LeftAnti and ExistenceJoin are special cases of LeftOuter.

You can also find that Spark SQL groups join types into two families: InnerLike (Inner and Cross) and LeftExistence (LeftSemi, LeftAnti and ExistenceJoin).

Tip
Names are case-insensitive and may use the underscore (_) at any position, i.e. left_anti and LEFT_ANTI are equivalent.
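For instance (a sketch assuming a local SparkSession and two made-up DataFrames), the following spellings all resolve to the same LeftAnti join type:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("join-type-names").getOrCreate()
import spark.implicits._

val left  = Seq((0, "zero"), (1, "one")).toDF("id", "l")
val right = Seq((0, "zero")).toDF("id", "r")

// Equivalent spellings of the LEFT ANTI join type
val q1 = left.join(right, Seq("id"), "leftanti")
val q2 = left.join(right, Seq("id"), "left_anti")
val q3 = left.join(right, Seq("id"), "LEFT_ANTI")
// Each keeps only the left rows with no match on the right, i.e. id = 1
```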
Note
Spark SQL offers several join strategies, among them Broadcast Joins (aka Map-Side Joins), that are meant to optimize join queries over large distributed datasets.
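One way to request a broadcast join explicitly (a sketch with made-up DataFrames; the optimizer may also choose one on its own based on size estimates) is the broadcast function:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.master("local[*]").appName("broadcast-join").getOrCreate()
import spark.implicits._

val large = Seq((0, "zero"), (1, "one"), (2, "two")).toDF("id", "value")
val small = Seq((0, "yes"), (1, "no")).toDF("id", "flag")

// Hint that the small side should be broadcast to every executor,
// turning a shuffle join into a map-side (broadcast hash) join
val q = large.join(broadcast(small), Seq("id"))
```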

join Operators

  1. join(right: Dataset[_]): DataFrame (condition-less inner join)

  2. join(right: Dataset[_], usingColumn: String): DataFrame (inner join with a single column that exists on both sides)

  3. join(right: Dataset[_], usingColumns: Seq[String]): DataFrame (inner join with columns that exist on both sides)

  4. join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame (equi-join with explicit join type)

  5. join(right: Dataset[_], joinExprs: Column): DataFrame (inner join)

  6. join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame (join with explicit join type; self-joins are acceptable)
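The variants above can be sketched against two small, made-up DataFrames (assuming a local SparkSession; spark.sql.crossJoin.enabled is set so the condition-less variant is allowed to produce a Cartesian product):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("join-variants")
  .config("spark.sql.crossJoin.enabled", "true") // allow the condition-less join below
  .getOrCreate()
import spark.implicits._

val left  = Seq((0, "zero"), (1, "one")).toDF("id", "l")
val right = Seq((0, "zero"), (2, "two")).toDF("id", "r")

val q1 = left.join(right)                                           // 1. condition-less inner join
val q2 = left.join(right, "id")                                     // 2. single using-column
val q3 = left.join(right, Seq("id"))                                // 3. using-columns
val q4 = left.join(right, Seq("id"), "left_outer")                  // 4. equi-join, explicit type
val q5 = left.join(right, left("id") === right("id"))               // 5. inner join on an expression
val q6 = left.join(right, left("id") === right("id"), "left_outer") // 6. explicit join type
```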

join joins two Datasets.

Internally, join(right: Dataset[_]) creates a DataFrame with a condition-less Join logical operator (in the current SparkSession).

Note
join(right: Dataset[_]) creates a logical plan with a condition-less Join operator with two child logical plans, one for each side of the join.
Note
join(right: Dataset[_], usingColumns: Seq[String], joinType: String) creates a logical plan with a condition-less Join operator with the UsingJoin join type.
Note

join(right: Dataset[_], joinExprs: Column, joinType: String) accepts self-joins where joinExprs compares a column with itself, e.g. df("id") === df("id").

Such a condition is trivially true and would otherwise be refused.

With the spark.sql.selfJoinAutoResolveAmbiguity option enabled (which it is by default), join automatically resolves such ambiguous join conditions into ones that make sense.
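A sketch of the trivially true self-join case (assuming a local SparkSession and a made-up DataFrame); with spark.sql.selfJoinAutoResolveAmbiguity enabled, the condition below is rewritten to compare the left copy's id with the right copy's id:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("self-join").getOrCreate()
import spark.implicits._

val df = Seq((0, "zero"), (1, "one")).toDF("id", "v")

// Both sides of === reference the very same column, which would be
// trivially true; Spark resolves it into left-id === right-id instead
val q = df.join(df, df("id") === df("id"))
```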

crossJoin Method

crossJoin joins two Datasets using the Cross join type with no condition.

Note
crossJoin creates an explicit Cartesian join that can be very expensive without an extra filter (that can be pushed down).
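A minimal sketch (assuming a local SparkSession and made-up DataFrames); the result size is the product of the input sizes, which is why a pushed-down filter usually matters:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("cross-join").getOrCreate()
import spark.implicits._

val nums    = Seq(1, 2, 3).toDF("n")
val letters = Seq("a", "b").toDF("c")

// Cartesian product: 3 * 2 = 6 rows
val q = nums.crossJoin(letters)
```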

Type-Preserving Joins — joinWith Operators

  1. joinWith(other: Dataset[U], condition: Column): Dataset[(T, U)] (inner equi-join)

joinWith creates a Dataset with two columns, _1 and _2, that each contain records for which the join condition holds.

Note
joinWith preserves type-safety with the original object types.
Note
joinWith creates a Dataset with Join logical plan.
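A sketch with made-up case classes (assuming a local SparkSession); note that the result is a Dataset of pairs, not a flattened DataFrame:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(id: Long, name: String)
case class Dept(id: Long, dept: String)

val spark = SparkSession.builder.master("local[*]").appName("join-with").getOrCreate()
import spark.implicits._

val people = Seq(Person(0, "ada"), Person(1, "bob")).toDS
val depts  = Seq(Dept(0, "eng")).toDS

// Columns _1 and _2 hold the original objects; type-safety is preserved
val joined: Dataset[(Person, Dept)] =
  people.joinWith(depts, people("id") === depts("id"))
```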