Subqueries (Subquery Expressions)
As of Spark 2.0, Spark SQL supports subqueries.
A subquery (aka subquery expression) is a query that is nested inside of another query.
There are the following kinds of subqueries:
-
A subquery as a source (inside a SQL
FROM
clause) -
A scalar subquery or a predicate subquery (as a column)
Every subquery can also be correlated or uncorrelated.
A scalar subquery is a structured query that returns a single row and a single column only. Spark SQL uses ScalarSubquery (SubqueryExpression) expression to represent scalar subqueries (while parsing a SQL statement).
1 2 3 4 5 |
// FIXME: ScalarSubquery in a logical plan |
A ScalarSubquery
expression appears as scalar-subquery#[exprId] [conditionString] in a logical plan.
1 2 3 4 5 |
// FIXME: Name of a ScalarSubquery in a logical plan |
It is said that scalar subqueries should be used very rarely if at all and you should join instead.
Spark Analyzer uses ResolveSubquery resolution rule to resolve subqueries and at the end makes sure that they are valid.
Catalyst Optimizer uses the following optimizations for subqueries:
-
PullupCorrelatedPredicates optimization to rewrite subqueries and pull up correlated predicates
-
RewriteCorrelatedScalarSubquery optimization to constructLeftJoins
Spark Physical Optimizer uses PlanSubqueries physical optimization to plan queries with scalar subqueries.
Caution
|
FIXME Describe how a physical ScalarSubquery is executed (cf. updateResult , eval and doGenCode ).
|