Dataset API vs SQL
Spark SQL supports two “modes” to write structured queries: Dataset API and SQL.
It turns out that some structured queries can be expressed more easily using Dataset API, while others are only possible in SQL. In other words, you may find mixing Dataset API and SQL modes challenging yet rewarding.
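As a rough illustration of the two modes and of mixing them, here is a minimal sketch. It assumes Spark SQL is on the classpath and a local master is acceptable; the sales data, the view name and the column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Local session for experimentation (assumption: no cluster is needed)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataset-api-vs-sql")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("books", 10), ("books", 5), ("toys", 7)).toDF("category", "amount")

// Dataset API mode: the query is built from method calls and Column expressions
val viaApi = sales.groupBy("category").agg(sum("amount").as("total"))

// SQL mode: register the dataset as a temporary view and query it with SQL text
sales.createOrReplaceTempView("sales")
val viaSql = spark.sql(
  "SELECT category, sum(amount) AS total FROM sales GROUP BY category")

// Mixing the modes: the result of spark.sql is a DataFrame,
// so Dataset API operators can be applied on top of it
val mixed = viaSql.where(col("total") > 10)
```

Both viaApi and viaSql describe the same aggregation; which one reads better is largely a matter of taste and of what the rest of the application uses.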
You could at some point consider writing structured queries using Catalyst data structures directly, hoping to avoid the differences and focus on what Spark SQL supports, but that could quickly become hard to maintain: few Spark SQL developers are comfortable working at that level, and such low-level code is likely to depend on a specific Spark SQL version.
This section describes the differences between the Spark SQL features available when developing Spark applications using Dataset API and using SQL mode.
- RuntimeReplaceable Expressions are only available using SQL mode by means of SQL functions like nvl, nvl2, ifnull, nullif, etc.
- Column.isin and SQL IN predicate with a subquery (and In Predicate Expression)
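The two items above can be sketched as follows. This is only an illustration under some assumptions: the people and vips datasets are hypothetical, and the second item is read as "Column.isin takes a list of values, while the SQL IN predicate can also take a subquery". SQL functions such as nvl are reached from Dataset API code here through expr, which parses a SQL expression.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Local session for experimentation (assumption: no cluster is needed)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sql-only-features")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: nickname is null for id 2
val people = Seq[(Int, Option[String])](
  (1, Some("Ann")), (2, None), (3, Some("Cy"))
).toDF("id", "nickname")
val vips = Seq(1, 3).toDF("id")

people.createOrReplaceTempView("people")
vips.createOrReplaceTempView("vips")

// nvl is a SQL function; expr lets a Dataset API query embed the SQL expression
val withDefault = people.select(
  col("id"),
  expr("nvl(nickname, 'n/a')").as("nickname"))

// Column.isin is given the candidate values directly
val isinValues = people.where(col("id").isin(1, 3))

// SQL mode: the IN predicate draws the candidate values from a subquery
val inSubquery = spark.sql(
  "SELECT * FROM people WHERE id IN (SELECT id FROM vips)")
```

With this data isinValues and inSubquery select the same rows, but only the SQL variant follows changes in vips without the values having to be collected first.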