
SparkOptimizer — Logical Query Plan Optimizer


SparkOptimizer is a concrete logical query plan optimizer with additional optimization rules (that extend the base logical optimization rules).

SparkOptimizer gives three extension points for additional optimization rules: preOptimizationBatches, postHocOptimizationBatches, and the User Provided Optimizers batch (i.e. the extraOptimizations of ExperimentalMethods).

SparkOptimizer is created when SessionState is requested for the Logical Optimizer the first time (through BaseSessionStateBuilder).

Figure 1. Creating SparkOptimizer

SparkOptimizer is available as the optimizer property of a session-specific SessionState.

You can access the optimized logical plan of a structured query through its QueryExecution as optimizedPlan.
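For example, you can access both from the Spark shell (a minimal sketch; the query itself is illustrative and sessionState is an unstable developer API):

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  // The session-specific logical optimizer (SparkOptimizer)
  val optimizer = spark.sessionState.optimizer

  // The optimized logical plan of a structured query, via QueryExecution
  val q = spark.range(10).where("id > 5")
  val optimizedPlan: LogicalPlan = q.queryExecution.optimizedPlan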

SparkOptimizer defines the custom default rule batches.

Table 1. SparkOptimizer’s Default Optimization Batch Rules (in the order of execution)

  • preOptimizationBatches (extension point for pre-optimization batches)

  • Base Logical Optimization Batches (the batches of the base logical optimizer)

  • Optimize Metadata Only Query (Once): OptimizeMetadataOnlyQuery

  • Extract Python UDF from Aggregate (Once): ExtractPythonUDFFromAggregate

  • Prune File Source Table Partitions (Once): PruneFileSourcePartitions

  • Push down operators to data source scan (Once): PushDownOperatorsToDataSource, which pushes down operators to the underlying data sources (i.e. DataSourceV2Relations)

  • postHocOptimizationBatches (extension point for post-hoc optimization batches)

  • User Provided Optimizers (FixedPoint): the extraOptimizations of the ExperimentalMethods

SparkOptimizer considers ExtractPythonUDFFromAggregate optimization rule as non-excludable.
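As an illustration of the User Provided Optimizers batch above, you can register extra optimization rules through the experimental methods of a SparkSession (a minimal sketch; MyNoopRule is a hypothetical rule that leaves the plan untouched):

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // A hypothetical no-op optimization rule (illustration only)
  object MyNoopRule extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  // Registered rules are executed as the User Provided Optimizers batch
  spark.experimental.extraOptimizations = Seq(MyNoopRule)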

Tip

Enable DEBUG or TRACE logging levels for org.apache.spark.sql.execution.SparkOptimizer logger to see what happens inside.

Add the following line to conf/log4j.properties:
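  log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=TRACE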

Refer to Logging.

Creating SparkOptimizer Instance

SparkOptimizer takes the following when created:

Extension Point for Additional Pre-Optimization Batches — preOptimizationBatches Method

preOptimizationBatches are the additional pre-optimization batches that are executed right before the regular optimization batches.

Extension Point for Additional Post-Hoc Optimization Batches — postHocOptimizationBatches Method

postHocOptimizationBatches are the additional post-optimization batches that are executed right after the regular optimization batches (before User Provided Optimizers).
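A minimal sketch of plugging into these extension points follows, assuming the two-argument SparkOptimizer constructor of Spark 2.4 (the class and rule names are hypothetical; such a custom optimizer would still have to be wired in through a custom session state builder):

  import org.apache.spark.sql.ExperimentalMethods
  import org.apache.spark.sql.catalyst.catalog.SessionCatalog
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule
  import org.apache.spark.sql.execution.SparkOptimizer

  // A hypothetical no-op rule (illustration only)
  object MyPostHocRule extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  // A custom optimizer that contributes one post-hoc optimization batch
  class MySparkOptimizer(catalog: SessionCatalog, experimentalMethods: ExperimentalMethods)
    extends SparkOptimizer(catalog, experimentalMethods) {

    override def postHocOptimizationBatches: Seq[Batch] =
      Batch("My Post-Hoc Batch", Once, MyPostHocRule) :: Nil
  }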

CheckAnalysis — Analysis Validation


CheckAnalysis defines checkAnalysis method that Analyzer uses to check if a logical plan is correct (after all the transformations) by applying validation rules and in the end marking it as analyzed.

Note
An analyzed logical plan is correct and ready for execution.

CheckAnalysis defines extendedCheckRules extension point that allows for extra analysis check rules.

Validating Analysis of Logical Plan (and Marking Plan As Analyzed) — checkAnalysis Method

checkAnalysis recursively checks the correctness of the analysis of the input logical plan and marks it as analyzed.

Note
checkAnalysis fails analysis when it finds an UnresolvedRelation in the input LogicalPlan…​FIXME What else?

Internally, checkAnalysis processes the nodes of the input plan (starting from the leaves, i.e. the nodes at the bottom of the operator tree).

Table 1. checkAnalysis’s Validation Rules (in the order of execution)

  • UnresolvedRelation: fails analysis with an error message

  • Unresolved Attribute: fails analysis with an error message

  • Expression with incorrect input data types: fails analysis with an error message

  • Unresolved Cast: fails analysis with an error message

  • Grouping: fails analysis with an error message

  • GroupingID: fails analysis with an error message

  • WindowExpressions with an AggregateExpression window function with the isDistinct flag enabled: fails analysis with an error message

  • WindowExpressions with an OffsetWindowFunction window function with an empty order specification or a non-offset window frame specification: fails analysis with an error message

  • WindowExpressions with a window function that is not one of AggregateExpression, AggregateWindowFunction or OffsetWindowFunction: fails analysis with an error message

  • Nondeterministic expressions: FIXME

  • UnresolvedHint: FIXME

  • FIXME

checkAnalysis then checks whether the plan is analyzed correctly (i.e. no logical operators are left unresolved). If any are, checkAnalysis fails the analysis with an AnalysisException and the following error message:

In the end, checkAnalysis marks the entire logical plan as analyzed.
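For example, referencing a column that does not exist makes checkAnalysis fail the analysis with an AnalysisException (a minimal sketch; the column name is illustrative):

  import org.apache.spark.sql.AnalysisException

  try {
    // "no_such_column" is not among the output columns (only "id" is)
    spark.range(5).select("no_such_column")
  } catch {
    case e: AnalysisException => println(e.getMessage)
  }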

Note

checkAnalysis is used when:

Extended Analysis Check Rules — extendedCheckRules Extension Point

extendedCheckRules is a collection of rules (functions) that checkAnalysis uses for custom analysis checks (after the main validations have been executed).

Note
When a rule's condition does not hold, the function throws an AnalysisException directly or via the failAnalysis method.

checkSubqueryExpression Internal Method

checkSubqueryExpression…​FIXME

Note
checkSubqueryExpression is used exclusively when CheckAnalysis is requested to validate analysis of a logical plan (for SubqueryExpression expressions).

Analyzer — Logical Query Plan Analyzer


Analyzer (aka Spark Analyzer or Query Analyzer) is the logical query plan analyzer that semantically validates and transforms an unresolved logical plan to an analyzed logical plan.

Analyzer is a concrete RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]) with the logical evaluation rules.

Analyzer uses SessionCatalog while resolving relational entities, e.g. databases, tables, columns.

Analyzer is created when SessionState is requested for the analyzer.

Figure 1. Creating Analyzer

Analyzer is available as the analyzer property of a session-specific SessionState.

You can access the analyzed logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.

Alternatively, you can access the analyzed logical plan using QueryExecution and its analyzed property (that together with numberedTreeString method is a very good “debugging” tool).
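For example (a minimal sketch; the query itself is illustrative):

  val q = spark.range(4).groupBy("id").count()

  // Prints the parsed, analyzed and optimized logical plans and the physical plan
  q.explain(extended = true)

  // The analyzed logical plan via QueryExecution
  println(q.queryExecution.analyzed.numberedTreeString)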

Analyzer defines extendedResolutionRules extension point for additional logical evaluation rules that a custom Analyzer can use to extend the Resolution rule batch. The rules are added at the end of the Resolution batch.
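One way to provide such rules from user code, without subclassing Analyzer, is the SparkSessionExtensions API, whose injected resolution rules end up among the extended resolution rules (a minimal sketch; MyResolutionRule is a hypothetical no-op rule):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // A hypothetical no-op resolution rule (illustration only)
  case class MyResolutionRule(session: SparkSession) extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  val spark = SparkSession.builder()
    .master("local[*]")
    .withExtensions(_.injectResolutionRule(session => MyResolutionRule(session)))
    .getOrCreate()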

Note
SessionState uses its own Analyzer with custom extendedResolutionRules, postHocResolutionRules, and extendedCheckRules extension methods.
Table 1. Analyzer’s Internal Registries and Counters

  • extendedResolutionRules: additional rules for the Resolution batch. Empty by default.

  • fixedPoint: FixedPoint with maxIterations for the Hints, Substitution, Resolution and Cleanup batches. Set when Analyzer is created (and can be defined explicitly or through the optimizerMaxIterations configuration setting).

  • postHocResolutionRules: the only rules in the Post-Hoc Resolution batch, if defined (executed in one pass, i.e. with Once strategy). Empty by default.

Analyzer is used by QueryExecution to resolve the managed LogicalPlan (and, as a sort of follow-up, assert that a structured query has already been properly analyzed, i.e. no failed or unresolved or somehow broken logical plan operators and expressions exist).

Tip

Enable TRACE or DEBUG logging levels for the respective session-specific loggers to see what happens inside Analyzer.

  • org.apache.spark.sql.internal.SessionState$$anon$1

  • org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 when Hive support is enabled

Add the following lines to conf/log4j.properties:
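  log4j.logger.org.apache.spark.sql.internal.SessionState$$anon$1=TRACE
  log4j.logger.org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1=TRACE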

Refer to Logging.


The reason for such weird-looking logger names is that the analyzer attribute is created as an anonymous subclass of the Analyzer class in the respective SessionStates.

Executing Logical Evaluation Rules — execute Method

Analyzer is a RuleExecutor that defines batches of logical evaluation rules (that resolve, remove, and in general modify the logical plan operators and expressions), as follows:

Table 2. Analyzer’s Batches and Logical Evaluation Rules (in the order of execution)

Hints (FixedPoint)

  • ResolveBroadcastHints: resolves UnresolvedHint logical operators with BROADCAST, BROADCASTJOIN or MAPJOIN hints to ResolvedHint operators

  • ResolveCoalesceHints: resolves UnresolvedHint logical operators with COALESCE or REPARTITION hints to ResolvedHint operators

  • RemoveAllHints: removes all remaining UnresolvedHint logical operators

Simple Sanity Check (Once)

  • LookupFunctions: checks whether a function identifier (referenced by an UnresolvedFunction) exists in the function registry and throws a NoSuchFunctionException if not

Substitution (FixedPoint)

  • CTESubstitution: resolves With operators (and substitutes named common table expressions, i.e. CTEs)

  • WindowsSubstitution: substitutes an UnresolvedWindowExpression with a WindowExpression for WithWindowDefinition logical operators

  • EliminateUnions: eliminates a Union of a single child into that child

  • SubstituteUnresolvedOrdinals: replaces ordinals in Sort and Aggregate logical operators with UnresolvedOrdinal expressions

Resolution (FixedPoint)

  • ResolveTableValuedFunctions: replaces UnresolvedTableValuedFunction with a table-valued function

  • ResolveRelations

  • ResolveReferences

  • ResolveCreateNamedStruct: resolves CreateNamedStruct expressions (with NamePlaceholders) to use Literal expressions

  • ResolveDeserializer

  • ResolveNewInstance

  • ResolveUpCast

  • ResolveGroupingAnalytics: resolves grouping expressions up in a logical plan tree, i.e. Cube, Rollup and GroupingSets expressions, as well as Filter and Sort with Grouping or GroupingID expressions. Expects that all children of a logical operator are already resolved (and, given it belongs to a fixed-point batch, that will likely happen at some iteration). Fails analysis when the grouping__id Hive function is used. Note: ResolveGroupingAnalytics is only for grouping function resolution, while ResolveAggregateFunctions is responsible for resolving the other aggregates.

  • ResolvePivot: resolves a Pivot logical operator to a Project with an Aggregate unary logical operator (for supported data types in aggregates) or just a single Aggregate

  • ResolveOrdinalInOrderByAndGroupBy

  • ResolveMissingReferences

  • ExtractGenerator

  • ResolveGenerate

  • ResolveFunctions: resolves functions using SessionCatalog; if a Generator is not found, ResolveFunctions reports an error

  • ResolveAliases: replaces UnresolvedAlias expressions with concrete aliases, i.e. NamedExpressions, MultiAlias (for GeneratorOuter and Generator) or Alias (for Cast and ExtractValue)

  • ResolveSubquery: resolves subquery expressions (i.e. ScalarSubquery, Exists and In)

  • ResolveWindowOrder

  • ResolveWindowFrame: resolves WindowExpression expressions

  • ResolveNaturalAndUsingJoin

  • ExtractWindowExpressions

  • GlobalAggregates: replaces Project operators that contain AggregateExpressions (that are not WindowExpressions) with Aggregate unary logical operators

  • ResolveAggregateFunctions: resolves aggregate functions in Filter and Sort operators. Note: ResolveAggregateFunctions skips (i.e. does not resolve) grouping functions, which are resolved by the ResolveGroupingAnalytics rule.

  • TimeWindowing: resolves TimeWindow expressions to Filter with Expand logical operators

  • ResolveInlineTables: resolves UnresolvedInlineTable operators to LocalRelations

  • TypeCoercion.typeCoercionRules: the type coercion rules

  • extendedResolutionRules

Post-Hoc Resolution (Once)

  • postHocResolutionRules

View (Once)

  • AliasViewChild

Nondeterministic (Once)

  • PullOutNondeterministic

UDF (Once)

  • HandleNullInputsForUDF

FixNullability (Once)

  • FixNullability

ResolveTimeZone (Once)

  • ResolveTimeZone: replaces a TimeZoneAwareExpression with no time zone with one with the session-local time zone

Cleanup (FixedPoint)

  • CleanupAliases

Tip
Consult the sources of Analyzer for the up-to-date list of the evaluation rules.

Creating Analyzer Instance

Analyzer takes the following when created:

Analyzer initializes the internal registries and counters.

Note
Analyzer can also be created without specifying the maxIterations argument, in which case it is configured using the optimizerMaxIterations configuration setting.

resolver Method

resolver requests CatalystConf for Resolver.

Note
Resolver is a mere function of two String parameters that returns true if both refer to the same entity (i.e. for case insensitive equality).

resolveExpression Method

resolveExpression…​FIXME

Note
resolveExpression is a protected[sql] method.
Note
resolveExpression is used when…​FIXME

commonNaturalJoinProcessing Internal Method

commonNaturalJoinProcessing…​FIXME

Note
commonNaturalJoinProcessing is used when…​FIXME

executeAndCheck Method

executeAndCheck…​FIXME

Note
executeAndCheck is used exclusively when QueryExecution is requested for the analyzed logical plan.

UnsupportedOperationChecker


UnsupportedOperationChecker is…​FIXME

checkForBatch Method

checkForBatch…​FIXME

Note
checkForBatch is used when…​FIXME

QueryExecution — Structured Query Execution Pipeline


QueryExecution represents the execution pipeline of a structured query (as a Dataset) with execution stages (phases).

Figure 1. Query Execution — From SQL through Dataset to RDD
Note
When you execute an operator on a Dataset, it triggers query execution that produces the good ol' RDD of internal binary rows (i.e. RDD[InternalRow]), which is Spark's execution plan; executing an RDD action on it then gives the result of the structured query.

You can access the QueryExecution of a Dataset using queryExecution attribute.

QueryExecution is the result of executing a LogicalPlan in a SparkSession (and so you could create a Dataset from a logical operator or use the QueryExecution after executing a logical operator).

Table 1. QueryExecution’s Properties (aka Structured Query Execution Pipeline)
Attribute / Phase Description

analyzed

Analyzed logical plan that has passed Analyzer‘s check rules.

Tip
Beside analyzed, you can use Dataset.explain basic action (with extended flag enabled) or SQL’s EXPLAIN EXTENDED to see the analyzed logical plan of a structured query.

withCachedData

analyzed logical plan after CacheManager was requested to replace logical query segments with cached query plans.

withCachedData makes sure that the logical plan can be analyzed and uses supported operations only.

optimizedPlan

Optimized logical plan that is the result of executing the logical query plan optimizer on the withCachedData logical plan.

sparkPlan

Note
sparkPlan is the first physical plan from the collection of all possible physical plans.
Note
It is guaranteed that Catalyst’s QueryPlanner (which SparkPlanner extends) will always generate at least one physical plan.

executedPlan

Optimized physical query plan that is in the final optimized “shape” and therefore ready for execution, i.e. the physical sparkPlan with physical preparation rules applied.

Note
Amongst the physical optimization rules that executedPlan phase triggers is the CollapseCodegenStages physical preparation rule that collapses physical operators that support code generation together as a WholeStageCodegenExec operator.
Note

executedPlan physical plan is used when:

toRdd

RDD of internal binary rows (i.e. RDD[InternalRow]) after executing the executedPlan.

The RDD is the top-level RDD of the DAG of RDDs (that represent physical operators).

Note

toRdd is a “boundary” between two Spark modules: Spark SQL and Spark Core.

After you have executed toRdd (directly or not), you basically “leave” Spark SQL’s Dataset world and “enter” Spark Core’s RDD space.

toRdd triggers a structured query execution (i.e. physical planning, but not execution of the plan) using SparkPlan.execute that recursively triggers execution of every child physical operator in the physical plan tree.

Note
You can use SparkSession.internalCreateDataFrame to apply a schema to an RDD[InternalRow].
Note
Use Dataset.rdd to access the RDD[InternalRow] with internal binary rows deserialized to a Scala type.

You can access the lazy attributes as follows:
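A minimal sketch (the query itself is illustrative):

  val q = spark.range(10).where("id > 5")
  val qe = q.queryExecution

  qe.analyzed        // analyzed logical plan
  qe.withCachedData  // analyzed plan with cached plans substituted in
  qe.optimizedPlan   // optimized logical plan
  qe.sparkPlan       // physical plan
  qe.executedPlan    // physical plan prepared for execution
  qe.toRdd           // RDD[InternalRow]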

QueryExecution uses the Catalyst Query Optimizer and Tungsten for better structured query performance.

Table 2. QueryExecution’s Properties
Name Description

planner

SparkPlanner

QueryExecution uses the input SparkSession to access the current SparkPlanner (through SessionState) when it is created. It then computes a SparkPlan (a physical plan, to be precise) using the planner. It is available as the sparkPlan attribute.

Note

A variant of QueryExecution that Spark Structured Streaming uses for query planning is IncrementalExecution.

Refer to IncrementalExecution — QueryExecution of Streaming Datasets in the Spark Structured Streaming gitbook.

Tip
Use explain operator to know about the logical and physical plans of a Dataset.

Note
QueryExecution belongs to org.apache.spark.sql.execution package.
Note
QueryExecution is a transient feature of a Dataset, i.e. it is not preserved across serializations.

Text Representation With Statistics — stringWithStats Method

stringWithStats…​FIXME

Note
stringWithStats is used exclusively when ExplainCommand logical command is executed (with cost flag enabled).

debug Object

Caution
FIXME

Building Complete Text Representation — completeString Internal Method

Caution
FIXME

Creating QueryExecution Instance

QueryExecution takes the following when created:

Physical Query Optimizations (Physical Plan Preparation Rules) — preparations Method

preparations is the set of the physical query optimization rules that transform a physical query plan to be more efficient and optimized for execution (i.e. Rule[SparkPlan]).

The preparations physical query optimizations are applied sequentially (one by one) to a physical plan in the following order:

Note

preparations rules are used when:

  • QueryExecution is requested for the executedPlan physical plan (through prepareForExecution)

  • (Spark Structured Streaming) IncrementalExecution is requested for the physical optimization rules for streaming structured queries

Applying preparations Physical Query Optimization Rules to Physical Plan — prepareForExecution Method

prepareForExecution takes physical preparation rules and applies them one by one to the input physical plan.

Note
prepareForExecution is used exclusively when QueryExecution is requested to prepare the physical plan for execution.

assertSupported Method

assertSupported requests UnsupportedOperationChecker to checkForBatch when…​FIXME

Note
assertSupported is used exclusively when QueryExecution is requested for withCachedData logical plan.

Creating Analyzed Logical Plan and Checking Correctness — assertAnalyzed Method

assertAnalyzed triggers initialization of analyzed (which is almost like executing it).

Note
assertAnalyzed executes analyzed by accessing it and throwing the result away. Since analyzed is a lazy value in Scala, it will then get initialized for the first time and stays so forever.

assertAnalyzed then requests Analyzer to validate analysis of the logical plan (i.e. analyzed).

Note

assertAnalyzed uses SparkSession to access the current SessionState that it then uses to access the Analyzer.

In Scala the access path looks as follows.
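  sparkSession.sessionState.analyzer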

In case of any AnalysisException, assertAnalyzed creates a new AnalysisException to make sure that it holds analyzed and reports it.

Note

assertAnalyzed is used when:

Building Text Representation with Cost Stats — toStringWithStats Method

toStringWithStats is a mere alias for completeString with appendStats flag enabled.

Note
toStringWithStats is a custom toString with cost statistics.

Note
toStringWithStats is used exclusively when ExplainCommand is executed (only when cost attribute is enabled).

Transforming SparkPlan Execution Result to Hive-Compatible Output Format — hiveResultString Method

hiveResultString returns the result as a Hive-compatible output format.

Internally, hiveResultString transforms the SparkPlan as follows:

Table 3. hiveResultString’s SparkPlan Transformations (in execution order)

  • ExecutedCommandExec for DescribeTableCommand: executes DescribeTableCommand and transforms every Row to a Hive-compatible output format

  • ExecutedCommandExec for ShowTablesCommand: executes ExecutedCommandExec and transforms the result to a collection of table names

  • Any other SparkPlan: executes the SparkPlan and transforms the result to a Hive-compatible output format

Note
hiveResultString is used exclusively when SparkSQLDriver (of ThriftServer) runs a command.

Extended Text Representation with Logical and Physical Plans — toString Method

Note
toString is part of Java’s Object Contract to…​FIXME.

toString is a mere alias for completeString with appendStats flag disabled.

Note
toString is on the “other” side of toStringWithStats which has appendStats flag enabled.

Simple (Basic) Text Representation — simpleString Method

simpleString requests the optimized SparkPlan for the text representation (of all nodes in the query tree) with verbose flag turned off.

In the end, simpleString adds == Physical Plan == header to the text representation and redacts sensitive information.
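For example (a minimal sketch):

  val q = spark.range(10).where("id > 5")
  // Physical-plan-only text representation, with the == Physical Plan == header
  println(q.queryExecution.simpleString)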

Note

simpleString is used when:

  • ExplainCommand is executed

  • Spark Structured Streaming’s StreamingExplainCommand is executed

Redacting Sensitive Information — withRedaction Internal Method

withRedaction takes the value of spark.sql.redaction.string.regex configuration property (as the regular expression to point at sensitive information) and requests Spark Core’s Utils to redact sensitive information in the input message.

Note
Internally, Spark Core’s Utils.redact uses Java’s Regex.replaceAllIn to replace all matches of a pattern with a string.
Note
withRedaction is used when QueryExecution is requested for the simple, extended and with statistics text representations.

InternalRowDataWriterFactory


InternalRowDataWriterFactory is…​FIXME

createDataWriter Method

Note
createDataWriter is part of DataWriterFactory Contract to…​FIXME.

createDataWriter…​FIXME

DataWriterFactory


DataWriterFactory is a contract…​FIXME

Note

DataWriterFactory is an Evolving contract, i.e. it is evolving towards becoming a stable API but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 1. DataWriterFactory Contract
Method Description

createDataWriter

Gives the DataWriter for a partition ID and attempt number

Used when:

DataWritingSparkTask


DataWritingSparkTask is…​FIXME

run Method

run…​FIXME

Note
run is used when…​FIXME

runContinuous Method

runContinuous…​FIXME

Note
runContinuous is used when…​FIXME

DataWriter


DataWriter is…​FIXME

DataSourceRDDPartition


DataSourceRDDPartition is a Spark Core Partition of DataSourceRDD and Spark Structured Streaming’s ContinuousDataSourceRDD RDDs.

DataSourceRDDPartition is created when:

  • DataSourceRDD and Spark Structured Streaming’s ContinuousDataSourceRDD are requested for partitions

  • DataSourceRDD and Spark Structured Streaming’s ContinuousDataSourceRDD are requested to compute a partition

  • DataSourceRDD and Spark Structured Streaming’s ContinuousDataSourceRDD are requested for preferred locations

DataSourceRDDPartition takes the following when created:
