关注 spark技术分享,
撸spark源码 玩spark最佳实践

RunnableCommand Contract — Generic Logical Command with Side Effects

admin阅读(1454)

RunnableCommand Contract — Generic Logical Command with Side Effects

RunnableCommand is the generic logical command that is executed eagerly for its side effects.

RunnableCommand defines one abstract method run that computes a collection of Row records with the side effect, i.e. the result of executing a command.

Note
RunnableCommand logical operator is resolved to ExecutedCommandExec physical operator in BasicOperators execution planning strategy.
Note

run is executed when:

Table 1. Available RunnableCommands
RunnableCommand Description

AddFileCommand

AddJarCommand

AlterDatabasePropertiesCommand

AlterTableAddPartitionCommand

AlterTableChangeColumnCommand

AlterTableDropPartitionCommand

AlterTableRecoverPartitionsCommand

AlterTableRenameCommand

AlterTableRenamePartitionCommand

AlterTableSerDePropertiesCommand

AlterTableSetLocationCommand

AlterTableSetPropertiesCommand

AlterTableUnsetPropertiesCommand

AlterViewAsCommand

AnalyzeColumnCommand

AnalyzePartitionCommand

AnalyzeTableCommand

CacheTableCommand

When executed, CacheTableCommand creates a DataFrame followed by registering a temporary view for the optional query.

CacheTableCommand requests the session-specific Catalog to cache the table.

Note
CacheTableCommand uses SparkSession to access the Catalog.

If the caching is not LAZY (which is not by default), CacheTableCommand creates a DataFrame for the table and counts the rows (that will trigger the caching).

Note
CacheTableCommand uses a Spark SQL pattern to trigger DataFrame caching by executing count operation.

ClearCacheCommand

CreateDatabaseCommand

CreateDataSourceTableAsSelectCommand

When executed, …​FIXME

Used exclusively when DataSourceAnalysis posthoc logical resolution rule resolves a CreateTable logical operator with queries using non-Hive table providers (which is when DataFrameWriter saves a DataFrame to a non-Hive table or for Create Table As Select SQL statements)

CreateDataSourceTableCommand

CreateFunctionCommand

CreateHiveTableAsSelectCommand

CreateTableCommand

CreateTableLikeCommand

CreateTempViewUsing

CreateViewCommand

DescribeColumnCommand

DescribeDatabaseCommand

DescribeFunctionCommand

DescribeTableCommand

DropDatabaseCommand

DropFunctionCommand

DropTableCommand

ExplainCommand

InsertIntoDataSourceCommand

InsertIntoHadoopFsRelationCommand

InsertIntoHiveTable

ListFilesCommand

ListJarsCommand

LoadDataCommand

RefreshResource

RefreshTable

ResetCommand

SaveIntoDataSourceCommand

When executed, requests DataSource to write a DataFrame to a data source per save mode.

Used exclusively when DataFrameWriter is requested to save a DataFrame to a data source.

SetCommand

SetDatabaseCommand

ShowColumnsCommand

ShowCreateTableCommand

ShowDatabasesCommand

ShowFunctionsCommand

ShowPartitionsCommand

ShowTablePropertiesCommand

ShowTablesCommand

StreamingExplainCommand

TruncateTableCommand

UncacheTableCommand

Command Contract — Eagerly-Executed Logical Operator

admin阅读(1485)

Command Contract — Eagerly-Executed Logical Operator

Command is the marker interface for logical operators that represent non-query commands that are executed early in the query plan lifecycle (unlike logical plans in general).

Note
Command is executed when a Dataset is requested for the logical plan (which is after the query has been analyzed).

Command has no output schema by default.

Command has no child logical operators (which makes it similar to leaf logical operators).

Table 1. Commands (Direct Implementations)
Command Description

DataWritingCommand

RunnableCommand

LogicalPlan Contract — Logical Operator with Children and Expressions / Logical Query Plan

admin阅读(1828)

LogicalPlan Contract — Logical Relational Operator with Children and Expressions / Logical Query Plan

LogicalPlan is an extension of the QueryPlan contract for logical operators to build a logical query plan (i.e. a tree of logical operators).

Note
A logical query plan is a tree of nodes of logical operators that in turn can have (trees of) Catalyst expressions. In other words, there are at least two trees at every level (operator).

LogicalPlan can be resolved.

In order to get the logical plan of a structured query you should use the QueryExecution.

LogicalPlan goes through execution stages (as a QueryExecution). In order to convert a LogicalPlan to a QueryExecution you should use SessionState and request it to “execute” the plan.

Note

A common idiom in Spark SQL to make sure that a logical plan can be analyzed is to request a SparkSession for the SessionState that is in turn requested to execute the logical plan (which simply creates a QueryExecution).

Note

Another common idiom in Spark SQL to convert a LogicalPlan into a Dataset is to use Dataset.ofRows internal method that executes the logical plan followed by creating a Dataset with the QueryExecution and a RowEncoder.

A logical operator is considered partially resolved when its child operators are resolved (aka children resolved).

A logical operator is (fully) resolved to a specific schema when all expressions and the children are resolved.

A logical plan knows the size of objects that are results of query operators, like join, through Statistics object.

A logical plan knows the maximum number of records it can compute.

LogicalPlan can be streaming if it contains one or more structured streaming sources.

Note
LogicalPlan is in the end transformed to a physical query plan.
Table 1. Logical Operators / Specialized Logical Plans
LogicalPlan Description

LeafNode

Logical operator with no child operators

UnaryNode

Logical plan with a single child logical operator

BinaryNode

Logical operator with two child logical operators

Command

RunnableCommand

Table 2. LogicalPlan’s Internal Registries and Counters
Name Description

statsCache

Cached plan statistics (as Statistics) of the LogicalPlan

Computed and cached in stats.

Used in stats and verboseStringWithSuffix.

Reset in invalidateStatsCache

Getting Cached or Calculating Estimated Statistics — stats Method

stats returns the cached plan statistics or computes a new one (and caches it as statsCache).

Note

stats is used when:

invalidateStatsCache method

Caution
FIXME

verboseStringWithSuffix method

Caution
FIXME

setAnalyzed method

Caution
FIXME

Is Logical Plan Streaming? — isStreaming method

isStreaming is part of the public API of LogicalPlan and is enabled (i.e. true) when a logical plan is a streaming source.

By default, it walks over subtrees and calls itself, i.e. isStreaming, on every child node to find a streaming source.

Note
Streaming Datasets are part of Structured Streaming.

Refreshing Child Logical Plans — refresh Method

refresh calls itself recursively for every child logical operator.

Note
refresh is overriden by LogicalRelation only (that refreshes the location of HadoopFsRelation relations only).
Note

refresh is used when:

resolveQuoted Method

resolveQuoted…​FIXME

Note
resolveQuoted is used when…​FIXME

Resolving Attribute By Name Parts — resolve Method

  1. A protected method

resolve…​FIXME

Note
resolve is used when…​FIXME

WindowSpecDefinition

admin阅读(1475)

WindowSpecDefinition Unevaluable Expression

WindowSpecDefinition is an unevaluable expression (i.e. with no support for eval and doGenCode methods).

WindowSpecDefinition is created when:

Table 1. WindowSpecDefinition’s Properties
Name Description

children

Window partition and order specifications (for which WindowExpression was created).

dataType

Unsupported (i.e. reports a UnsupportedOperationException)

foldable

Disabled (i.e. false)

nullable

Enabled (i.e. true)

resolved

Enabled when children are and the input DataType is valid and the input frameSpecification is a SpecifiedWindowFrame.

sql

Contains PARTITION BY with comma-separated elements of partitionSpec (if defined) with ORDER BY with comma-separated elements of orderSpec (if defined) followed by frameSpecification.

Creating WindowSpecDefinition Instance

WindowSpecDefinition takes the following when created:

  • Expressions for window partition specification

  • Window order specifications (as SortOrder unary expressions)

  • Window frame specification (as WindowFrame)

Validating Data Type Of Window Order– isValidFrameType Internal Method

isValidFrameType is positive (true) when the data type of the window order specification and the input ft data type are as follows:

Otherwise, isValidFrameType is negative (false).

Note
isValidFrameType is used exclusively when WindowSpecDefinition is requested to checkInputDataTypes (with RangeFrame as the window frame specification)

Checking Input Data Types — checkInputDataTypes Method

Note
checkInputDataTypes is part of the Expression Contract to checks the input data types.

checkInputDataTypes…​FIXME

WindowFunction Contract — Window Function Expressions With WindowFrame

admin阅读(1385)

WindowFunction Contract — Window Function Expressions With WindowFrame

WindowFunction is the contract of function expressions that define a WindowFrame in which the window operator must be executed.

Table 1. WindowFunctions (Direct Implementations)
WindowFunction Description

AggregateWindowFunction

OffsetWindowFunction

Defining WindowFrame for Execution — frame Method

frame defines the WindowFrame for function execution, i.e. the WindowFrame in which the window operator must be executed.

frame is UnspecifiedFrame by default.

Note
frame is used when…​FIXME

WindowExpression

admin阅读(1459)

WindowExpression Unevaluable Expression

WindowExpression is an unevaluable expression that represents a window function (over some WindowSpecDefinition).

Note
An unevaluable expression cannot be evaluated to produce a value (neither in interpreted nor code-generated expression evaluations) and has to be resolved (replaced) to some other expressions or logical operators at analysis or optimization phases or they fail analysis.

WindowExpression is created when:

Note
WindowExpression can only be created with AggregateExpression, AggregateWindowFunction or OffsetWindowFunction expressions which is enforced at analysis.

Note
WindowExpression is resolved in ExtractWindowExpressions, ResolveWindowFrame and ResolveWindowOrder logical rules.

  1. Use sqlParser directly as in WithWindowDefinition Example

Table 1. WindowExpression’s Properties
Name Description

children

Collection of two expressions, i.e. windowFunction and WindowSpecDefinition, for which WindowExpression was created.

dataType

DataType of windowFunction

foldable

Whether or not windowFunction is foldable.

nullable

Whether or not windowFunction is nullable.

sql

"[windowFunction].sql OVER [windowSpec].sql"

toString

"[windowFunction] [windowSpec]"

Note
WindowExpression is subject to NullPropagation and DecimalAggregates logical optimizations.
Note
Distinct window functions are not supported which is enforced at analysis.
Note
An offset window function can only be evaluated in an ordered row-based window frame with a single offset which is enforced at analysis.

Catalyst DSL — windowExpr Operator

windowExpr operator in Catalyst DSL creates a WindowExpression expression, e.g. for testing or Spark SQL internals exploration.

Creating WindowExpression Instance

WindowExpression takes the following when created:

UnresolvedWindowExpression

admin阅读(2380)

UnresolvedWindowExpression Unevaluable Expression — WindowExpression With Unresolved Window Specification Reference

UnresolvedWindowExpression is an unevaluable expression that represents…​FIXME

Note
An unevaluable expression cannot be evaluated to produce a value (neither in interpreted nor code-generated expression evaluations) and has to be resolved (replaced) to some other expressions or logical operators at analysis or optimization phases or they fail analysis.

UnresolvedWindowExpression is created when:

  • FIXME

UnresolvedWindowExpression is created to represent a child expression and WindowSpecReference (with an identifier for the window reference) when AstBuilder parses a function evaluated in a windowed context with a WindowSpecReference.

UnresolvedWindowExpression is resolved to a WindowExpression when Analyzer resolves UnresolvedWindowExpressions.

Table 1. UnresolvedWindowExpression’s Properties
Name Description

dataType

Reports a UnresolvedException

foldable

Reports a UnresolvedException

nullable

Reports a UnresolvedException

resolved

Disabled (i.e. false)

UnresolvedStar

admin阅读(1573)

UnresolvedStar Expression

UnresolvedStar is a Star expression that represents a star (i.e. all) expression in a logical query plan.

UnresolvedStar is created when:

UnresolvedStar can never be resolved, and is expanded at analysis (when ResolveReferences logical resolution rule is executed).

Note
UnresolvedStar can only be used in Project, Aggregate or ScriptTransformation logical operators.


Given UnresolvedStar can never be resolved it should not come as a surprise that it cannot be evaluated either (i.e. produce a value given an internal row). When requested to evaluate, UnresolvedStar simply reports a UnsupportedOperationException.

When created, UnresolvedStar takes name parts that, once concatenated, is the target of the star expansion.

Tip

Use star operator from Catalyst DSL’s expressions to create an UnresolvedStar.

You could also use $"" or ' to create an UnresolvedStar, but that requires sbt console (with Spark libraries defined in build.sbt) as the Catalyst DSL expressions implicits interfere with the Spark implicits to create columns.

Note

AstBuilder replaces count(*) (with no DISTINCT keyword) to count(1).

Star Expansion — expand Method

Note
expand is part of Star Contract to…​FIXME.

expand first expands to named expressions per target:

  • For unspecified target, expand gives the output schema of the input logical query plan (that assumes that the star refers to a relation / table)

  • For target with one element, expand gives the table (attribute) in the output schema of the input logical query plan (using qualifiers) if available

With no result earlier, expand then requests the input logical query plan to resolve the target name parts to a named expression.

For a named expression of StructType data type, expand creates an Alias expression with a GetStructField unary expression (with the resolved named expression and the field index).

expand reports a AnalysisException when:

  • The data type of the named expression (when the input logical plan was requested to resolve the target) is not a StructType.

  • Earlier attempts gave no results

UnresolvedRegex

admin阅读(1233)

UnresolvedRegex

UnresolvedRegex is…​FIXME

UnresolvedOrdinal

admin阅读(1615)

UnresolvedOrdinal Unevaluable Leaf Expression

UnresolvedOrdinal is a leaf expression that represents a single integer literal in Sort logical operators (in SortOrder ordering expressions) and in Aggregate logical operators (in grouping expressions) in a logical plan.

UnresolvedOrdinal is created when SubstituteUnresolvedOrdinals logical resolution rule is executed.

UnresolvedOrdinal takes a single ordinal integer when created.

UnresolvedOrdinal is an unevaluable expression and cannot be evaluated (i.e. produce a value given an internal row).

Note
An unevaluable expression cannot be evaluated to produce a value (neither in interpreted nor code-generated expression evaluations) and has to be resolved (replaced) to some other expressions or logical operators at analysis or optimization phases or they fail analysis.

UnresolvedOrdinal can never be resolved (and is replaced at analysis phase).

Note
UnresolvedOrdinal is resolved when ResolveOrdinalInOrderByAndGroupBy logical resolution rule is executed.

UnresolvedOrdinal has no representation in SQL.

Note
UnresolvedOrdinal in GROUP BY ordinal position is not allowed for a select list with a star (*).

关注公众号:spark技术分享

联系我们联系我们