
HiveTableRelation Leaf Logical Operator — Representing Hive Tables in Logical Plan

HiveTableRelation is a leaf logical operator that represents a Hive table in a logical query plan.

HiveTableRelation is created exclusively when the FindDataSourceTable logical evaluation rule is requested to resolve UnresolvedCatalogRelations in a logical plan (for Hive tables).

HiveTableRelation is partitioned when it has at least one partition column.

The table metadata of a HiveTableRelation (in a catalog) has to meet the following requirements:

  • The database of the table identifier is defined

  • The partition schema of the table is of the same type as the partition columns

  • The data schema of the table is of the same type as the data columns

HiveTableRelation has the output attributes made up of data followed by partition columns.

Note

HiveTableRelation is removed from a logical plan when the HiveAnalysis logical rule is executed (and transforms an InsertIntoTable with HiveTableRelation into an InsertIntoHiveTable).

HiveTableRelation is also removed from a logical plan when the RelationConversions rule is executed (and converts HiveTableRelations to LogicalRelations).

HiveTableRelation is resolved to HiveTableScanExec physical operator when HiveTableScans strategy is executed.
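The above can be observed in the analyzed plan of a query over a Hive table. The following is a minimal sketch, assuming spark-hive is on the classpath and a Hive-enabled SparkSession can be created locally; the table name demo_table is hypothetical.

```scala
// Sketch: assumes spark-hive on the classpath and a local Hive metastore
// (Derby) can be created; demo_table is a hypothetical table name.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("HiveTableRelation demo")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING) USING hive")

// Before RelationConversions or HiveTableScans kick in, the analyzed plan
// contains a HiveTableRelation leaf for demo_table.
val plan = spark.table("demo_table").queryExecution.analyzed
println(plan.numberedTreeString)
```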

Computing Statistics — computeStats Method

Note
computeStats is part of LeafNode Contract to compute statistics for cost-based optimizer.

computeStats takes the table statistics from the table metadata if defined and converts them to Spark statistics (with output columns).

If the table statistics are not available, computeStats reports an IllegalStateException.

Creating HiveTableRelation Instance

HiveTableRelation takes the following when created:

  • Table metadata

  • Data columns (as a collection of AttributeReferences)

  • Partition columns (as a collection of AttributeReferences)


Hint Logical Operator

Caution
FIXME


GroupingSets Unary Logical Operator

GroupingSets is a unary logical operator that represents SQL’s GROUPING SETS variant of the GROUP BY clause.

GroupingSets operator is resolved to an Aggregate logical operator at analysis phase.

Note
GroupingSets can only be created using SQL.
Note
GroupingSets is not supported on Structured Streaming’s streaming Datasets.

GroupingSets is never resolved (as it can only be converted to an Aggregate logical operator).

The output schema of a GroupingSets is exactly the attributes of the aggregate named expressions.

Analysis Phase

GroupingSets operator is resolved at analysis phase by the ResolveGroupingAnalytics logical evaluation rule.

GroupingSets operator is resolved to an Aggregate with Expand logical operators.
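A minimal sketch of the rewrite, assuming a local SparkSession; the view name t and column names are hypothetical.

```scala
// Sketch: assumes a local SparkSession; names are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("GroupingSets demo")
  .getOrCreate()

spark.range(4).selectExpr("id % 2 as a", "id % 3 as b", "id as v")
  .createOrReplaceTempView("t")

val q = spark.sql("""
  SELECT a, b, sum(v)
  FROM t
  GROUP BY a, b GROUPING SETS ((a), (b))
""")

// GroupingSets never survives analysis: the analyzed plan shows
// an Aggregate over an Expand operator instead.
println(q.queryExecution.analyzed.numberedTreeString)
```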

Creating GroupingSets Instance

GroupingSets takes the following when created:

  • Grouping sets (as a collection of collections of grouping expressions)

  • Group-by expressions

  • Child logical plan

  • Aggregation named expressions


Generate Unary Logical Operator for Lateral Views

Generate is a unary logical operator that is created to represent the following (after a logical plan is analyzed):

  • Generator or GeneratorOuter expressions (e.g. explode) in a SELECT clause

  • SQL’s LATERAL VIEW clause

resolved flag is…​FIXME

Note
resolved is part of LogicalPlan Contract to…​FIXME.

producedAttributes…​FIXME

The output schema of a Generate is…​FIXME

Note
Generate logical operator is resolved to GenerateExec unary physical operator in BasicOperators execution planning strategy.
Tip

Use generate operator from Catalyst DSL to create a Generate logical operator, e.g. for testing or Spark SQL internals exploration.
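Short of the Catalyst DSL, the easiest way to see a Generate operator is a query with a generator expression. A minimal sketch, assuming a local SparkSession; column names are hypothetical.

```scala
// Sketch: assumes a local SparkSession; names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Generate demo")
  .getOrCreate()
import spark.implicits._

val q = Seq((1, Seq("a", "b"))).toDF("id", "xs")
  .select($"id", explode($"xs"))

// The analyzed plan contains a Generate unary operator for explode;
// the same happens for SQL's LATERAL VIEW explode(...).
println(q.queryExecution.analyzed.numberedTreeString)
```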

Creating Generate Instance

Generate takes the following when created:

Generate initializes the internal registries and counters.


ExternalRDD

ExternalRDD is a leaf logical operator that is a logical representation of (the data from) an RDD in a logical query plan.

ExternalRDD is created (via the apply factory method) when SparkSession is requested to create a DataFrame from an RDD of product types (e.g. Scala case classes or tuples) or a Dataset from an RDD of a given type.

ExternalRDD is a MultiInstanceRelation and an ObjectProducer.

Note
ExternalRDD is resolved to ExternalRDDScanExec when BasicOperators execution planning strategy is executed.

newInstance Method

Note
newInstance is part of MultiInstanceRelation Contract to…​FIXME.

newInstance…​FIXME

Computing Statistics — computeStats Method

Note
computeStats is part of LeafNode Contract to compute statistics for cost-based optimizer.

computeStats…​FIXME

Creating ExternalRDD Instance

ExternalRDD takes the following when created:

  • Output object Attribute

  • RDD (of the type of the output object attribute)

  • SparkSession

Creating ExternalRDD — apply Factory Method

apply…​FIXME

Note
apply is used when SparkSession is requested to create a DataFrame from RDD of product types (e.g. Scala case classes, tuples) or Dataset from RDD of a given type.
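A minimal sketch of that path, assuming a local SparkSession; the data is hypothetical.

```scala
// Sketch: assumes a local SparkSession; data is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ExternalRDD demo")
  .getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("Ann", 30), ("Bob", 25)))

// createDataset on an RDD goes through ExternalRDD.apply,
// so the logical plan starts with an ExternalRDD leaf.
val ds = spark.createDataset(rdd)
println(ds.queryExecution.logical.numberedTreeString)
```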


ExplainCommand Logical Command

ExplainCommand is a logical command with a side effect that lets users see how a structured query will eventually be executed, i.e. it shows the logical and physical plans with or without details about codegen and cost statistics.

When executed, ExplainCommand computes a QueryExecution that is then used to output a single-column DataFrame with the following:

  • codegen explain, i.e. WholeStageCodegen subtrees if codegen flag is enabled.

  • extended explain, i.e. the parsed, analyzed, optimized logical plans with the physical plan if extended flag is enabled.

  • cost explain, i.e. optimized logical plan with stats if cost flag is enabled.

  • simple explain, i.e. the physical plan only, when neither the codegen nor the extended flag is enabled.

ExplainCommand is created by Dataset’s explain operator and EXPLAIN SQL statement (accepting EXTENDED and CODEGEN options).

The following EXPLAIN variants in SQL queries are not supported:

  • EXPLAIN FORMATTED

  • EXPLAIN LOGICAL

The output schema of an ExplainCommand is a single plan column of StringType.
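Both entry points can be tried out directly. A minimal sketch, assuming a local SparkSession:

```scala
// Sketch: assumes a local SparkSession.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ExplainCommand demo")
  .getOrCreate()

// Dataset.explain(true) enables the extended flag of ExplainCommand.
spark.range(10).explain(true)

// EXPLAIN in SQL is parsed into an ExplainCommand as well;
// the result is a single-column Dataset of plan text.
val explained = spark.sql("EXPLAIN EXTENDED SELECT * FROM range(5)")
explained.show(truncate = false)
```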

Creating ExplainCommand Instance

ExplainCommand takes the following when created:

  • LogicalPlan

  • extended flag whether to include extended details in the output when ExplainCommand is executed (disabled by default)

  • codegen flag whether to include codegen details in the output when ExplainCommand is executed (disabled by default)

  • cost flag whether to include cost statistics (plan stats) in the output when ExplainCommand is executed (disabled by default)

ExplainCommand initializes output attribute.

Note
ExplainCommand is created when…​FIXME

Executing Logical Command (Computing Text Representation of QueryExecution) — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run computes QueryExecution and returns its text representation in a single Row.

Internally, run creates an IncrementalExecution for a streaming dataset directly or requests SessionState to execute the LogicalPlan.

Note
Streaming Dataset is part of Spark Structured Streaming.

run then requests QueryExecution to build the output text representation, i.e. codegened, extended (with logical and physical plans), with stats, or simple.

In the end, run creates a Row with the text representation.


Expand Unary Logical Operator

Expand is a unary logical operator that represents Cube, Rollup, GroupingSets and TimeWindow logical operators after they have been resolved at analysis phase.

Note
Expand logical operator is resolved to ExpandExec physical operator in BasicOperators execution planning strategy.
Table 1. Expand’s Properties

  • references: AttributeSet from projections

  • validConstraints: empty set of expressions

Analysis Phase

Expand logical operator is created at analysis phase by the following logical evaluation rules: ResolveGroupingAnalytics (for Cube, Rollup and GroupingSets) and TimeWindowing (for TimeWindow).

Note
Aggregate → (Cube|Rollup|GroupingSets) → constructAggregate → constructExpand
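The rewrite can be seen in the analyzed plan of a cube query. A minimal sketch, assuming a local SparkSession; column names are hypothetical.

```scala
// Sketch: assumes a local SparkSession; names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Expand demo")
  .getOrCreate()
import spark.implicits._

val q = spark.range(4)
  .selectExpr("id % 2 as a", "id % 2 as b", "id as v")
  .cube($"a", $"b")
  .agg(sum($"v"))

// ResolveGroupingAnalytics rewrites the cube into an Aggregate
// over an Expand logical operator.
println(q.queryExecution.analyzed.numberedTreeString)
```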

Rule-Based Logical Query Optimization Phase

Creating Expand Instance

Expand takes the following when created:

  • Projection expressions (as a collection of collections of expressions)

  • Output schema attributes

  • Child logical plan


Except

Except is…​FIXME


DropTableCommand Logical Command

DropTableCommand is a logical command for DROP TABLE and DROP VIEW SQL statements.

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME
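A minimal sketch of the command in action, assuming a local SparkSession; the view name to_drop is hypothetical. Commands are executed eagerly, so run is invoked as soon as the SQL statement is issued.

```scala
// Sketch: assumes a local SparkSession; the view name is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DropTableCommand demo")
  .getOrCreate()

spark.range(1).createOrReplaceTempView("to_drop")

// DROP VIEW (and DROP TABLE) statements are parsed into a
// DropTableCommand, which is run eagerly.
spark.sql("DROP VIEW IF EXISTS to_drop")
```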
