
CreateDataSourceTableAsSelectCommand


CreateDataSourceTableAsSelectCommand Logical Command

CreateDataSourceTableAsSelectCommand is a logical command that creates a data source table and inserts the result of a structured query into it (i.e. CREATE TABLE … USING … AS SELECT).
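For context, a CTAS statement over a data source table is what gets planned as this command; a minimal sketch (table and column names are illustrative only):

```sql
-- CREATE TABLE ... USING ... AS SELECT over a data source (here parquet)
-- is planned as CreateDataSourceTableAsSelectCommand
CREATE TABLE sales_summary
USING parquet
AS SELECT region, sum(amount) AS total FROM sales GROUP BY region
```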

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

AnalyzeTableCommand


AnalyzeTableCommand Logical Command — Computing Table-Level Statistics (Total Size and Row Count)

AnalyzeTableCommand is a logical command that computes statistics (i.e. total size and row count) for a table and stores the stats in a metastore.

AnalyzeTableCommand is created exclusively for ANALYZE TABLE with neither a PARTITION specification nor a FOR COLUMNS clause.

Executing Logical Command (Computing Table-Level Statistics and Altering Metastore) — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run requests the session-specific SessionCatalog for the metadata of the table and makes sure that it is not a view.

Note
run uses the input SparkSession to access the session-specific SessionState that in turn gives access to the current SessionCatalog.


run computes the total size and, unless the NOSCAN flag is enabled, the row count statistic of the table.

Note
run uses SparkSession to find the table in a metastore.

run throws an AnalysisException when executed on a view.

Note

Computing the row count statistic triggers a Spark job that counts the rows in the table (which happens for ANALYZE TABLE without the NOSCAN flag).
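The two statement forms handled by AnalyzeTableCommand can be sketched as follows (the table name is illustrative only):

```sql
ANALYZE TABLE sales COMPUTE STATISTICS         -- total size and row count (runs a Spark job)
ANALYZE TABLE sales COMPUTE STATISTICS NOSCAN  -- total size only (noscan flag enabled)
```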

Creating AnalyzeTableCommand Instance

AnalyzeTableCommand takes the following when created:

  • TableIdentifier

  • noscan flag (enabled by default) that indicates whether the NOSCAN option was used

AnalyzePartitionCommand


AnalyzePartitionCommand Logical Command — Computing Partition-Level Statistics (Total Size and Row Count)

AnalyzePartitionCommand is a logical command that computes statistics (i.e. total size and row count) for table partitions and stores the stats in a metastore.

AnalyzePartitionCommand is created exclusively for ANALYZE TABLE with a PARTITION specification and no FOR COLUMNS clause.
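A sketch of the statement forms that are planned as AnalyzePartitionCommand (table and partition column names are illustrative only):

```sql
-- a PARTITION specification with a value analyzes the matching partition
ANALYZE TABLE logs PARTITION (ds = '2018-01-01') COMPUTE STATISTICS
-- a partition column without a value covers all its partitions
ANALYZE TABLE logs PARTITION (ds) COMPUTE STATISTICS NOSCAN
```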

Executing Logical Command (Computing Partition-Level Statistics and Altering Metastore) — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run requests the session-specific SessionCatalog for the metadata of the table and makes sure that it is not a view.

Note
run uses the input SparkSession to access the session-specific SessionState that in turn is used to access the current SessionCatalog.

run requests the session-specific SessionCatalog for the partitions per the partition specification.

run finishes when the table has no partitions defined in a metastore.

run calculates the total size (in bytes) of every table partition (aka partition location size) and creates a CatalogStatistics (with the row count statistic computed earlier) for each partition whose current statistics differ from those recorded in the metastore.

In the end, run alters table partition metadata for partitions with the statistics changed.

run reports a NoSuchPartitionException when the partition specification does not match partitions in the metastore.

run reports an AnalysisException when executed on a view.

Computing Row Count Statistics Per Partition — calculateRowCountsPerPartition Internal Method

calculateRowCountsPerPartition…​FIXME

Note
calculateRowCountsPerPartition is used exclusively when AnalyzePartitionCommand is executed.

getPartitionSpec Internal Method

getPartitionSpec…​FIXME

Note
getPartitionSpec is used exclusively when AnalyzePartitionCommand is executed.

Creating AnalyzePartitionCommand Instance

AnalyzePartitionCommand takes the following when created:

  • TableIdentifier

  • Partition specification

  • noscan flag (enabled by default) that indicates whether the NOSCAN option was used

AnalyzeColumnCommand


AnalyzeColumnCommand Logical Command for ANALYZE TABLE…COMPUTE STATISTICS FOR COLUMNS SQL Command

AnalyzeColumnCommand is a logical command for ANALYZE TABLE with FOR COLUMNS clause (and no PARTITION specification).

AnalyzeColumnCommand can generate column histograms when spark.sql.statistics.histogram.enabled configuration property is turned on (which is disabled by default). AnalyzeColumnCommand supports column histograms for the following data types:

  • IntegralType

  • DecimalType

  • DoubleType

  • FloatType

  • DateType

  • TimestampType

Note
Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.
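A sketch of enabling histograms and computing column statistics (table and column names are illustrative only):

```sql
SET spark.sql.statistics.histogram.enabled=true;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount;
```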

Note
AnalyzeColumnCommand is described by analyze labeled alternative in statement expression in SqlBase.g4 and parsed using SparkSqlAstBuilder.

Note
AnalyzeColumnCommand is not supported on views.

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run calculates the following statistics:

  • sizeInBytes

  • stats for each column

Caution
FIXME

Computing Statistics for Specified Columns — computeColumnStats Internal Method

computeColumnStats…​FIXME

Note
computeColumnStats is used exclusively when AnalyzeColumnCommand is executed.

computePercentiles Internal Method

computePercentiles…​FIXME

Note
computePercentiles is used exclusively when AnalyzeColumnCommand is executed (and computes column statistics).

Creating AnalyzeColumnCommand Instance

AnalyzeColumnCommand takes the following when created:

  • TableIdentifier

  • Column names

AnalysisBarrier


AnalysisBarrier Leaf Logical Operator — Hiding Child Query Plan in Analysis

AnalysisBarrier is a leaf logical operator that is a wrapper of an analyzed logical plan to hide it from the Spark Analyzer. The purpose of AnalysisBarrier is to prevent the child logical plan from being analyzed again (and increasing the time spent on query analysis).

AnalysisBarrier is created when:

AnalysisBarrier takes a single child logical query plan when created.

AnalysisBarrier returns the child logical query plan when requested for the inner nodes (that should be shown as an inner nested tree of this node).

AnalysisBarrier simply requests the child logical query plan for the output schema attributes.

AnalysisBarrier simply requests the child logical query plan for the isStreaming flag.

AnalysisBarrier simply requests the child logical operator for the canonicalized version.

AlterViewAsCommand


AlterViewAsCommand Logical Command

AlterViewAsCommand is a logical command for ALTER VIEW SQL statement to alter a view.

AlterViewAsCommand works with a table identifier (as TableIdentifier), the original SQL text, and a LogicalPlan for the SQL query.

Note
AlterViewAsCommand is described by alterViewQuery labeled alternative in statement expression in SqlBase.g4 and parsed using SparkSqlParser.

When executed, AlterViewAsCommand attempts to alter a temporary view in the current SessionCatalog first, and if that “fails”, alters the permanent view.
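A minimal sketch of the statement this command handles (view, table, and column names are illustrative only):

```sql
ALTER VIEW sales_v AS SELECT region, amount FROM sales WHERE amount > 0
```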

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

alterPermanentView Internal Method

alterPermanentView…​FIXME

Note
alterPermanentView is used when…​FIXME

Aggregate


Aggregate Unary Logical Operator

Aggregate is a unary logical operator that holds the following:

Aggregate is created to represent the following (after a logical plan is analyzed):

Note
Aggregate logical operator is translated to one of HashAggregateExec, ObjectHashAggregateExec or SortAggregateExec physical operators in Aggregation execution planning strategy.
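The translation can be observed in spark-shell (this sketch assumes a running SparkSession named spark):

```scala
// groupBy/count is analyzed into an Aggregate logical operator and planned
// here as HashAggregateExec by the Aggregation execution planning strategy
val q = spark.range(10).groupBy("id").count
q.explain(extended = true)
// the logical plans show Aggregate; the physical plan shows HashAggregateExec
```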

Table 1. Aggregate’s Properties
Name Description

maxRows

Child logical plan‘s maxRows

Note
Part of LogicalPlan contract.

output

Note
Part of QueryPlan contract.

resolved

Enabled when:

Note
Part of LogicalPlan contract.

validConstraints

The (expression) constraints of child logical plan and non-aggregate aggregate named expressions.

Note
Part of QueryPlan contract.

Rule-Based Logical Query Optimization Phase

PushDownPredicate logical plan optimization applies so-called filter pushdown to an Aggregate operator when under a Filter operator and with all expressions deterministic.
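The optimization can be sketched in spark-shell (a running SparkSession named spark is assumed): a deterministic predicate on a grouping expression is pushed below the Aggregate.

```scala
val q = spark.range(10)
  .groupBy("id").count
  .where($"id" > 5)  // deterministic predicate on a grouping expression
q.queryExecution.optimizedPlan
// the Filter operator appears below Aggregate in the optimized plan
```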

SaveAsHiveFile


SaveAsHiveFile Contract — DataWritingCommands That Write Query Result As Hive Files

SaveAsHiveFile is the extension of the DataWritingCommand contract for logical commands that can write a query result out as Hive files (using saveAsHiveFile).

Table 1. SaveAsHiveFiles
SaveAsHiveFile Description

InsertIntoHiveDirCommand

InsertIntoHiveTable

saveAsHiveFile Method

saveAsHiveFile…​FIXME

Note

saveAsHiveFile is used when:

DataWritingCommand


DataWritingCommand Contract — Logical Commands That Write Data

DataWritingCommand is an extension of the Command contract for logical commands that write query data to a relation when executed.

DataWritingCommand is resolved to a DataWritingCommandExec physical operator when BasicOperators execution planning strategy is executed (i.e. when a logical plan is planned into a physical plan).

Table 1. DataWritingCommand Contract
Property Description

outputColumnNames

The output column names of the analyzed input query plan

Used when DataWritingCommand is requested for the outputColumns

query

The logical query plan representing the data to write (i.e. whose result will be inserted into a relation)

Used when DataWritingCommand is requested for the child nodes and outputColumns.

run

Executes the command

Used when:

When requested for the child nodes, DataWritingCommand simply returns the logical query plan.

DataWritingCommand defines custom performance metrics.

Table 2. DataWritingCommand’s Performance Metrics
Key Name (in web UI) Description

numFiles

number of written files

numOutputBytes

bytes of written output

numOutputRows

number of output rows

numParts

number of dynamic part

The performance metrics are used when:

Table 3. DataWritingCommands (Direct Implementations)
DataWritingCommand Description

CreateDataSourceTableAsSelectCommand

CreateHiveTableAsSelectCommand

InsertIntoHadoopFsRelationCommand

SaveAsHiveFile

Contract for commands that write query result as Hive files

basicWriteJobStatsTracker Method

basicWriteJobStatsTracker simply creates and returns a new BasicWriteJobStatsTracker (with the given Hadoop Configuration and the metrics).

Note

basicWriteJobStatsTracker is used when:

Output Columns — outputColumns Method

outputColumns…​FIXME

Note

outputColumns is used when:
