关注 spark技术分享,
撸spark源码 玩spark最佳实践

Join

admin阅读(1658)

Join Logical Operator

Join is a binary logical operator, i.e. works with two logical operators. Join has a join type and an optional expression condition for the join.

Join is created when…​FIXME

Note
CROSS JOIN is just an INNER JOIN with no join condition.

Join has output schema attributes…​FIXME

Creating Join Instance

Join takes the following when created:

Intersect

admin阅读(1301)

Intersect

Intersect is…​FIXME

InsertIntoTable

admin阅读(1859)

InsertIntoTable Unary Logical Operator

InsertIntoTable is a unary logical operator that represents the following:

InsertIntoTable is created with partition keys that correspond to the partitionSpec part of the following SQL statements:

InsertIntoTable has no partition keys when created as follows:

InsertIntoTable can never be resolved (i.e. InsertIntoTable should not be part of a logical plan after analysis and is supposed to be converted to logical commands at analysis phase).

Table 1. InsertIntoTable’s Logical Resolutions (Conversions)
Logical Command Description

InsertIntoHiveTable

When HiveAnalysis resolution rule transforms InsertIntoTable with a HiveTableRelation

InsertIntoDataSourceCommand

When DataSourceAnalysis posthoc logical resolution resolves an InsertIntoTable with a LogicalRelation over an InsertableRelation (with no partitions defined)

InsertIntoHadoopFsRelationCommand

When DataSourceAnalysis posthoc logical resolution transforms InsertIntoTable with a LogicalRelation over a HadoopFsRelation

Caution
FIXME What’s the difference between HiveAnalysis that converts InsertIntoTable(r: HiveTableRelation…​) to InsertIntoHiveTable and RelationConversions that converts InsertIntoTable(r: HiveTableRelation,…​) to InsertIntoTable (with LogicalRelation)?
Note
Inserting into views or RDD-based tables is not allowed (and fails at analysis).

InsertIntoTable (with UnresolvedRelation leaf logical operator) is created when:

InsertIntoTable has an empty output schema.

Tip

Use insertInto operator from the Catalyst DSL to create an InsertIntoTable logical operator, e.g. for testing or Spark SQL internals exploration.

Creating InsertIntoTable Instance

InsertIntoTable takes the following when created:

  • Logical plan for the table to insert into

  • Partition keys (with optional partition values for dynamic partition insert)

  • Logical plan representing the data to be written

  • overwrite flag that indicates whether to overwrite an existing table or partitions (true) or not (false)

  • ifPartitionNotExists flag

Inserting Into View Not Allowed

Inserting into a view is not allowed, i.e. a query plan with an InsertIntoTable operator with a UnresolvedRelation leaf operator that is resolved to a View unary operator fails at analysis (when ResolveRelations logical resolution is executed).

Inserting Into RDD-Based Table Not Allowed

Inserting into an RDD-based table is not allowed, i.e. a query plan with an InsertIntoTable operator with one of the following logical operators (as the logical plan representing the table) fails at analysis (when PreWriteCheck extended logical check is executed):

InsertIntoHiveTable

admin阅读(1737)

InsertIntoHiveTable Logical Command

InsertIntoHiveTable is…​FIXME

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

processInsert Internal Method

processInsert…​FIXME

Note
processInsert is used exclusively when InsertIntoHiveTable logical command is executed.

InsertIntoHiveDirCommand

admin阅读(1663)

InsertIntoHiveDirCommand Logical Command

InsertIntoHiveDirCommand is…​FIXME

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

InsertIntoHadoopFsRelationCommand

admin阅读(4638)

InsertIntoHadoopFsRelationCommand Logical Command

InsertIntoHadoopFsRelationCommand is a DataWritingCommand that writes the data of the query out using the FileFormat.

InsertIntoHadoopFsRelationCommand is created when:

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run uses the spark.sql.hive.manageFilesourcePartitions configuration property to…​FIXME

Caution
FIXME When is the catalogTable defined?
Caution
FIXME When is tracksPartitionsInCatalog of CatalogTable enabled?

run…​FIXME

Creating InsertIntoHadoopFsRelationCommand Instance

InsertIntoHadoopFsRelationCommand takes the following when created:

  • Output Hadoop’s Path

  • Static table partitions (Map[String, String])

  • ifPartitionNotExists flag

  • Partition columns (Seq[Attribute])

  • BucketSpec

  • FileFormat

  • Options (Map[String, String])

  • Logical plan

  • SaveMode

  • CatalogTable

  • FileIndex

  • Output column names

Note

staticPartitions may hold zero or more partitions as follows:

With that, staticPartitions are simply the partitions of an InsertIntoTable logical operator.

InsertIntoDir

admin阅读(1394)

InsertIntoDir Unary Logical Operator

InsertIntoDir is…​FIXME

InsertIntoDataSourceDirCommand

admin阅读(1782)

InsertIntoDataSourceDirCommand Logical Command

InsertIntoDataSourceDirCommand is a logical command that FIXME.

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

InsertIntoDataSourceCommand

admin阅读(1705)

InsertIntoDataSourceCommand Logical Command

InsertIntoDataSourceCommand is a RunnableCommand that inserts or overwrites data in an InsertableRelation (per overwrite flag).

InsertIntoDataSourceCommand is created exclusively when DataSourceAnalysis logical resolution is executed (and resolves an InsertIntoTable unary logical operator with a LogicalRelation on an InsertableRelation).

InsertIntoDataSourceCommand returns the logical query plan when requested for the inner nodes (that should be shown as an inner nested tree of this node).

Executing Logical Command (Inserting or Overwriting Data in InsertableRelation) — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run takes the InsertableRelation (that is the relation of the LogicalRelation).

run then creates a DataFrame for the logical query plan and the input SparkSession.

run requests the DataFrame for the QueryExecution that in turn is requested for the RDD (of the structured query). run requests the LogicalRelation for the output schema.

With the RDD and the output schema, run creates another DataFrame that is the RDD[InternalRow] with the schema applied.

run requests the InsertableRelation to insert or overwrite data.

In the end, since the data in the InsertableRelation has changed, run requests the CacheManager to recacheByPlan with the LogicalRelation.

Note
run requests the SparkSession for SharedState that is in turn requested for the CacheManager.

Creating InsertIntoDataSourceCommand Instance

InsertIntoDataSourceCommand takes the following when created:

InMemoryRelation

admin阅读(1849)

InMemoryRelation Leaf Logical Operator For Cached Physical Query Plans

InMemoryRelation is a leaf logical operator that represents a cached child physical query plan.

InMemoryRelation is created when:

InMemoryRelation is a MultiInstanceRelation so a new instance will be created to appear multiple times in a physical query plan.

Note

InMemoryRelation is created using apply factory method that accepts no output attributes and so uses the output of the child physical plan instead.

Table 1. InMemoryRelation’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

partitionStatistics

PartitionStatistics for the output schema

Used exclusively when InMemoryTableScanExec is created (and initializes stats internal property).

Computing Statistics — computeStats Method

Note
computeStats is part of LeafNode Contract to compute statistics for cost-based optimizer.

computeStats…​FIXME

Creating InMemoryRelation Instance

InMemoryRelation takes the following when created:

  • Output schema attributes

  • useCompression flag

  • Batch size

  • Storage level

  • Child physical query plan

  • Table name (if used)

  • Cached column buffers (as RDD[CachedBatch])

  • Size in bytes statistic (as LongAccumulator)

  • Statistics of the child query plan

withOutput Method

withOutput…​FIXME

Note
withOutput is used exclusively when CacheManager is requested to replace logical query segments with cached query plans.

newInstance Method

Note
newInstance is part of MultiInstanceRelation Contract to…​FIXME.

newInstance…​FIXME

cachedColumnBuffers Method

cachedColumnBuffers…​FIXME

Note
cachedColumnBuffers is used when…​FIXME

PartitionStatistics

Note
PartitionStatistics is a private[columnar] class.

PartitionStatistics…​FIXME

Note
PartitionStatistics is used exclusively when InMemoryRelation is created (and initializes partitionStatistics).

关注公众号:spark技术分享

联系我们联系我们