Follow spark技术分享 — digging into the Spark source code and Spark best practices

Repartition and RepartitionByExpression


Repartition Logical Operators — Repartition and RepartitionByExpression

Repartition and RepartitionByExpression (repartition operations in short) are unary logical operators that create a new RDD that has exactly numPartitions partitions.

Note
RepartitionByExpression is also called the distribute operator.

Repartition is the result of coalesce or repartition (with no partition expressions defined) operators.

RepartitionByExpression is the result of the following operators:

  • repartition (with partition expressions defined)

  • SQL's DISTRIBUTE BY and CLUSTER BY clauses

Repartition and RepartitionByExpression logical operators are described by:

  • shuffle flag

  • target number of partitions

Note
BasicOperators execution planning strategy resolves Repartition to the ShuffleExchangeExec (with RoundRobinPartitioning partitioning scheme) or CoalesceExec physical operator, depending on whether the shuffle flag is enabled or disabled, respectively.
Note
BasicOperators strategy resolves RepartitionByExpression to ShuffleExchangeExec physical operator with HashPartitioning partitioning scheme.
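The two notes above can be checked interactively. A minimal sketch (assuming Spark 2.3+ on the classpath and a local SparkSession; the explain() output is indicative, not exact):

```scala
import org.apache.spark.sql.SparkSession

// A local SparkSession for exploration (assumption: nothing else is running)
val spark = SparkSession.builder().master("local[*]").appName("repartition-demo").getOrCreate()
import spark.implicits._

val df = (0 until 100).toDF("id")

// repartition(n) => Repartition (shuffle enabled) => ShuffleExchangeExec with RoundRobinPartitioning
df.repartition(4).explain()

// coalesce(n) => Repartition (shuffle disabled) => CoalesceExec
df.coalesce(2).explain()

// repartition(n, exprs) => RepartitionByExpression => ShuffleExchangeExec with HashPartitioning
df.repartition(4, $"id").explain()
```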

Repartition Operation Optimizations

Project


Project Unary Logical Operator

Project is a unary logical operator that takes the following when created:

Project is created to represent the following:

  • Dataset operators, e.g. joinWith, select (incl. selectUntyped) and unionByName

  • KeyValueGroupedDataset operators, e.g. keys and mapValues

  • CreateViewCommand logical command when executed (and its aliasPlan)

  • SQL’s SELECT queries with named expressions

Project can also appear in a logical plan after analysis or optimization phases.

Note
Nondeterministic expressions are allowed in the Project logical operator, which CheckAnalysis enforces.

The output schema of a Project is…​FIXME

maxRows…​FIXME

resolved…​FIXME

validConstraints…​FIXME

Tip

Use select operator from Catalyst DSL to create a Project logical operator, e.g. for testing or Spark SQL internals exploration.
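For example (a sketch using Spark SQL's internal Catalyst DSL; package names as of Spark 2.3):

```scala
import org.apache.spark.sql.catalyst.dsl.plans._        // table, select, ...
import org.apache.spark.sql.catalyst.dsl.expressions._  // Symbol-to-attribute conversions
import org.apache.spark.sql.catalyst.plans.logical.Project

// select on an (unresolved) relation creates a Project logical operator
val plan = table("t1").select('id, 'name)

assert(plan.isInstanceOf[Project])
println(plan.numberedTreeString)
```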

Pivot


Pivot Unary Logical Operator

Pivot is a unary logical operator that represents the pivot operator.

Pivot is created when RelationalGroupedDataset creates a DataFrame for an aggregate operator.

Analysis Phase

Pivot operator is resolved at analysis phase in the following logical evaluation rules:

Pivot operator “disappears” behind (i.e. is converted to) an Aggregate logical operator (possibly under a Project operator).
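A quick way to observe the conversion (a sketch, assuming a SparkSession named spark; column names are made up):

```scala
import spark.implicits._

val sales = Seq(("2023", "Q1", 10), ("2023", "Q2", 20)).toDF("year", "quarter", "amount")

// pivot initially gives a Pivot logical operator...
val pivoted = sales.groupBy("year").pivot("quarter").sum("amount")

// ...but the analyzed plan contains Aggregate (possibly under Project), no Pivot
println(pivoted.queryExecution.analyzed.numberedTreeString)
```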

Creating Pivot Instance

Pivot takes the following when created:

LogicalRelation


LogicalRelation Leaf Logical Operator — Representing BaseRelations in Logical Plan

LogicalRelation is a leaf logical operator that represents a BaseRelation in a logical query plan.

LogicalRelation is created when:

Note

LogicalRelation can be created using apply factory methods that accept BaseRelation with optional CatalogTable.

The simple text representation of a LogicalRelation (aka simpleString) is Relation[output] [relation] (that uses the output and BaseRelation).

refresh Method

Note
refresh is part of LogicalPlan Contract to refresh itself.

refresh requests the FileIndex of a HadoopFsRelation relation to refresh.

Note
refresh does the work for HadoopFsRelation relations only.

Creating LogicalRelation Instance

LogicalRelation takes the following when created:

LogicalRDD


LogicalRDD — Logical Scan Over RDD

LogicalRDD is a leaf logical operator with MultiInstanceRelation support that represents a scan over an RDD of internal binary rows.

LogicalRDD is created when:

Note
LogicalRDD is resolved to RDDScanExec when BasicOperators execution planning strategy is executed.

newInstance Method

Note
newInstance is part of MultiInstanceRelation Contract to…​FIXME.

newInstance…​FIXME

Computing Statistics — computeStats Method

Note
computeStats is part of LeafNode Contract to compute statistics for cost-based optimizer.

computeStats…​FIXME

Creating LogicalRDD Instance

LogicalRDD takes the following when created:

LocalRelation


LocalRelation Leaf Logical Operator

LocalRelation is a leaf logical operator that allows functions like collect or take to be executed locally, i.e. without using Spark executors.

LocalRelation is created when…​FIXME

Note
When Dataset operators can be executed locally, the Dataset is considered local.

LocalRelation represents Datasets that were created from local collections using SparkSession.emptyDataset or SparkSession.createDataset methods and their derivatives like toDF.

A LocalRelation can only be constructed when all of its output attributes are resolved.

The size of the objects (in statistics) is the sum of the default sizes of the attributes, multiplied by the number of records.

When executed, LocalRelation is translated to LocalTableScanExec physical operator.
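Both claims can be observed directly (a sketch, assuming a SparkSession named spark):

```scala
import spark.implicits._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

// A Dataset built from a local collection is backed by a LocalRelation...
val ds = Seq(1, 2, 3).toDS

assert(ds.queryExecution.analyzed.isInstanceOf[LocalRelation])

// ...and is planned as LocalTableScanExec (shown as LocalTableScan)
ds.explain()
```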

Creating LocalRelation Instance

LocalRelation takes the following when created:

LocalRelation initializes the internal registries and counters.

LeafNode


LeafNode — Base Logical Operator with No Child Operators and Optional Statistics

LeafNode is the base of logical operators that have no child operators.

A LeafNode that wants to survive analysis has to override computeStats, which throws an UnsupportedOperationException by default.
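A minimal sketch of such an override (the MyLeaf operator is hypothetical; internal APIs as of Spark 2.3, where computeStats takes no arguments):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// Hypothetical leaf operator that overrides the default (throwing) computeStats
case class MyLeaf(output: Seq[Attribute]) extends LeafNode {
  override def computeStats(): Statistics =
    Statistics(sizeInBytes = 0L) // a real operator would estimate its data size here
}
```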

Table 1. LeafNodes (Direct Implementations)

AnalysisBarrier

DataSourceV2Relation

ExternalRDD

HiveTableRelation

InMemoryRelation

LocalRelation

LogicalRDD

LogicalRelation

OneRowRelation

Range

UnresolvedCatalogRelation

UnresolvedInlineTable

UnresolvedRelation

UnresolvedTableValuedFunction

Computing Statistics — computeStats Method

computeStats simply throws an UnsupportedOperationException.

Note
Logical operators, e.g. ExternalRDD, LogicalRDD and DataSourceV2Relation, or relations, e.g. HadoopFsRelation or BaseRelation, use spark.sql.defaultSizeInBytes internal property for the default estimated size if the statistics could not be computed.
Note
computeStats is used exclusively when SizeInBytesOnlyStatsPlanVisitor uses the default case to compute the size statistic (in bytes) for a logical operator.
