
MultiInstanceRelation

MultiInstanceRelation is a contract of logical operators of which a single instance might appear multiple times in a logical query plan.

When the ResolveReferences logical resolution rule is executed, every MultiInstanceRelation in a logical query plan is requested to produce a new version of itself with globally unique expression IDs.
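The contract itself is tiny. A minimal, self-contained sketch (where LogicalPlan, ToyRelation, and the expression-id generator are stand-ins for illustration, not Spark API) could look like this:

```scala
// Simplified model of the MultiInstanceRelation contract.
// LogicalPlan, ToyRelation and freshExprId are stand-ins, not Spark API.
object MultiInstanceRelationSketch {
  private var nextExprId = 0L
  def freshExprId(): Long = { nextExprId += 1; nextExprId } // stand-in for new expression ids

  trait LogicalPlan // stand-in for Catalyst's LogicalPlan

  // The contract: produce a copy of this relation with fresh, globally unique expression ids
  trait MultiInstanceRelation {
    def newInstance(): LogicalPlan
  }

  // A toy leaf relation whose single output attribute is identified by exprId
  case class ToyRelation(name: String, exprId: Long) extends LogicalPlan with MultiInstanceRelation {
    override def newInstance(): ToyRelation = copy(exprId = freshExprId())
  }
}
```

Without newInstance(), a self-join of the same relation instance would carry conflicting expression ids; the contract lets the analyzer give each occurrence its own ids.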

Table 1. MultiInstanceRelations
MultiInstanceRelation Description

ContinuousExecutionRelation

Used in Spark Structured Streaming

DataSourceV2Relation

ExternalRDD

HiveTableRelation

InMemoryRelation

LocalRelation

LogicalRDD

LogicalRelation

Range

View

StreamingExecutionRelation

Used in Spark Structured Streaming

StreamingRelation

Used in Spark Structured Streaming

StreamingRelationV2

Used in Spark Structured Streaming

CreateStruct Function Builder

CreateStruct is a function builder (i.e. Seq[Expression] ⇒ Expression) that creates CreateNamedStruct expressions and provides the metadata of the struct function.

Metadata of struct Function — registryEntry Property

registryEntry…​FIXME

Note
registryEntry is used exclusively when FunctionRegistry is requested for the function expression registry.

Creating CreateNamedStruct Expression — apply Method

Note
apply is part of Scala’s scala.Function1 contract to create a function of one parameter (e.g. Seq[Expression]).

apply creates a CreateNamedStruct expression with the input children expressions as follows:

  • For NamedExpression expressions that are resolved, apply creates a pair of a Literal expression (with the name of the NamedExpression) and the NamedExpression itself

  • For NamedExpression expressions that are not resolved yet, apply creates a pair of a NamePlaceholder expression and the NamedExpression itself

  • For all other expressions, apply creates a pair of a Literal expression (with a generated name of the form col<position>, e.g. col1, col2) and the Expression itself
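The three cases above can be sketched as a self-contained model (the expression classes below are simplified stand-ins for Catalyst's, and the 1-based col<n> naming is an assumption here):

```scala
// Simplified model of CreateStruct.apply: pair each child expression with its "name".
object CreateStructSketch {
  sealed trait Expression
  case class Literal(value: Any) extends Expression
  case object NamePlaceholder extends Expression
  case class NamedExpr(name: String, resolved: Boolean) extends Expression // stand-in for NamedExpression
  case class OtherExpr(id: Int) extends Expression
  case class CreateNamedStruct(children: Seq[Expression])

  def apply(children: Seq[Expression]): CreateNamedStruct =
    CreateNamedStruct(children.zipWithIndex.flatMap {
      case (e: NamedExpr, _) if e.resolved => Seq(Literal(e.name), e)  // resolved: use its name
      case (e: NamedExpr, _)               => Seq(NamePlaceholder, e)  // unresolved: placeholder
      case (e, index)                      => Seq(Literal(s"col${index + 1}"), e) // generated name
    })
}
```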

Note

apply is used when:

ScalaReflection

ScalaReflection is the contract and the only implementation of the contract with…​FIXME

serializerFor Object Method

serializerFor first finds the local type of the input type T and then its class name.

serializerFor uses the internal version of itself with the input inputObject expression, the tpe type and the walkedTypePath with the class name found earlier (of the input type T).

In the end, serializerFor returns one of the following:

Note
serializerFor is used when…​FIXME
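The first two steps (finding the local type of T and then its class name) can be sketched with Scala 2 runtime reflection, which is close to what ScalaReflection builds on; the method bodies below are simplified approximations, not Spark's exact code:

```scala
import scala.reflect.runtime.universe._

// Simplified approximations of ScalaReflection.localTypeOf and getClassNameFromType.
object ReflectionSketch {
  // The local (dealiased) type of T, recovered from its implicit TypeTag
  def localTypeOf[T: TypeTag]: Type = typeTag[T].tpe.dealias

  // The fully-qualified class name behind a (possibly aliased) type
  def getClassNameFromType(tpe: Type): String = tpe.dealias.erasure.typeSymbol.asClass.fullName
}
```

For example, getClassNameFromType(localTypeOf[String]) yields java.lang.String once the String type alias is dealiased.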

serializerFor Internal Method

serializerFor…​FIXME

Note
serializerFor is used exclusively when ScalaReflection is requested to serializerFor.

localTypeOf Object Method

localTypeOf…​FIXME

Note
localTypeOf is used when…​FIXME

getClassNameFromType Object Method

getClassNameFromType…​FIXME

Note
getClassNameFromType is used when…​FIXME

definedByConstructorParams Object Method

definedByConstructorParams…​FIXME

Note
definedByConstructorParams is used when…​FIXME

AggUtils Helper Object

AggUtils is a Scala object that defines the methods used exclusively when Aggregation execution planning strategy is executed.

planAggregateWithOneDistinct Method

planAggregateWithOneDistinct…​FIXME

Note
planAggregateWithOneDistinct is used exclusively when Aggregation execution planning strategy is executed.

Creating Physical Plan with Two Aggregate Physical Operators for Partial and Final Aggregations — planAggregateWithoutDistinct Method

planAggregateWithoutDistinct is a two-step physical operator generator.

planAggregateWithoutDistinct first creates an aggregate physical operator with aggregateExpressions in Partial mode (for partial aggregations).

Note
requiredChildDistributionExpressions for the aggregate physical operator for partial aggregation “stage” is empty.

In the end, planAggregateWithoutDistinct creates another aggregate physical operator (of the same type as before), but aggregateExpressions are now in Final mode (for final aggregations). The aggregate physical operator becomes the parent of the first aggregate operator.

Note
requiredChildDistributionExpressions for the parent aggregate physical operator for final aggregation “stage” are the attributes of groupingExpressions.
Note
planAggregateWithoutDistinct is used exclusively when Aggregation execution planning strategy is executed (with no AggregateExpressions being distinct).
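The partial/final arrangement can be illustrated with a plain-Scala model of computing an average, where each "partition" produces a partial (sum, count) buffer and the parent step merges them; this models the two aggregation modes only, not Spark's physical operators:

```scala
// Plain-Scala illustration of Partial vs Final aggregation modes (not Spark API).
object TwoStepAggSketch {
  // Partial mode: each partition reduces its rows to a partial aggregation buffer
  def partial(partition: Seq[Int]): (Long, Long) =
    (partition.map(_.toLong).sum, partition.size.toLong)

  // Final mode: the parent operator merges the partial buffers into the final value
  def finalMerge(buffers: Seq[(Long, Long)]): Double = {
    val (sum, count) = buffers.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    sum.toDouble / count
  }
}
```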

Creating Aggregate Physical Operator — createAggregate Internal Method

createAggregate creates a physical operator given the input aggregateExpressions aggregate expressions.

Table 1. createAggregate’s Aggregate Physical Operator Selection Criteria (in execution order)
Aggregate Physical Operator Selection Criteria

HashAggregateExec

HashAggregateExec supports all aggBufferAttributes of the input aggregateExpressions aggregate expressions.

ObjectHashAggregateExec

  1. spark.sql.execution.useObjectHashAggregateExec internal flag enabled (it is by default)

  2. ObjectHashAggregateExec supports the input aggregateExpressions aggregate expressions.

SortAggregateExec

When none of the above requirements is met.

Note
createAggregate is used when AggUtils is requested to planAggregateWithoutDistinct, planAggregateWithOneDistinct (and planStreamingAggregation for Spark Structured Streaming)
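The selection order in Table 1 can be condensed into a small sketch (stand-in types; the flag and predicate names mirror the criteria above rather than Spark's exact signatures):

```scala
// Sketch of createAggregate's operator selection (stand-in types, not Spark API).
object CreateAggregateSketch {
  sealed trait AggregateExec
  case object HashAggregateExec extends AggregateExec
  case object ObjectHashAggregateExec extends AggregateExec
  case object SortAggregateExec extends AggregateExec

  def select(
      hashSupported: Boolean,     // HashAggregateExec supports all aggBufferAttributes
      useObjectHashFlag: Boolean, // spark.sql.execution.useObjectHashAggregateExec
      objectHashSupported: Boolean): AggregateExec =
    if (hashSupported) HashAggregateExec
    else if (useObjectHashFlag && objectHashSupported) ObjectHashAggregateExec
    else SortAggregateExec
}
```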

SchemaUtils Helper Object

SchemaUtils is a Scala object that provides utility methods for detecting duplicate column names in schemas.

checkColumnNameDuplication Method

  1. Uses the other checkColumnNameDuplication with caseSensitiveAnalysis flag per isCaseSensitiveAnalysis

checkColumnNameDuplication…​FIXME

Note
checkColumnNameDuplication is used when…​FIXME
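A hedged sketch of the idea behind the check, with the case-sensitivity flag deciding how names are compared (the method body and message format below are illustrative, not Spark's exact code):

```scala
// Sketch of duplicate-column detection with a case-sensitivity flag
// (stand-in for SchemaUtils.checkColumnNameDuplication; not Spark's exact code).
object SchemaUtilsSketch {
  def checkColumnNameDuplication(columnNames: Seq[String], caseSensitiveAnalysis: Boolean): Unit = {
    // Case-insensitive analysis compares lower-cased names
    val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
    val duplicates = names.groupBy(identity).collect { case (name, group) if group.size > 1 => name }
    require(duplicates.isEmpty, s"Found duplicate column(s): ${duplicates.mkString(", ")}")
  }
}
```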

checkSchemaColumnNameDuplication Method

checkSchemaColumnNameDuplication…​FIXME

Note
checkSchemaColumnNameDuplication is used when…​FIXME

isCaseSensitiveAnalysis Internal Method

isCaseSensitiveAnalysis…​FIXME

Note
isCaseSensitiveAnalysis is used when…​FIXME

PredicateHelper Scala Trait

PredicateHelper defines methods for working with predicate expressions.

Table 1. PredicateHelper’s Methods
Method Description

splitConjunctivePredicates

splitDisjunctivePredicates

replaceAlias

canEvaluate

canEvaluateWithinJoin

Splitting Conjunctive Predicates — splitConjunctivePredicates Method

splitConjunctivePredicates takes the input condition expression and splits it into two expressions if it is an And binary expression.

splitConjunctivePredicates recursively splits the child expressions until no conjunctive And binary expressions remain.
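A minimal model of the recursion, using a toy expression tree instead of Catalyst's:

```scala
// Minimal model of recursive conjunct splitting (toy expression tree, not Catalyst's).
object SplitConjunctsSketch {
  sealed trait Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Pred(name: String) extends Expr

  // Recurse into And nodes; any other expression is a single conjunct
  def splitConjunctivePredicates(condition: Expr): Seq[Expr] = condition match {
    case And(l, r) => splitConjunctivePredicates(l) ++ splitConjunctivePredicates(r)
    case other     => Seq(other)
  }
}
```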

splitDisjunctivePredicates Method

splitDisjunctivePredicates…​FIXME

Note
splitDisjunctivePredicates is used when…​FIXME

replaceAlias Method

replaceAlias…​FIXME

Note
replaceAlias is used when…​FIXME

canEvaluate Method

canEvaluate…​FIXME

Note
canEvaluate is used when…​FIXME

canEvaluateWithinJoin Method

canEvaluateWithinJoin indicates whether a Catalyst expression can be evaluated within a join, i.e. when one of the following conditions holds:

  • Expression is deterministic

  • Expression is not an Unevaluable, a ListQuery or an Exists expression

  • Expression is a SubqueryExpression with no child expressions

  • Expression is an AttributeReference

  • Any expression with child expressions that meet one of the above conditions
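The case analysis above can be sketched with toy expression classes standing in for Catalyst's (the order of cases matters, as in a Scala pattern match; this is a simplified model, not Spark's exact code):

```scala
// Sketch of the canEvaluateWithinJoin case analysis (toy expression tree, not Catalyst's).
object CanEvaluateWithinJoinSketch {
  sealed trait Expr { def children: Seq[Expr] = Nil; def deterministic: Boolean = true }
  case class ListQuery() extends Expr
  case class Exists() extends Expr
  case class SubqueryExpr(override val children: Seq[Expr]) extends Expr
  case class AttributeRef(name: String) extends Expr
  case class Unevaluable() extends Expr
  case class Other(override val children: Seq[Expr], override val deterministic: Boolean) extends Expr

  def canEvaluateWithinJoin(e: Expr): Boolean = e match {
    case _: ListQuery | _: Exists => false
    case s: SubqueryExpr          => s.children.isEmpty // non-correlated subquery only
    case _: AttributeRef          => true
    case _: Unevaluable           => false
    case other => other.deterministic && other.children.forall(canEvaluateWithinJoin)
  }
}
```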

Note

canEvaluateWithinJoin is used when:

  • PushPredicateThroughJoin logical optimization rule is executed

  • ReorderJoin logical optimization rule does createOrderedJoin

SubExprUtils Helper Object

SubExprUtils is a Scala object that is used for…​FIXME

SubExprUtils uses PredicateHelper for…​FIXME

Checking If Condition Expression Has Any Null-Aware Predicate Subqueries Inside Not — hasNullAwarePredicateWithinNot Method

hasNullAwarePredicateWithinNot splits conjunctive predicates (i.e. expressions separated by And expressions).

hasNullAwarePredicateWithinNot is positive (i.e. true), and the condition is considered to have a null-aware predicate subquery inside a Not expression, when the conjunctive predicate expressions include a Not expression with an In predicate expression over a ListQuery subquery expression.

hasNullAwarePredicateWithinNot is negative (i.e. false) for all the other expressions and in particular the following expressions:

  1. Exists predicate subquery expressions

  2. Not expressions with a Exists predicate subquery expression as the child expression

  3. In expressions with a ListQuery subquery expression as the list expression

  4. Not expressions with a In expression (with a ListQuery subquery expression as the list expression)
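A self-contained model of this check, where InListQuery stands in for an In predicate whose list expression is a ListQuery; per the exceptions above, a conjunct that is itself one of those top-level forms is negative, while a Not and an In-ListQuery nested deeper inside a conjunct is positive (toy classes, not Spark's code):

```scala
// Toy model of hasNullAwarePredicateWithinNot (stand-in expression tree).
object NullAwareNotSketch {
  sealed trait Expr { def children: Seq[Expr] = Nil }
  case class And(left: Expr, right: Expr) extends Expr { override def children = Seq(left, right) }
  case class Or(left: Expr, right: Expr) extends Expr { override def children = Seq(left, right) }
  case class Not(child: Expr) extends Expr { override def children = Seq(child) }
  case class InListQuery() extends Expr // stand-in for In(_, ListQuery)
  case class Exists() extends Expr
  case class Attr(name: String) extends Expr

  private def conjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case other     => Seq(other)
  }
  private def contains(e: Expr, p: Expr => Boolean): Boolean =
    p(e) || e.children.exists(contains(_, p))

  def hasNullAwarePredicateWithinNot(condition: Expr): Boolean =
    conjuncts(condition).exists {
      // the top-level forms listed in the text are handled elsewhere, hence negative here
      case Exists() | Not(Exists()) | InListQuery() | Not(InListQuery()) => false
      case e => contains(e, _.isInstanceOf[Not]) && contains(e, _.isInstanceOf[InListQuery])
    }
}
```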

Note
hasNullAwarePredicateWithinNot is used exclusively when CheckAnalysis analysis validation is requested to validate analysis of a logical plan (with Filter logical operators).

StatFunctions Helper Object

StatFunctions is a Scala object that defines the methods that are used for…​FIXME

Table 1. StatFunctions API
Method Description

calculateCov

crossTabulate

multipleApproxQuantiles

pearsonCorrelation

summary

calculateCov Method

calculateCov…​FIXME

Note
calculateCov is used when…​FIXME

crossTabulate Method

crossTabulate…​FIXME

Note
crossTabulate is used when…​FIXME

multipleApproxQuantiles Method

multipleApproxQuantiles…​FIXME

Note
multipleApproxQuantiles is used when…​FIXME

pearsonCorrelation Method

pearsonCorrelation…​FIXME

Note
pearsonCorrelation is used when…​FIXME

Calculating Statistics For Dataset — summary Method

summary…​FIXME

Note
summary is used exclusively when Dataset.summary action is used.

CatalystTypeConverters Helper Object

CatalystTypeConverters is a Scala object that is used to convert Scala types to Catalyst types and vice versa.

createToCatalystConverter Method

createToCatalystConverter…​FIXME

Note
createToCatalystConverter is used when…​FIXME

convertToCatalyst Method

convertToCatalyst…​FIXME

Note
convertToCatalyst is used when…​FIXME

RDDConversions Helper Object

RDDConversions is a Scala object that defines the productToRowRdd and rowToRowRdd methods.

productToRowRdd Method

productToRowRdd…​FIXME

Note
productToRowRdd is used when…​FIXME

Converting Scala Objects In Rows to Values Of Catalyst Types — rowToRowRdd Method

rowToRowRdd maps over partitions of the input RDD[Row] (using the RDD.mapPartitions operator), which creates a MapPartitionsRDD with a “map” function.

Tip
Use RDD.toDebugString to see the additional MapPartitionsRDD in an RDD lineage.

The “map” function takes a Scala Iterator of Row objects and does the following:

  1. Creates a GenericInternalRow (of the size that is the number of columns per the input Seq[DataType])

  2. Creates a converter function for every DataType in Seq[DataType]

  3. For every Row object in the partition (iterator), applies the converter function per position and adds the result value to the GenericInternalRow

  4. In the end, returns a GenericInternalRow for every row
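The four steps can be modeled in plain Scala; the row, converter, and DataType stand-ins below are illustrative, not CatalystTypeConverters or Spark's row classes:

```scala
// Simplified model of the per-partition "map" function (stand-in row and converter types).
object RowConversionSketch {
  type Converter = Any => Any

  // Stand-in for creating a converter function per DataType (step 2)
  def converterFor(dataType: String): Converter = dataType match {
    case "int"    => v => v.asInstanceOf[Int].toLong // hypothetical int-to-long conversion
    case "string" => v => v.toString
  }

  def convertPartition(rows: Iterator[Seq[Any]], dataTypes: Seq[String]): Iterator[Array[Any]] = {
    val numColumns = dataTypes.length            // step 1: size from Seq[DataType]
    val converters = dataTypes.map(converterFor) // step 2: one converter per DataType
    rows.map { row =>
      val internalRow = new Array[Any](numColumns) // stand-in for GenericInternalRow
      var i = 0
      while (i < numColumns) {
        internalRow(i) = converters(i)(row(i))   // step 3: convert per position
        i += 1
      }
      internalRow                                // step 4: one internal row per input row
    }
  }
}
```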

Note
rowToRowRdd is used exclusively when DataSourceStrategy execution planning strategy is executed (and requested to toCatalystRDD).
