关注 spark技术分享,
撸spark源码 玩spark最佳实践

SizeEstimator

admin阅读(1819)

SizeEstimator

SizeEstimator is…​FIXME

estimate Method

  1. Uses estimate with an empty IdentityHashMap

estimate…​FIXME

Note

estimate is used when:

sampleArray Internal Method

sampleArray…​FIXME

Note
sampleArray is used when…​FIXME

visitSingleObject Internal Method

visitSingleObject…​FIXME

Note
visitSingleObject is used exclusively when SizeEstimator is requested to estimate.

KnownSizeEstimation

admin阅读(1619)

KnownSizeEstimation

KnownSizeEstimation is the contract that allows a class to give SizeEstimator a more accurate size estimation.

KnownSizeEstimation defines the single estimatedSize method.

estimatedSize is used when:

Note
KnownSizeEstimation is a private[spark] contract.
Note
HashedRelation is the only KnownSizeEstimation available.

UnsafeHashedRelation

admin阅读(1601)

UnsafeHashedRelation

UnsafeHashedRelation is…​FIXME

get Method

Note
get is part of HashedRelation Contract to give the internal rows for the given key or null.

get…​FIXME

Getting Value Row for Given Key — getValue Method

Note
getValue is part of HashedRelation Contract to give the value internal row for a given key.

getValue…​FIXME

Creating UnsafeHashedRelation Instance — apply Factory Method

apply…​FIXME

Note
apply is used when…​FIXME

LongHashedRelation

admin阅读(1895)

LongHashedRelation

LongHashedRelation is a HashedRelation that is used when HashedRelation is requested for a concrete HashedRelation instance when the single key is of type long.

LongHashedRelation is also a Java Externalizable, i.e. when persisted, only the identity is written in the serialization stream and it is the responsibility of the class to save and restore the contents of its instances.

LongHashedRelation is created when:

  1. HashedRelation is requested for a concrete HashedRelation (and apply factory method is used)

  2. LongHashedRelation is requested for a read-only copy (when BroadcastHashJoinExec is requested to execute)

writeExternal Method

Note
writeExternal is part of Java’s Externalizable Contract to…​FIXME.

writeExternal…​FIXME

Note
writeExternal is used when…​FIXME

readExternal Method

Note
readExternal is part of Java’s Externalizable Contract to…​FIXME.

readExternal…​FIXME

Note
readExternal is used when…​FIXME

Creating LongHashedRelation Instance

LongHashedRelation takes the following when created:

  • Number of fields

  • LongToUnsafeRowMap

LongHashedRelation initializes the internal registries and counters.

Creating Read-Only Copy of LongHashedRelation — asReadOnlyCopy Method

Note
asReadOnlyCopy is part of HashedRelation Contract to…​FIXME.

asReadOnlyCopy…​FIXME

Getting Value Row for Given Key — getValue Method

Note
getValue is part of HashedRelation Contract to give the value internal row for a given key.

getValue checks if the input key is null at 0 position and if so gives null. Otherwise, getValue takes the long value at position 0 and gets the value.

Creating LongHashedRelation Instance — apply Factory Method

apply…​FIXME

Note
apply is used exclusively when HashedRelation is requested for a concrete HashedRelation.

HashedRelation

admin阅读(1864)

HashedRelation

HashedRelation is the contract for “relations” with values hashed by some key.

HashedRelation is a KnownSizeEstimation.

Note
HashedRelation is a private[execution] contract.
Table 1. HashedRelation Contract
Method Description

asReadOnlyCopy

Gives a read-only copy of this HashedRelation to be safely used in a separate thread.

Used exclusively when BroadcastHashJoinExec is requested to execute (and transform every partitions of streamedPlan physical operator using the broadcast variable of buildPlan physical operator).

get

Gives internal rows for the given key or null

Used when HashJoin is requested to innerJoin, outerJoin, semiJoin, existenceJoin and antiJoin.

getValue

Gives the value internal row for a given key

Note
HashedRelation has two variants of getValue, i.e. one that accepts an InternalRow and another a Long. getValue with an InternalRow does not seem to be used at all.

getAverageProbesPerLookup

Used when…​FIXME

getValue Method

Note
This is getValue that takes a long key. There is the more generic getValue that takes an internal row instead.

getValue simply reports an UnsupportedOperationException (and expects concrete HashedRelations to provide a more meaningful implementation).

Note
getValue is used exclusively when LongHashedRelation is requested to get the value for a given key.

Creating Concrete HashedRelation Instance (for Build Side of Hash-based Join) — apply Factory Method

apply creates a LongHashedRelation when the input key collection has a single expression of type long or UnsafeHashedRelation otherwise.

Note

The input key expressions are:

Note

apply is used when:

HashJoin — Contract for Hash-based Join Physical Operators

admin阅读(1857)

HashJoin — Contract for Hash-based Join Physical Operators

HashJoin is the contract for hash-based join physical operators (e.g. BroadcastHashJoinExec and ShuffledHashJoinExec).

Table 1. HashJoin Contract
Method Description

buildSide

Left or right build side

Used when:

joinType

JoinType

Table 2. HashJoin’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

boundCondition

buildKeys

Build join keys (as Catalyst expressions)

buildPlan

streamedKeys

Streamed join keys (as Catalyst expressions)

streamedPlan

join Method

join branches off per joinType to create a join iterator of internal rows (i.e. Iterator[InternalRow]) for the input streamedIter and hashed:

join requests TaskContext to add a TaskCompletionListener to update the input avg hash probe SQL metric. The TaskCompletionListener is executed on a task completion (regardless of the task status: success, failure, or cancellation) and uses getAverageProbesPerLookup from the input hashed to set the input avg hash probe.

In the end, for every row in the join iterator of internal rows join increments the input numOutputRows SQL metric and applies the result projection.

join reports a IllegalArgumentException when the joinType is incorrect.

Note
join is used when BroadcastHashJoinExec and ShuffledHashJoinExec are executed.

innerJoin Internal Method

innerJoin…​FIXME

Note
innerJoin is used when…​FIXME

outerJoin Internal Method

outerJoin…​FIXME

Note
outerJoin is used when…​FIXME

semiJoin Internal Method

semiJoin…​FIXME

Note
semiJoin is used when…​FIXME

antiJoin Internal Method

antiJoin…​FIXME

Note
antiJoin is used when…​FIXME

existenceJoin Internal Method

existenceJoin…​FIXME

Note
existenceJoin is used when…​FIXME

createResultProjection Method

createResultProjection…​FIXME

Note
createResultProjection is used when…​FIXME

PhysicalOperation — Scala Extractor for Destructuring Logical Query Plans

admin阅读(4233)

PhysicalOperation — Scala Extractor for Destructuring Logical Query Plans

PhysicalOperation is a Scala extractor to destructure a logical query plan into a tuple with the following elements:

  1. Named expressions (aka projects)

  2. Expressions (aka filters)

  3. Logical operator (aka leaf node)

The following idiom is often used in Strategy implementations (e.g. HiveTableScans, InMemoryScans, DataSourceStrategy, FileSourceStrategy):

Whenever used to pattern match to a LogicalPlan, PhysicalOperation‘s unapply is called.

unapply Method

unapply…​FIXME

Note
unapply is almost collectProjectsAndFilters method itself (with some manipulations of the return value).
Note

unapply is used when…​FIXME

PhysicalAggregation — Scala Extractor for Destructuring Aggregate Logical Operators

admin阅读(1866)

PhysicalAggregation — Scala Extractor for Destructuring Aggregate Logical Operators

PhysicalAggregation is a Scala extractor to destructure an Aggregate logical operator into a four-element tuple with the following elements:

Tip
See the document about Scala extractor objects.

Destructuring Aggregate Logical Operator — unapply Method

unapply destructures the input a Aggregate logical operator into a four-element ReturnType.

Note

unapply is used when…​FIXME

TypeCoercionRule — Contract For Type Coercion Rules

admin阅读(1966)

TypeCoercionRule Contract — Type Coercion Rules

TypeCoercionRule is the contract of logical rules to coerce and propagate types in logical plans.

Table 1. (Subset of) TypeCoercionRule Contract
Method Description

coerceTypes

Coerce types in a logical plan

Used exclusively when TypeCoercionRule is executed

TypeCoercionRule is simply a Catalyst rule for transforming logical plans, i.e. Rule[LogicalPlan].

Table 2. TypeCoercionRules
TypeCoercionRule Description

CaseWhenCoercion

ConcatCoercion

DecimalPrecision

Division

EltCoercion

FunctionArgumentConversion

IfCoercion

ImplicitTypeCasts

InConversion

PromoteStrings

StackCoercion

WindowFrameCoercion

Executing Rule — apply Method

Note
apply is part of the Rule Contract to execute (apply) a rule on a TreeNode (e.g. LogicalPlan).

apply coerceTypes in the input LogicalPlan and returns the following:

propagateTypes Internal Method

propagateTypes…​FIXME

Note
propagateTypes is used exclusively when TypeCoercionRule is executed.

TypeCoercion Object

admin阅读(1490)

TypeCoercion Object

TypeCoercion is a Scala object that defines the type coercion rules for Spark Analyzer.

Defining Type Coercion Rules (For Spark Analyzer) — typeCoercionRules Method

typeCoercionRules is a collection of Catalyst rules to transform logical plans (in the order of execution):

  1. InConversion

  2. WidenSetOperationTypes

  3. PromoteStrings

  4. DecimalPrecision

  5. BooleanEquality

  6. FunctionArgumentConversion

  7. ConcatCoercion

  8. EltCoercion

  9. CaseWhenCoercion

  10. IfCoercion

  11. StackCoercion

  12. Division

  13. ImplicitTypeCasts

  14. DateTimeOperations

  15. WindowFrameCoercion

Note
typeCoercionRules is used exclusively when Analyzer is requested for batches.

关注公众号:spark技术分享

联系我们联系我们