
ReuseSubquery

2013-04-22, admin, views: 1736

ReuseSubquery Physical Query Optimization

ReuseSubquery is a physical query optimization (aka physical query preparation rule or simply preparation rule) that QueryExecution uses to optimize the physical plan of a structured query by FIXME.

Technically, ReuseSubquery is just a Catalyst rule for transforming physical query plans, i.e. Rule[SparkPlan].

ReuseSubquery is part of the preparations batch of physical query plan rules and is executed when QueryExecution is requested for the optimized physical query plan (i.e. in the executedPlan phase of a query execution).

`apply` Method



apply(plan: SparkPlan): SparkPlan


Note	`apply` is part of Rule Contract to apply a rule to a physical plan.

apply…FIXME

ReuseExchange

2013-04-21, admin, views: 1704

ReuseExchange Physical Query Optimization

ReuseExchange is a physical query optimization (aka physical query preparation rule or simply preparation rule) that QueryExecution uses to optimize the physical plan of a structured query by FIXME.

Technically, ReuseExchange is just a Catalyst rule for transforming physical query plans, i.e. Rule[SparkPlan].

ReuseExchange is part of the preparations batch of physical query plan rules and is executed when QueryExecution is requested for the optimized physical query plan (i.e. in the executedPlan phase of a query execution).

`apply` Method



apply(plan: SparkPlan): SparkPlan


Note	`apply` is part of Rule Contract to apply a rule to a physical plan.

apply finds all Exchange unary operators and…FIXME

apply does nothing and simply returns the input physical plan if spark.sql.exchange.reuse internal configuration property is off (i.e. false).

Note	spark.sql.exchange.reuse internal configuration property is on (i.e. `true`) by default.

PlanSubqueries

2013-04-20, admin, views: 1958

PlanSubqueries Physical Query Optimization

PlanSubqueries is a physical query optimization (aka physical query preparation rule or simply preparation rule) that plans ScalarSubquery (SubqueryExpression) expressions (as ScalarSubquery ExecSubqueryExpression expressions).



import org.apache.spark.sql.SparkSession
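val spark: SparkSession = ...
// toDF below needs the session's implicits (imported automatically in spark-shell)
import spark.implicits._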

import org.apache.spark.sql.execution.PlanSubqueries
val planSubqueries = PlanSubqueries(spark)

Seq(
  (0, 0),
  (1, 0),
  (2, 1)
).toDF("id", "gid").createOrReplaceTempView("t")

Seq(
  (0, 3),
  (1, 20)
).toDF("gid", "lvl").createOrReplaceTempView("v")

val sql = """
  select * from t where gid > (select max(gid) from v)
"""
val q = spark.sql(sql)

val sparkPlan = q.queryExecution.sparkPlan
scala> println(sparkPlan.numberedTreeString)
00 Project [_1#49 AS id#52, _2#50 AS gid#53]
01 +- Filter (_2#50 > scalar-subquery#128 [])
02    :  +- Aggregate [max(gid#61) AS max(gid)#130]
03    :     +- LocalRelation [gid#61]
04    +- LocalTableScan [_1#49, _2#50]

val optimizedPlan = planSubqueries(sparkPlan)
scala> println(optimizedPlan.numberedTreeString)
00 Project [_1#49 AS id#52, _2#50 AS gid#53]
01 +- Filter (_2#50 > Subquery subquery128)
02    :  +- Subquery subquery128
03    :     +- *(2) HashAggregate(keys=[], functions=[max(gid#61)], output=[max(gid)#130])
04    :        +- Exchange SinglePartition
05    :           +- *(1) HashAggregate(keys=[], functions=[partial_max(gid#61)], output=[max#134])
06    :              +- LocalTableScan [gid#61]
07    +- LocalTableScan [_1#49, _2#50]


PlanSubqueries is part of the preparations batch of physical query plan rules and is executed when QueryExecution is requested for the optimized physical query plan (i.e. in the executedPlan phase of a query execution).

Technically, PlanSubqueries is just a Catalyst rule for transforming physical query plans, i.e. Rule[SparkPlan].

Applying PlanSubqueries Rule to Physical Plan (Executing PlanSubqueries) — `apply` Method



apply(plan: SparkPlan): SparkPlan


Note	`apply` is part of Rule Contract to apply a rule to a TreeNode, e.g. physical plan.

For every ScalarSubquery (SubqueryExpression) expression in the input physical plan, apply does the following:

Builds the optimized physical plan (aka executedPlan) of the subquery logical plan, i.e. creates a QueryExecution for the subquery logical plan and requests the optimized physical plan.
Plans the scalar subquery, i.e. creates a ScalarSubquery (ExecSubqueryExpression) expression with a new SubqueryExec physical operator (with the name subquery[id] and the optimized physical plan) and the ExprId.

ExtractPythonUDFs

2013-04-19, admin, views: 2665

ExtractPythonUDFs Physical Query Optimization

ExtractPythonUDFs is a physical query optimization (aka physical query preparation rule or simply preparation rule) that QueryExecution uses to optimize the physical plan of a structured query by extracting Python UDFs from a physical query plan (excluding FlatMapGroupsInPandasExec operators that it simply skips over).

Technically, ExtractPythonUDFs is just a Catalyst rule for transforming physical query plans, i.e. Rule[SparkPlan].

ExtractPythonUDFs is part of the preparations batch of physical query plan rules and is executed when QueryExecution is requested for the optimized physical query plan (i.e. in the executedPlan phase of a query execution).

Extracting Python UDFs from Physical Query Plan — `extract` Internal Method



extract(plan: SparkPlan): SparkPlan


extract…FIXME

Note	`extract` is used exclusively when `ExtractPythonUDFs` is requested to optimize a physical query plan.

`trySplitFilter` Internal Method



trySplitFilter(plan: SparkPlan): SparkPlan


trySplitFilter…FIXME

Note	`trySplitFilter` is used exclusively when `ExtractPythonUDFs` is requested to extract.

EnsureRequirements

2013-04-18, admin, views: 1883

EnsureRequirements Physical Query Optimization

EnsureRequirements is a physical query optimization (aka physical query preparation rule or simply preparation rule) that QueryExecution uses to optimize the physical plan of a structured query by transforming the following physical operators (up the plan tree):

Removes two adjacent ShuffleExchangeExec physical operators if the child partitioning scheme guarantees the parent’s partitioning
For other non-ShuffleExchangeExec physical operators, ensures partition distribution and ordering (possibly adding new physical operators, e.g. BroadcastExchangeExec and ShuffleExchangeExec for distribution or SortExec for sorting)

Technically, EnsureRequirements is just a Catalyst rule for transforming physical query plans, i.e. Rule[SparkPlan].

EnsureRequirements is part of the preparations batch of physical query plan rules and is executed when QueryExecution is requested for the optimized physical query plan (i.e. in the executedPlan phase of a query execution).

EnsureRequirements takes a SQLConf when created.



val q = ??? // FIXME
val sparkPlan = q.queryExecution.sparkPlan

import org.apache.spark.sql.execution.exchange.EnsureRequirements
val plan = EnsureRequirements(spark.sessionState.conf).apply(sparkPlan)


`createPartitioning` Internal Method

Caution

FIXME

`defaultNumPreShufflePartitions` Internal Method

Caution

FIXME

Enforcing Partition Requirements (Distribution and Ordering) of Physical Operator — `ensureDistributionAndOrdering` Internal Method



ensureDistributionAndOrdering(operator: SparkPlan): SparkPlan


Internally, ensureDistributionAndOrdering takes the following from the input physical operator:

required partition requirements for the children
required sort ordering per the required partition requirements per child
child physical plans

Note	The number of requirements for partitions and their sort ordering has to match the number and the order of the child physical plans.

ensureDistributionAndOrdering matches the operator’s required child distributions (requiredChildDistributions) against the children’s output partitioning and does the following (in that order):

If the child satisfies the requested distribution, the child is left unchanged
For BroadcastDistribution, the child becomes the child of BroadcastExchangeExec unary operator for broadcast hash joins
Any other pair of child and distribution leads to ShuffleExchangeExec unary physical operator (with proper partitioning for distribution and with spark.sql.shuffle.partitions number of partitions, i.e. 200 by default)

Note	ShuffleExchangeExec can appear in the physical plan when the children’s output partitioning cannot satisfy the physical operator’s required child distribution.
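The three cases above can be sketched in plain Scala without Spark on the classpath. This is a toy model only: `Scan`, `BroadcastExchange`, `ShuffleExchange`, `satisfies` and `ensure` are hypothetical names that merely echo the Spark operators, and the satisfaction check is deliberately simplistic.

```scala
object EnsureDistributionSketch {
  // Hypothetical stand-ins for required child distributions
  sealed trait Distribution
  case object UnspecifiedDistribution extends Distribution
  case object BroadcastDistribution extends Distribution
  final case class ClusteredDistribution(keys: Seq[String]) extends Distribution

  // Hypothetical stand-ins for physical operators
  sealed trait Plan
  final case class Scan(name: String, partitionedBy: Seq[String]) extends Plan
  final case class BroadcastExchange(child: Plan) extends Plan
  final case class ShuffleExchange(keys: Seq[String], numPartitions: Int, child: Plan) extends Plan

  // Assumed value of spark.sql.shuffle.partitions (200 by default)
  val shufflePartitions = 200

  // Toy check: does the child's output partitioning satisfy the required distribution?
  def satisfies(child: Plan, required: Distribution): Boolean = (child, required) match {
    case (_, UnspecifiedDistribution)                            => true
    case (Scan(_, parts), ClusteredDistribution(keys))           => parts.nonEmpty && parts == keys
    case (ShuffleExchange(k, _, _), ClusteredDistribution(keys)) => k == keys
    case (BroadcastExchange(_), BroadcastDistribution)           => true
    case _                                                       => false
  }

  // The three cases, in order: keep the child, broadcast it, or shuffle it
  def ensure(child: Plan, required: Distribution): Plan = required match {
    case _ if satisfies(child, required) => child
    case BroadcastDistribution           => BroadcastExchange(child)
    case ClusteredDistribution(keys)     => ShuffleExchange(keys, shufflePartitions, child)
    case UnspecifiedDistribution         => child
  }
}
```

With this model, a child already partitioned by the required keys is returned as is, while any other clustered requirement wraps the child in a shuffle with 200 partitions.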

If the input operator has multiple children and specifies child output distributions, then the children’s output partitionings have to be compatible.

If the children’s output partitionings are not all compatible, then…FIXME

ensureDistributionAndOrdering adds ExchangeCoordinator (only when adaptive query execution is enabled which is not by default).

Note	At this point in `ensureDistributionAndOrdering` the required child distributions are already handled.

ensureDistributionAndOrdering matches the operator’s required sort ordering of children (requiredChildOrderings) to the children’s output ordering and, if the orderings do not match, creates a SortExec unary physical operator as a new child.

In the end, ensureDistributionAndOrdering sets the new children for the input operator.
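The ordering step can be modelled in a few lines of plain Scala. The sketch below is an illustration and not the Spark implementation: `Scan`, `Sort` and `ensureOrdering` are hypothetical names, orderings are reduced to column names, and treating a required ordering as satisfied when it is a prefix of the child's output ordering is an assumption of this sketch.

```scala
object EnsureOrderingSketch {
  // Hypothetical stand-ins for physical operators; an ordering is a list of column names
  sealed trait Plan { def outputOrdering: Seq[String] }
  final case class Scan(name: String, outputOrdering: Seq[String]) extends Plan
  final case class Sort(order: Seq[String], child: Plan) extends Plan {
    def outputOrdering: Seq[String] = order
  }

  // Assumption: the required ordering matches when it is a prefix of the output ordering
  def orderingSatisfies(output: Seq[String], required: Seq[String]): Boolean =
    output.startsWith(required)

  // Leave the child alone when its ordering is good enough, otherwise add a sort on top
  def ensureOrdering(child: Plan, required: Seq[String]): Plan =
    if (orderingSatisfies(child.outputOrdering, required)) child
    else Sort(required, child)
}
```

An empty required ordering never adds a sort in this model, since the empty sequence is a prefix of every ordering.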

Note	`ensureDistributionAndOrdering` is used exclusively when `EnsureRequirements` is executed (i.e. applied to a physical plan).

Adding ExchangeCoordinator (Adaptive Query Execution) — `withExchangeCoordinator` Internal Method



withExchangeCoordinator(
  children: Seq[SparkPlan],
  requiredChildDistributions: Seq[Distribution]): Seq[SparkPlan]


withExchangeCoordinator adds ExchangeCoordinator to ShuffleExchangeExec operators if adaptive query execution is enabled (per spark.sql.adaptive.enabled property) and the partitioning scheme of the ShuffleExchangeExec operators supports ExchangeCoordinator.

Note	spark.sql.adaptive.enabled property is disabled by default.

Internally, withExchangeCoordinator checks if the input children operators support ExchangeCoordinator, which is the case when either of the following holds:

If there is at least one ShuffleExchangeExec operator, all children are either ShuffleExchangeExec with HashPartitioning or their output partitioning is HashPartitioning (even inside PartitioningCollection)
There are at least two children operators and the input requiredChildDistributions are all ClusteredDistribution

With adaptive query execution enabled (i.e. when spark.sql.adaptive.enabled configuration property is true) and the operator supporting ExchangeCoordinator, withExchangeCoordinator creates an ExchangeCoordinator and:

For every ShuffleExchangeExec, registers the ExchangeCoordinator
Creates HashPartitioning partitioning scheme with the default number of partitions to use when shuffling data for joins or aggregations (as spark.sql.shuffle.partitions which is 200 by default) and adds ShuffleExchangeExec to the final result (for the current physical operator)

Otherwise (when adaptive query execution is disabled or children do not support ExchangeCoordinator), withExchangeCoordinator returns the input children unchanged.

Note	`withExchangeCoordinator` is used exclusively for enforcing partition requirements of a physical operator.

`reorderJoinPredicates` Internal Method



reorderJoinPredicates(plan: SparkPlan): SparkPlan


reorderJoinPredicates…FIXME

Note	`reorderJoinPredicates` is used when…FIXME

SpecialLimits

2013-04-17, admin, views: 1560

SpecialLimits Execution Planning Strategy

SpecialLimits is an execution planning strategy that Spark Planner uses to FIXME.

Applying SpecialLimits Strategy to Logical Plan (Executing SpecialLimits) — `apply` Method



apply(plan: LogicalPlan): Seq[SparkPlan]


Note	`apply` is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply…FIXME

JoinSelection

2013-04-16, admin, views: 1853

JoinSelection Execution Planning Strategy

JoinSelection is an execution planning strategy that SparkPlanner uses to plan a Join logical operator to one of the supported join physical operators (as described by join physical operator selection requirements).

JoinSelection first considers join physical operators based on whether join keys are used. When join keys are used, JoinSelection considers BroadcastHashJoinExec, ShuffledHashJoinExec or SortMergeJoinExec operators. Without join keys, JoinSelection considers BroadcastNestedLoopJoinExec or CartesianProductExec.

Table 1. Join Physical Operator Selection Requirements (in the order of preference)
Physical Join Operator	Selection Requirements
BroadcastHashJoinExec	There are join keys and one of the following holds: (1) Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input `joinType` is positive) and the right join side can be broadcast; (2) Join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input `joinType` is positive) and the left join side can be broadcast
ShuffledHashJoinExec	There are join keys and one of the following holds: (1) spark.sql.join.preferSortMergeJoin is disabled, the join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input `joinType` is positive), canBuildLocalHashMap holds for the right join side and the right join side is much smaller than the left side; (2) spark.sql.join.preferSortMergeJoin is disabled, the join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input `joinType` is positive), canBuildLocalHashMap holds for the left join side and the left join side is much smaller than the right side; (3) Left join keys are not orderable
SortMergeJoinExec	Left join keys are orderable
BroadcastNestedLoopJoinExec	There are no join keys and one of the following holds: (1) Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input `joinType` is positive) and the right join side can be broadcast; (2) Join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input `joinType` is positive) and the left join side can be broadcast
CartesianProductExec	There are no join keys and the join type is CROSS or INNER
BroadcastNestedLoopJoinExec	No other join operator has matched

Note	`JoinSelection` uses ExtractEquiJoinKeys Scala extractor to destructure a `Join` logical operator.
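The order of preference in Table 1 can be read as a chain of checks. The following plain-Scala sketch assumes that every individual requirement has already been evaluated and collapsed into a boolean. `JoinFacts` and `select` are hypothetical names introduced for illustration and do not exist in Spark.

```scala
object JoinSelectionSketch {
  // Hypothetical summary of the checks from Table 1 for a single Join logical operator
  final case class JoinFacts(
    hasJoinKeys: Boolean,
    canBroadcastLeft: Boolean,     // canBuildLeft holds and the left side can be broadcast
    canBroadcastRight: Boolean,    // canBuildRight holds and the right side can be broadcast
    shuffledHashEligible: Boolean, // preferSortMergeJoin off, build side allowed, fits a local hash map, much smaller
    leftKeysOrderable: Boolean,
    crossOrInner: Boolean)         // join type is CROSS or INNER

  // Walks the rows of Table 1 top to bottom and returns the first operator that matches
  def select(f: JoinFacts): String =
    if (f.hasJoinKeys) {
      if (f.canBroadcastLeft || f.canBroadcastRight) "BroadcastHashJoinExec"
      else if (f.shuffledHashEligible || !f.leftKeysOrderable) "ShuffledHashJoinExec"
      else "SortMergeJoinExec"
    } else {
      if (f.canBroadcastLeft || f.canBroadcastRight) "BroadcastNestedLoopJoinExec"
      else if (f.crossOrInner) "CartesianProductExec"
      else "BroadcastNestedLoopJoinExec" // no other join operator has matched
    }
}
```

The sketch only mirrors the table. How each flag is computed is covered by the individual conditions described in the sections that follow.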

Is Left-Side Plan At Least 3 Times Smaller Than Right-Side Plan? — `muchSmaller` Internal Condition



muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean


muchSmaller condition holds when plan a is at least 3 times smaller than plan b.

Internally, muchSmaller calculates the estimated statistics for the input logical plans and compares their physical size in bytes (sizeInBytes).

Note	`muchSmaller` is used when `JoinSelection` checks join selection requirements for `ShuffledHashJoinExec` physical operator.
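As a minimal sketch of the condition, the comparison can be written over two sizes directly. The parameters below stand in for the sizeInBytes statistic of plans a and b, and `MuchSmallerSketch` is a hypothetical name.

```scala
object MuchSmallerSketch {
  // a is "much smaller" than b when three times the size of a still fits within the size of b
  def muchSmaller(aSizeInBytes: BigInt, bSizeInBytes: BigInt): Boolean =
    aSizeInBytes * 3 <= bSizeInBytes
}
```

For example, a 10-byte plan is much smaller than a 30-byte plan but not much smaller than a 29-byte one.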

`canBuildLocalHashMap` Internal Condition



canBuildLocalHashMap(plan: LogicalPlan): Boolean


canBuildLocalHashMap condition holds for a logical plan whose single partition is small enough to build a hash table, i.e. whose estimated size (sizeInBytes) is less than spark.sql.autoBroadcastJoinThreshold multiplied by spark.sql.shuffle.partitions.

Internally, canBuildLocalHashMap calculates the estimated statistics for the input logical plan and takes the size in bytes (sizeInBytes).

Note	`canBuildLocalHashMap` is used when `JoinSelection` checks join selection requirements for `ShuffledHashJoinExec` physical operator.
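A rough sketch of the check follows, with the two configuration properties passed in as plain parameters. `LocalHashMapSketch` is a hypothetical name, and the strict comparison is an assumption of this sketch.

```scala
object LocalHashMapSketch {
  // The whole plan must be smaller than (broadcast threshold x number of shuffle partitions),
  // so that a single partition is expected to fit under the broadcast threshold
  def canBuildLocalHashMap(
      sizeInBytes: BigInt,
      autoBroadcastJoinThreshold: Long,
      numShufflePartitions: Int): Boolean =
    sizeInBytes < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions
}
```

With a 10485760-byte threshold and 200 shuffle partitions, the limit works out to 2097152000 bytes for the whole plan.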

Can Logical Plan Be Broadcast? — `canBroadcast` Internal Condition



canBroadcast(plan: LogicalPlan): Boolean


canBroadcast is positive (i.e. true) when the size of the output of the input logical plan (aka sizeInBytes) is less than spark.sql.autoBroadcastJoinThreshold configuration property.

Note	spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760 bytes) by default.

Note	`canBroadcast` uses the total size statistic from Statistics of a logical operator.

Note	`canBroadcast` is used when `JoinSelection` is requested to canBroadcastBySizes and selects the build side per join type and total size statistic of join sides.
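The size check can be sketched as follows, with the threshold as a parameter that defaults to 10 MB. `CanBroadcastSketch` is a hypothetical name. The text does not say whether a plan exactly at the threshold qualifies, so the strict comparison here is an assumption, as is the guard against negative sizes.

```scala
object CanBroadcastSketch {
  // Assumed default of spark.sql.autoBroadcastJoinThreshold: 10 MB expressed in bytes
  val defaultAutoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  // A plan can be broadcast when its estimated size is known (non-negative) and below the threshold
  def canBroadcast(
      sizeInBytes: BigInt,
      threshold: Long = defaultAutoBroadcastJoinThreshold): Boolean =
    sizeInBytes >= 0 && sizeInBytes < threshold
}
```

A negative threshold makes the condition false for every plan in this model, which matches the common practice of setting the property to -1 to turn size-based broadcasting off.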

`canBroadcastByHints` Internal Method



canBroadcastByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): Boolean


canBroadcastByHints is positive (i.e. true) when either condition holds:

Join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input joinType is positive) and left operator’s broadcast hint flag is on
Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input joinType is positive) and right operator’s broadcast hint flag is on

Otherwise, canBroadcastByHints is negative (i.e. false).

Note	`canBroadcastByHints` is used when `JoinSelection` is requested to plan a Join logical operator (and considers a BroadcastHashJoinExec or a BroadcastNestedLoopJoinExec physical operator).

Selecting Build Side Per Join Type and Broadcast Hints — `broadcastSideByHints` Internal Method



broadcastSideByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): BuildSide


broadcastSideByHints computes buildLeft and buildRight flags:

buildLeft flag is positive (i.e. true) when the join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input joinType is positive) and the left operator’s broadcast hint flag is positive
buildRight flag is positive (i.e. true) when the join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input joinType is positive) and the right operator’s broadcast hint flag is positive

In the end, broadcastSideByHints gives the join side to broadcast.

Note	`broadcastSideByHints` is used when `JoinSelection` is requested to plan a Join logical operator (and considers a BroadcastHashJoinExec or a BroadcastNestedLoopJoinExec physical operator).

Choosing Join Side to Broadcast — `broadcastSide` Internal Method



broadcastSide(
  canBuildLeft: Boolean,
  canBuildRight: Boolean,
  left: LogicalPlan,
  right: LogicalPlan): BuildSide


broadcastSide gives the smaller side (BuildRight or BuildLeft) per total size when canBuildLeft and canBuildRight are both positive (i.e. true).

broadcastSide gives BuildRight when canBuildRight is positive.

broadcastSide gives BuildLeft when canBuildLeft is positive.

When none of the above conditions is met, broadcastSide gives the smaller side (BuildRight or BuildLeft) per total size (similarly to the first case when canBuildLeft and canBuildRight are both positive).

Note	`broadcastSide` is used when `JoinSelection` is requested to broadcastSideByHints, select the build side per join type and total size statistic of join sides, and execute (and considers a BroadcastNestedLoopJoinExec physical operator).
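The four cases of broadcastSide can be sketched in plain Scala. The sizes stand in for the total size statistic of the left and right logical plans, `BroadcastSideSketch` is a hypothetical name, and giving ties to the right side is an assumption of this sketch.

```scala
object BroadcastSideSketch {
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  def broadcastSide(
      canBuildLeft: Boolean,
      canBuildRight: Boolean,
      leftSizeInBytes: BigInt,
      rightSizeInBytes: BigInt): BuildSide = {
    // Assumption: when both sides have the same size, the right side is chosen
    def smallerSide: BuildSide =
      if (rightSizeInBytes <= leftSizeInBytes) BuildRight else BuildLeft

    if (canBuildRight && canBuildLeft) smallerSide // both sides allowed: pick the smaller one
    else if (canBuildRight) BuildRight
    else if (canBuildLeft) BuildLeft
    else smallerSide                               // neither flag set: fall back to the smaller one
  }
}
```

Note that a side that is the only one allowed is chosen even when it is the larger of the two.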

Checking If Join Type Allows For Left Join Side As Build Side — `canBuildLeft` Internal Condition



canBuildLeft(joinType: JoinType): Boolean


canBuildLeft is positive (i.e. true) for CROSS, INNER and RIGHT OUTER join types. Otherwise, canBuildLeft is negative (i.e. false).

Note	`canBuildLeft` is used when `JoinSelection` is requested to canBroadcastByHints, broadcastSideByHints, canBroadcastBySizes, broadcastSideBySizes and execute (when selecting a ShuffledHashJoinExec physical operator).

Checking If Join Type Allows For Right Join Side As Build Side — `canBuildRight` Internal Condition



canBuildRight(joinType: JoinType): Boolean


canBuildRight is positive (i.e. true) if the input join type is one of the following:

CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin

Otherwise, canBuildRight is negative (i.e. false).

Note	`canBuildRight` is used when `JoinSelection` is requested to canBroadcastByHints, broadcastSideByHints, canBroadcastBySizes, broadcastSideBySizes and execute (when selecting a ShuffledHashJoinExec physical operator).
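The canBuildLeft and canBuildRight conditions are plain pattern matches over the join type. The sketch below models the join types as simple case objects. These are hypothetical stand-ins and not Spark's own JoinType hierarchy, in which ExistenceJoin, for one, carries an extra attribute.

```scala
object BuildSideRulesSketch {
  // Simplified, hypothetical join types
  sealed trait JoinType
  case object Cross extends JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType
  case object ExistenceJoin extends JoinType

  // The left side can be the build side for CROSS, INNER and RIGHT OUTER joins only
  def canBuildLeft(joinType: JoinType): Boolean = joinType match {
    case Cross | Inner | RightOuter => true
    case _                          => false
  }

  // The right side can be the build side for every listed join type except RIGHT OUTER
  def canBuildRight(joinType: JoinType): Boolean = joinType match {
    case Cross | Inner | LeftAnti | LeftOuter | LeftSemi | ExistenceJoin => true
    case _                                                               => false
  }
}
```

Only CROSS and INNER joins allow either side as the build side, and a FULL OUTER join (which neither list mentions) allows none in this model.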

Checking If Join Type and Total Size Statistic of Join Sides Allow for Broadcast Join — `canBroadcastBySizes` Internal Method



canBroadcastBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): Boolean


canBroadcastBySizes is positive (i.e. true) when either condition holds:

Join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input joinType is positive) and left operator can be broadcast per total size statistic
Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input joinType is positive) and right operator can be broadcast per total size statistic

Otherwise, canBroadcastBySizes is negative (i.e. false).

Note	`canBroadcastBySizes` is used when `JoinSelection` is requested to plan a Join logical operator (and considers a BroadcastHashJoinExec or a BroadcastNestedLoopJoinExec physical operator).

Selecting Build Side Per Join Type and Total Size Statistic of Join Sides — `broadcastSideBySizes` Internal Method



broadcastSideBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): BuildSide


broadcastSideBySizes computes buildLeft and buildRight flags:

buildLeft flag is positive (i.e. true) when the join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input joinType is positive) and left operator can be broadcast per total size statistic
buildRight flag is positive (i.e. true) when the join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input joinType is positive) and right operator can be broadcast per total size statistic

In the end, broadcastSideBySizes gives the join side to broadcast.

Note	`broadcastSideBySizes` is used when `JoinSelection` is requested to plan a Join logical operator (and considers a BroadcastHashJoinExec or a BroadcastNestedLoopJoinExec physical operator).

Applying JoinSelection Strategy to Logical Plan (Executing JoinSelection) — `apply` Method



apply(plan: LogicalPlan): Seq[SparkPlan]


Note	`apply` is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply uses ExtractEquiJoinKeys Scala extractor to destructure the input logical plan.

Considering BroadcastHashJoinExec Physical Operator

apply gives a BroadcastHashJoinExec physical operator if the plan should be broadcast per join type and broadcast hints used (for the join type and left or right side of the join). apply selects the build side per join type and broadcast hints.

apply gives a BroadcastHashJoinExec physical operator if the plan should be broadcast per join type and size of join sides (for the join type and left or right side of the join). apply selects the build side per join type and total size statistic of join sides.

Considering ShuffledHashJoinExec Physical Operator

apply gives…FIXME

Considering SortMergeJoinExec Physical Operator

apply gives…FIXME

Considering BroadcastNestedLoopJoinExec Physical Operator

apply gives…FIXME

Considering CartesianProductExec Physical Operator

apply gives…FIXME

InMemoryScans

2013-04-15, admin, views: 2428

InMemoryScans Execution Planning Strategy

InMemoryScans is an execution planning strategy that plans InMemoryRelation logical operators to InMemoryTableScanExec physical operators.



val spark: SparkSession = ...
// query uses InMemoryRelation logical operator
val q = spark.range(5).cache
val plan = q.queryExecution.optimizedPlan
scala> println(plan.numberedTreeString)
00 InMemoryRelation [id#208L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
01    +- *Range (0, 5, step=1, splits=8)

// InMemoryScans is an internal class of SparkStrategies
import spark.sessionState.planner.InMemoryScans
val physicalPlan = InMemoryScans.apply(plan).head
scala> println(physicalPlan.numberedTreeString)
00 InMemoryTableScan [id#208L]
01    +- InMemoryRelation [id#208L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
02          +- *Range (0, 5, step=1, splits=8)


InMemoryScans is part of the standard execution planning strategies of SparkPlanner.

Applying InMemoryScans Strategy to Logical Plan (Executing InMemoryScans) — `apply` Method



apply(plan: LogicalPlan): Seq[SparkPlan]


Note	`apply` is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply requests PhysicalOperation extractor to destructure the input logical plan to an InMemoryRelation logical operator.

In the end, apply calls pruneFilterProject with a new InMemoryTableScanExec physical operator.

HiveTableScans

2013-04-14, admin, views: 1903

HiveTableScans Execution Planning Strategy

HiveTableScans is an execution planning strategy (of Hive-specific SparkPlanner) that resolves HiveTableRelation.

Applying HiveTableScans Strategy to Logical Plan (Executing HiveTableScans) — `apply` Method



apply(plan: LogicalPlan): Seq[SparkPlan]


Note	`apply` is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply…FIXME

FileSourceStrategy

2013-04-13, admin, views: 1871

FileSourceStrategy Execution Planning Strategy for LogicalRelations with HadoopFsRelation

FileSourceStrategy is an execution planning strategy that plans scans over collections of files (possibly partitioned or bucketed).

FileSourceStrategy is part of predefined strategies of the Spark Planner.



import org.apache.spark.sql.execution.datasources.FileSourceStrategy

// Enable INFO logging level to see the details of the strategy
val logger = FileSourceStrategy.getClass.getName.replace("$", "")
import org.apache.log4j.{Level, Logger}
Logger.getLogger(logger).setLevel(Level.INFO)

// Create a bucketed data source table
val tableName = "bucketed_4_id"
spark
  .range(100)
  .write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable(tableName)
val q = spark.table(tableName)
val plan = q.queryExecution.optimizedPlan

val executionPlan = FileSourceStrategy(plan).head

scala> println(executionPlan.numberedTreeString)
00 FileScan parquet default.bucketed_4_id[id#140L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/apps/spark-2.3.0-bin-hadoop2.7/spark-warehouse/bucketed_4..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

import org.apache.spark.sql.execution.FileSourceScanExec
val scan = executionPlan.collectFirst { case fsse: FileSourceScanExec => fsse }.get

scala> :type scan
org.apache.spark.sql.execution.FileSourceScanExec


FileSourceScanExec supports Bucket Pruning for LogicalRelations over HadoopFsRelation with a bucketing specification that meets both of the following:

There is exactly one bucketing column
The number of buckets is greater than 1
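The two requirements can be expressed as a single pattern match. In the sketch below, `BucketSpec` is a hypothetical, cut-down case class (number of buckets and bucketing column names only) and `supportsBucketPruning` is a name introduced for illustration.

```scala
object BucketPruningSketch {
  // Hypothetical, simplified bucketing specification
  final case class BucketSpec(numBuckets: Int, bucketColumnNames: Seq[String])

  // Bucket pruning applies only with exactly one bucketing column and more than one bucket
  def supportsBucketPruning(spec: Option[BucketSpec]): Boolean = spec match {
    case Some(BucketSpec(numBuckets, Seq(_))) => numBuckets > 1
    case _                                    => false
  }
}
```

A table bucketed into 4 buckets by the single id column (like bucketed_4_id) passes the check, while a table with two bucketing columns, a single bucket, or no bucketing specification does not.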



// Using the table created above
// There is exactly one bucketing column, i.e. id
// The number of buckets is greater than 1, i.e. 4
val tableName = "bucketed_4_id"
val q = spark.table(tableName).where($"id" isin (50, 90))
val qe = q.queryExecution
val plan = qe.optimizedPlan
scala> println(plan.numberedTreeString)
00 Filter id#7L IN (50,90)
01 +- Relation[id#7L] parquet

import org.apache.spark.sql.execution.datasources.FileSourceStrategy

// Enable INFO logging level to see the details of the strategy
val logger = FileSourceStrategy.getClass.getName.replace("$", "")
import org.apache.log4j.{Level, Logger}
Logger.getLogger(logger).setLevel(Level.INFO)

scala> val executionPlan = FileSourceStrategy(plan).head
18/11/18 17:56:53 INFO FileSourceStrategy: Pruning directories with:
18/11/18 17:56:53 INFO FileSourceStrategy: Pruned 2 out of 4 buckets.
18/11/18 17:56:53 INFO FileSourceStrategy: Post-Scan Filters: id#7L IN (50,90)
18/11/18 17:56:53 INFO FileSourceStrategy: Output Data Schema: struct<id: bigint>
18/11/18 17:56:53 INFO FileSourceScanExec: Pushed Filters: In(id, [50,90])
executionPlan: org.apache.spark.sql.execution.SparkPlan = ...

scala> println(executionPlan.numberedTreeString)
00 Filter id#7L IN (50,90)
01 +- FileScan parquet default.bucketed_4_id[id#7L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [In(id, [50,90])], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 2 out of 4


Tip

Enable INFO logging level for org.apache.spark.sql.execution.datasources.FileSourceStrategy logger to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.sql.execution.datasources.FileSourceStrategy=INFO


Refer to Logging.

`collectProjectsAndFilters` Method



collectProjectsAndFilters(plan: LogicalPlan):
  (Option[Seq[NamedExpression]], Seq[Expression], LogicalPlan, Map[Attribute, Expression])


collectProjectsAndFilters is a pattern used to destructure a LogicalPlan that can be a Project or a Filter. Any other LogicalPlan gives an all-empty response.
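
To illustrate the destructuring, here is a simplified sketch over a toy plan type. The `Plan`, `Relation`, `Filter` and `Project` classes and the `collect` function are hypothetical stand-ins, not Catalyst classes, and the alias map from the real signature is left out for brevity.

```scala
object CollectSketch {
  // Toy logical operators (hypothetical, not Catalyst classes)
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // Returns (optional projection, filter conditions, remaining operator)
  def collect(plan: Plan): (Option[Seq[String]], Seq[String], Plan) = plan match {
    case Project(columns, child) =>
      // The outermost projection wins; filters and the leaf come from below
      val (_, filters, leaf) = collect(child)
      (Some(columns), filters, leaf)
    case Filter(condition, child) =>
      val (projects, filters, leaf) = collect(child)
      (projects, filters :+ condition, leaf)
    case other =>
      // The "all-empty" answer for any other operator
      (None, Nil, other)
  }
}
```

With this sketch, `collect(Project(Seq("id"), Filter("id > 1", Relation("t"))))` yields the projection `Seq("id")`, the single filter `"id > 1"` and the leaf `Relation("t")`, while a bare `Relation("t")` yields no projection and no filters.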

Applying FileSourceStrategy Strategy to Logical Plan (Executing FileSourceStrategy) — `apply` Method



apply(plan: LogicalPlan): Seq[SparkPlan]


Note	`apply` is part of GenericStrategy Contract to generate a collection of SparkPlans for a given logical plan.

apply uses PhysicalOperation Scala extractor object to destructure a logical query plan into a tuple of projection and filter expressions together with a leaf logical operator.

apply only works with logical plans that are actually a LogicalRelation with a HadoopFsRelation (possibly as a child of Project and Filter logical operators).

apply computes partitionKeyFilters expression set with the filter expressions whose references are a subset of the partition columns (partitionSchema) of the HadoopFsRelation.
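
The selection of partition key filters can be sketched in plain Scala as follows. `Predicate` and `partitionKeyFilters` below are hypothetical and model an expression only by its text and the column names it references; this is an illustration under those assumptions, not the strategy's actual code.

```scala
object PartitionFilterSketch {
  // Hypothetical predicate: its text and the column names it references
  final case class Predicate(sql: String, references: Set[String])

  // Keep only the predicates that reference partition columns exclusively
  def partitionKeyFilters(
      filters: Seq[Predicate],
      partitionColumns: Set[String]): Seq[Predicate] =
    filters.filter(_.references.subsetOf(partitionColumns))
}
```

A predicate that mixes a partition column with a data column (e.g. `year = id` on a table partitioned by `year`) is not selected, since it cannot be answered from the directory layout alone.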

apply prints out the following INFO message to the logs:



Pruning directories with: [partitionKeyFilters]


apply computes afterScanFilters predicate expressions that should be evaluated after the scan.

apply prints out the following INFO message to the logs:



Post-Scan Filters: [afterScanFilters]


apply computes readDataColumns attributes that are the required attributes except the partition columns.
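
As a rough illustration of that computation, the sketch below works on plain column names. The `readDataColumns` function and its parameters are hypothetical and stand in for the attribute sets that the strategy works with.

```scala
object ReadColumnsSketch {
  // Columns to read from data files: required by the query, minus partition columns
  def readDataColumns(
      dataColumns: Seq[String],
      requiredColumns: Set[String],
      partitionColumns: Set[String]): Seq[String] =
    dataColumns
      .filter(requiredColumns.contains)       // only what filters and projections need
      .filterNot(partitionColumns.contains)   // partition values are not read from data files
}
```

For data columns `id`, `name`, `year`, required columns `id` and `year`, and partition column `year`, only `id` remains to be read from the files.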

apply prints out the following INFO message to the logs:



Output Data Schema: [outputSchema]


apply creates a FileSourceScanExec physical operator.

If there are any afterScanFilters predicate expressions, apply creates a FilterExec physical operator with them and the FileSourceScanExec operator.

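If the output of the FilterExec physical operator (or of the FileSourceScanExec operator when no FilterExec was created) is different from the projects expressions, apply creates a ProjectExec physical operator with the projects expressions and the FilterExec or the FileSourceScanExec operator as the child.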
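
The last three steps (scan, optional filter, optional projection) can be summarised with a toy model. The operator classes below are hypothetical simplifications that only reuse the names from the text, and `assemble` is an assumed helper name, so treat this as a sketch of the wrapping logic only.

```scala
object AssembleSketch {
  // Toy physical operators (hypothetical, not Spark classes)
  sealed trait Exec { def output: Seq[String] }
  final case class ScanExec(output: Seq[String]) extends Exec
  final case class FilterExec(condition: String, child: Exec) extends Exec {
    def output: Seq[String] = child.output // a filter does not change columns
  }
  final case class ProjectExec(output: Seq[String], child: Exec) extends Exec

  def assemble(
      scan: ScanExec,
      afterScanFilters: Seq[String],
      projects: Seq[String]): Exec = {
    // Combine all post-scan predicates into one condition, if there are any
    val condition = afterScanFilters.reduceOption((l, r) => s"($l AND $r)")
    val withFilter: Exec = condition.map(c => FilterExec(c, scan)).getOrElse(scan)
    // Add a projection only when the output differs from the requested columns
    if (withFilter.output == projects) withFilter else ProjectExec(projects, withFilter)
  }
}
```

With no post-scan filters and a matching output the scan is returned as-is, which mirrors the single FileScan node in the first plan shown earlier; with a filter on `id` the result is a filter node on top of the scan, as in the `isin (50, 90)` example.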


