QueryPlanner — Converting Logical Plan to Physical Trees
QueryPlanner plans a logical plan for execution, i.e. converts a logical plan to one or more physical plans using strategies.

Note: QueryPlanner generates at least one physical plan.
QueryPlanner's main method is plan that defines the extension points, i.e. strategies, collectPlaceholders and prunePlans.

QueryPlanner is part of the Catalyst Framework.
QueryPlanner Contract
```scala
abstract class QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] {
  def collectPlaceholders(plan: PhysicalPlan): Seq[(PhysicalPlan, LogicalPlan)]
  def prunePlans(plans: Iterator[PhysicalPlan]): Iterator[PhysicalPlan]
  def strategies: Seq[GenericStrategy[PhysicalPlan]]
}
```
Method | Description
---|---
`strategies` | Collection of GenericStrategy planning strategies. Used exclusively as an extension point in plan.
`collectPlaceholders` | Collection of "placeholder" physical plans and the corresponding logical plans. Used exclusively as an extension point in plan. Overridden in SparkPlanner.
`prunePlans` | Prunes physical plans (e.g. bad or somehow incorrect plans). Used exclusively as an extension point in plan.
Planning Logical Plan — plan Method

```scala
plan(plan: LogicalPlan): Iterator[PhysicalPlan]
```
plan converts the input logical plan to zero or more PhysicalPlan plans.

Internally, plan applies the planning strategies to the input plan, one by one, collecting all the generated physical plans as the plan candidates.

plan then walks over the plan candidates to collect placeholders. If a plan does not contain a placeholder, the plan is returned as is. Otherwise, plan walks over the placeholders (as pairs of a PhysicalPlan and an unplanned logical plan) and (recursively) plans the child logical plan. plan then replaces the placeholders with the planned child logical plans.

In the end, plan prunes "bad" physical plans.
Note: plan is used exclusively (through the concrete SparkPlanner) when a QueryExecution is requested for a physical plan.
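To see these extension points in action, here is a minimal sketch (not part of Spark itself) of a custom planning strategy registered through the experimental methods of a SparkSession. A strategy pattern-matches on logical operators and returns physical operators, or an empty collection when it cannot plan a given operator:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy for illustration only: it plans nothing itself and
// returns Nil, which means "pass this operator on to the other strategies".
object MyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

// Register the strategy with the SparkSession so the SparkPlanner includes it
spark.experimental.extraStrategies = Seq(MyStrategy)
```

Once registered, the concrete SparkPlanner includes the strategy in strategies the next time a query is planned.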
Catalyst Rule — Named Transformation of TreeNodes
Rule is a named transformation that can be applied to (i.e. executed on, or used to transform) a TreeNode to produce a new TreeNode.
```scala
package org.apache.spark.sql.catalyst.rules

abstract class Rule[TreeType <: TreeNode[_]] {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def apply(plan: TreeType): TreeType
}
```

Note: TreeType is the type of the TreeNode implementation that a Rule can be applied to, i.e. LogicalPlan, SparkPlan or Expression or a combination thereof.
Rule has a rule name (that is the class name of a rule).

```scala
ruleName: String
```
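As a minimal sketch (the rule below is hypothetical and not one of Spark's built-in rules), a custom Rule[LogicalPlan] uses TreeNode's transform to rewrite a logical plan and can be registered as an extra optimization of a SparkSession:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that removes Filter operators whose condition is the literal true
object RemoveTrueFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, _), child) => child
  }
}

// Register the rule so the Optimizer executes it as an extra optimization
spark.experimental.extraOptimizations = Seq(RemoveTrueFilters)
```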
Rule is mainly used to create a batch of rules for a RuleExecutor.

The other notable use cases of Rule are as follows:
- When ExperimentalMethods is requested for extraOptimizations
- When BaseSessionStateBuilder is requested for customResolutionRules, customPostHocResolutionRules, customOperatorOptimizationRules, and the Optimizer
- When Analyzer is requested for extendedResolutionRules and postHocResolutionRules (see BaseSessionStateBuilder and HiveSessionStateBuilder)
- When Optimizer is requested for extendedOperatorOptimizationRules
- When QueryExecution is requested for preparations
RuleExecutor Contract — Tree Transformation Rule Executor
RuleExecutor is the base of rule executors that are responsible for executing a collection of batches (of rules) to transform a TreeNode.

```scala
package org.apache.spark.sql.catalyst.rules

abstract class RuleExecutor[TreeType <: TreeNode[_]] {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  protected def batches: Seq[Batch]
}
```
Property | Description
---|---
`batches` | Collection of rule batches, i.e. sequences of rules with a name and a strategy that indicates the maximum number of executions.
Note: TreeType is the type of the TreeNode implementation that a RuleExecutor can be executed on, i.e. LogicalPlan, SparkPlan, Expression or a combination thereof.
Applying Rule Batches to TreeNode — execute Method

```scala
execute(plan: TreeType): TreeType
```

execute iterates over rule batches and applies rules sequentially to the input plan.

execute tracks the number of iterations and the time of executing each rule (with a plan).
When a rule changes a plan, you should see the following TRACE message in the logs:
```
TRACE HiveSessionStateBuilder$$anon$1:
=== Applying Rule [ruleName] ===
[currentAndModifiedPlansSideBySide]
```
When the number of iterations reaches the maximum number of iterations allowed by the batch's Strategy, execute stops and prints out the following WARN message to the logs:

```
WARN HiveSessionStateBuilder$$anon$1: Max iterations ([iteration]) reached for batch [batchName]
```
When the plan has not changed (after applying the rules), you should see the following TRACE message in the logs and execute moves on to applying the rules in the next batch. This moment is called the fixed point (i.e. when the execution converges).

```
TRACE HiveSessionStateBuilder$$anon$1: Fixed point reached for batch [batchName] after [iteration] iterations.
```
After the batch finishes, if the plan has been changed by the rules, you should see the following DEBUG message in the logs:
```
DEBUG HiveSessionStateBuilder$$anon$1:
=== Result of Batch [batchName] ===
[currentAndModifiedPlansSideBySide]
```
Otherwise, when the rules made no changes to the plan, you should see the following TRACE message in the logs:

```
TRACE HiveSessionStateBuilder$$anon$1: Batch [batchName] has no effect.
```
Batch Execution Strategy
Strategy is the base of the batch execution strategies that indicate the maximum number of executions (aka maxIterations).

```scala
abstract class Strategy {
  def maxIterations: Int
}
```
Strategy | Description
---|---
`Once` | A strategy that runs only once (with maxIterations as 1)
`FixedPoint` | A strategy that runs until fix point (i.e. converge) or maxIterations times, whichever comes first
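The following is a minimal sketch (with a hypothetical no-op rule) of a custom RuleExecutor that defines one batch per strategy:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// Hypothetical rule that leaves the plan unchanged, for illustration only
object NoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object MyExecutor extends RuleExecutor[LogicalPlan] {
  // One batch that runs once and one that runs until the plan stops changing
  // (or 100 iterations have been reached, whichever comes first)
  override protected def batches: Seq[Batch] = Seq(
    Batch("Run once", Once, NoopRule),
    Batch("Until fixed point", FixedPoint(100), NoopRule)
  )
}
```

Calling MyExecutor.execute(plan) then applies the batches in order as described above.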
isPlanIntegral Method

```scala
isPlanIntegral(plan: TreeType): Boolean
```

isPlanIntegral simply returns true.

Note: isPlanIntegral is used exclusively when RuleExecutor is requested to execute.
QueryPlan — Structured Query Plan
QueryPlan is part of Catalyst to build a tree of relational operators of a structured query.

Scala-specific, QueryPlan is an abstract class that is the base class of LogicalPlan and SparkPlan (for logical and physical plans, respectively).

A QueryPlan has output attributes (that serve as the base for the schema), a collection of expressions and a schema.
QueryPlan has statePrefix that is used when displaying a plan, with ! to indicate an invalid plan and ' to indicate an unresolved plan.

A QueryPlan is invalid if there are missing input attributes and children subnodes are non-empty.

A QueryPlan is unresolved if the column names have not been verified and column types have not been looked up in the Catalog.

A QueryPlan has zero, one or more Catalyst expressions.
Note: QueryPlan is a tree of operators that have a tree of expressions.
QueryPlan has a references property with the attributes that appear in the expressions of this operator.
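The following sketch inspects these QueryPlan properties on the analyzed logical plan of a simple query (the gid column is arbitrary and for illustration only):

```scala
import spark.implicits._

val plan = spark.range(5).select(($"id" % 2) as "gid").queryExecution.analyzed

plan.output       // output attributes (e.g. gid)
plan.outputSet    // the same attributes as an AttributeSet
plan.expressions  // all Catalyst expressions of the operator
plan.references   // attributes that appear in the expressions
plan.schema       // StructType built from the output attributes
```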
QueryPlan Contract
```scala
abstract class QueryPlan[T] extends TreeNode[T] {
  def output: Seq[Attribute]
  def validConstraints: Set[Expression]
  // FIXME
}
```
Method | Description
---|---
`output` | Attribute expressions
Transforming Expressions — transformExpressions Method

```scala
transformExpressions(rule: PartialFunction[Expression, Expression]): this.type
```

transformExpressions simply executes transformExpressionsDown with the input rule.

Note: transformExpressions is used when…FIXME
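A minimal sketch of transformExpressions that rewrites the integer literals in the expressions of an analyzed logical plan (the rewrite itself is arbitrary and for illustration only):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

val plan = spark.range(10).where($"id" === 4).queryExecution.analyzed

// Bump every integer literal by one
val rewritten = plan.transformExpressions {
  case Literal(value: Int, IntegerType) => Literal(value + 1, IntegerType)
}
```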
Transforming Expressions — transformExpressionsDown Method

```scala
transformExpressionsDown(rule: PartialFunction[Expression, Expression]): this.type
```

transformExpressionsDown applies the rule to each expression in the query operator.

Note: transformExpressionsDown is used when…FIXME
Applying Transformation Function to Each Expression in Query Operator — mapExpressions Method

```scala
mapExpressions(f: Expression => Expression): this.type
```

mapExpressions…FIXME

Note: mapExpressions is used when…FIXME
Output Schema Attribute Set — outputSet Property

```scala
outputSet: AttributeSet
```

outputSet simply returns an AttributeSet for the output schema attributes.

Note: outputSet is used when…FIXME
Missing Input Attributes — missingInput Property

```scala
def missingInput: AttributeSet
```

missingInput are the attributes that are referenced in expressions but not provided by this node's children (as inputSet) and are not produced by this node (as producedAttributes).
Output Schema — schema Property

You can request the schema of a QueryPlan using schema, which builds a StructType from the output attributes.
```scala
// the query
val dataset = spark.range(3)

scala> dataset.queryExecution.analyzed.schema
res6: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```
Output Schema Attributes — output Property

```scala
output: Seq[Attribute]
```

output is a collection of Catalyst attribute expressions that represent the result of a projection in a query that is later used to build the output schema.

Note: The output property is also called the output schema or result schema.

```scala
val q = spark.range(3)

scala> q.queryExecution.analyzed.output
res0: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = List(id#0L)

scala> q.queryExecution.withCachedData.output
res1: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = List(id#0L)

scala> q.queryExecution.optimizedPlan.output
res2: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = List(id#0L)

scala> q.queryExecution.sparkPlan.output
res3: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = List(id#0L)

scala> q.queryExecution.executedPlan.output
res4: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = List(id#0L)
```
Tip: You can build a StructType from the output collection of attributes (as the schema property above does).
Simple (Basic) Description with State Prefix — simpleString Method

```scala
simpleString: String
```

Note: simpleString is part of the TreeNode Contract for the simple text description of a tree node.

simpleString adds a state prefix to the node's simple text description.
State Prefix — statePrefix Method

```scala
statePrefix: String
```

Internally, statePrefix gives ! (exclamation mark) when the node is invalid, i.e. missingInput is not empty and the node is a parent node. Otherwise, statePrefix gives an empty string.
Note: statePrefix is used exclusively when QueryPlan is requested for the simple text node description.
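As a quick illustration (a sketch to run in the Spark shell), the not-yet-analyzed logical plan of a query prints its operators with the ' prefix, while the analyzed plan does not:

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(1).select(col("id"))

// Unresolved operators carry the ' state prefix (e.g. 'Project)
println(df.queryExecution.logical.treeString)

// After analysis the operators are resolved and the prefix disappears
println(df.queryExecution.analyzed.treeString)
```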
Transforming All Expressions — transformAllExpressions Method

```scala
transformAllExpressions(rule: PartialFunction[Expression, Expression]): this.type
```

transformAllExpressions…FIXME

Note: transformAllExpressions is used when…FIXME
Simple (Basic) Description with State Prefix — verboseString Method

```scala
verboseString: String
```

Note: verboseString is part of the TreeNode Contract to…FIXME.

verboseString simply returns the simple (basic) description with state prefix.
innerChildren Method

```scala
innerChildren: Seq[QueryPlan[_]]
```

Note: innerChildren is part of the TreeNode Contract to…FIXME.

innerChildren simply returns the subqueries.
TreeNode — Node in Catalyst Tree
```scala
package org.apache.spark.sql.catalyst.trees

abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
  self: BaseType =>

  // only required properties (vals and methods) that have no implementation
  // the others follow
  def children: Seq[BaseType]
  def verboseString: String
}
```

TreeNode is a recursive data structure that can have one or many children that are again TreeNodes.
Tip: Read up on the <: type operator in Scala in Upper Type Bounds.
Scala-specific, TreeNode is an abstract class that is the base class of the Catalyst Expression and QueryPlan abstract classes.

TreeNode therefore allows for building entire trees of TreeNodes, e.g. generic query plans with concrete logical and physical operators that both use Catalyst expressions (which are TreeNodes again).
Note: Spark SQL uses TreeNode for query plans and Catalyst expressions that can further be used together to build more advanced trees, e.g. Catalyst expressions can have query plans as subquery expressions.
TreeNode can itself be a node in a tree or a collection of nodes, i.e. itself and the children nodes. Not only does TreeNode come with the methods that you may have used in the Scala Collection API (e.g. map, flatMap, collect, collectFirst, foreach), but also specialized ones for more advanced tree manipulation, e.g. mapChildren, transform, transformDown, transformUp, foreachUp, numberedTreeString, p, asCode, prettyJson.
withNewChildren Method

```scala
withNewChildren(newChildren: Seq[BaseType]): BaseType
```

withNewChildren…FIXME

Note: withNewChildren is used when…FIXME
Simple Node Description — simpleString Method

```scala
simpleString: String
```

simpleString gives a simple one-line description of a TreeNode.

Note: simpleString is used when TreeNode is requested for argString (of child nodes) and tree text representation (with verbose flag off).
Numbered Text Representation — numberedTreeString Method

```scala
numberedTreeString: String
```

numberedTreeString adds numbers to the text representation of all the nodes.
Getting n-th TreeNode in Tree (for Interactive Debugging) — apply Method

```scala
apply(number: Int): TreeNode[_]
```

apply gives the number-th tree node in a tree.

Note: apply can be used for interactive debugging.

Internally, apply gets the node at the number position or null.
Getting n-th BaseType in Tree (for Interactive Debugging) — p Method

```scala
p(number: Int): BaseType
```

p gives the number-th tree node in a tree as BaseType for interactive debugging.

Note: p can be used for interactive debugging.
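A minimal sketch of using numberedTreeString together with apply and p for interactive debugging in the Spark shell:

```scala
import spark.implicits._

val plan = spark.range(10).where($"id" === 4).queryExecution.executedPlan

// Print the tree with node numbers first...
println(plan.numberedTreeString)

// ...then pick the n-th node, either untyped or as the BaseType (SparkPlan here)
val anyNode   = plan(1)    // TreeNode[_]
val sparkPlan = plan.p(1)  // SparkPlan
```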
Text Representation — toString Method

```scala
toString: String
```

Note: toString is part of Java's Object contract for the string representation of an object, e.g. TreeNode.

toString simply returns the text representation of all nodes in the tree.
Text Representation of All Nodes in Tree — treeString Method

```scala
treeString: String  // (1)
treeString(verbose: Boolean, addSuffix: Boolean = false): String
```

1. Turns the verbose flag on

treeString gives the string representation of all the nodes in the TreeNode.

```scala
import org.apache.spark.sql.{functions => f}
val q = spark.range(10).withColumn("rand", f.rand())
val executedPlan = q.queryExecution.executedPlan

val output = executedPlan.treeString(verbose = true)

scala> println(output)
*(1) Project [id#0L, rand(6790207094253656854) AS rand#2]
+- *(1) Range (0, 10, step=1, splits=8)
```
Verbose Description with Suffix — verboseStringWithSuffix Method

```scala
verboseStringWithSuffix: String
```

verboseStringWithSuffix simply returns the verbose description.

Note: verboseStringWithSuffix is used exclusively when TreeNode is requested to generateTreeString (with verbose and addSuffix flags enabled).
Generating Text Representation of Inner and Regular Child Nodes — generateTreeString Method

```scala
generateTreeString(
  depth: Int,
  lastChildren: Seq[Boolean],
  builder: StringBuilder,
  verbose: Boolean,
  prefix: String = "",
  addSuffix: Boolean = false): StringBuilder
```

Internally, generateTreeString appends the following node descriptions per the verbose and addSuffix flags:
- verbose description with suffix when both are enabled (i.e. the verbose and addSuffix flags are both true)
- verbose description when verbose is enabled (i.e. verbose is true and addSuffix is false)
- simple description when verbose is disabled (i.e. verbose is false)
In the end, generateTreeString calls itself recursively for the innerChildren and the child nodes.

Note: generateTreeString is used exclusively when TreeNode is requested for the text representation of all nodes in the tree.
Inner Child Nodes — innerChildren Method

```scala
innerChildren: Seq[TreeNode[_]]
```

innerChildren returns the inner nodes that should be shown as an inner nested tree of this node.

innerChildren simply returns an empty collection of TreeNodes.

Note: innerChildren is used when TreeNode is requested to generate the text representation of inner and regular child nodes, allChildren and getNodeNumbered.
allChildren Property

```scala
allChildren: Set[TreeNode[_]]
```

Note: allChildren is a Scala lazy value which is computed once when accessed and cached afterwards.

allChildren…FIXME

Note: allChildren is used when…FIXME
getNodeNumbered Internal Method

```scala
getNodeNumbered(number: MutableInt): Option[TreeNode[_]]
```

getNodeNumbered…FIXME

Note: getNodeNumbered is used when…FIXME
foreach Method

```scala
foreach(f: BaseType => Unit): Unit
```

foreach applies the input function f to itself (this) first and then (recursively) to the children.
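A minimal sketch that uses foreach to count the nodes of a plan:

```scala
// Count the nodes of the analyzed logical plan of a simple query
val plan = spark.range(5).queryExecution.analyzed

var numberOfNodes = 0
plan.foreach(_ => numberOfNodes += 1)
println(s"The plan has $numberOfNodes node(s)")
```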
collectFirst Method

```scala
collectFirst[B](pf: PartialFunction[BaseType, B]): Option[B]
```

collectFirst…FIXME
transform Method

```scala
transform(rule: PartialFunction[BaseType, BaseType]): BaseType
```

transform…FIXME
Transforming Nodes Downwards — transformDown Method

```scala
transformDown(rule: PartialFunction[BaseType, BaseType]): BaseType
```

transformDown…FIXME
transformUp Method

```scala
transformUp(rule: PartialFunction[BaseType, BaseType]): BaseType
```

transformUp…FIXME
nodeName Method

```scala
nodeName: String
```

nodeName returns the name of the class with the Exec suffix removed (that is used as a naming convention for the class name of physical operators).
Note: nodeName is used when TreeNode is requested for simpleString and asCode.
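A quick sketch (run against an arbitrary query) that shows physical operators reporting their node names without the Exec suffix:

```scala
// e.g. WholeStageCodegenExec reports "WholeStageCodegen" and RangeExec reports "Range"
val nodeNames = spark.range(1).queryExecution.executedPlan.collect {
  case op => op.nodeName
}
println(nodeNames.mkString(", "))
```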
Catalyst — Tree Manipulation Framework
Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.
Note: The Catalyst framework was first introduced in SPARK-1251 Support for optimizing and executing structured queries and became part of Apache Spark on 20/Mar/14 19:12.
The main abstraction in Catalyst is TreeNode that is then used to build trees of Expressions or QueryPlans.
Spark 2.0 uses the Catalyst tree manipulation framework to build an extensible query plan optimizer with a number of query optimizations.
Catalyst supports both rule-based and cost-based optimization.
Debugging Query Execution
The debug package object contains tools for debugging query execution, i.e. a full analysis of structured queries (as Datasets).
Method | Description
---|---
`debug` | Debugging a structured query
`debugCodegen` | Displays the Java source code generated for a structured query in whole-stage code generation (i.e. the output of each WholeStageCodegen subtree in a query plan).
The debug package object is in the org.apache.spark.sql.execution.debug package that you have to import before you can use the debug and debugCodegen methods.

```scala
// Import the package object
import org.apache.spark.sql.execution.debug._

// Every Dataset (incl. DataFrame) now has the debug and debugCodegen methods
val q: DataFrame = ...
q.debug
q.debugCodegen
```
Tip: Read up on Package Objects in the Scala programming language.
Internally, the debug package object uses the DebugQuery implicit class that "extends" the Dataset[_] Scala type with the debug methods.

```scala
implicit class DebugQuery(query: Dataset[_]) {
  def debug(): Unit = ...
  def debugCodegen(): Unit = ...
}
```
Tip: Read up on Implicit Classes in the official documentation of the Scala programming language.
Debugging Dataset — debug Method

```scala
debug(): Unit
```
debug requests the QueryExecution (of the structured query) for the optimized physical query plan.

debug transforms the optimized physical query plan to add a new DebugExec physical operator for every physical operator.

debug requests the query plan to execute and then counts the number of rows in the result. It prints out the following message:

```
Results returned: [count]
```
In the end, debug requests every DebugExec physical operator (in the query plan) to dumpStats.

```scala
val q = spark.range(10).where('id === 4)

scala> :type q
org.apache.spark.sql.Dataset[Long]

// Extend Dataset[Long] with debug and debugCodegen methods
import org.apache.spark.sql.execution.debug._

scala> q.debug
Results returned: 1
== WholeStageCodegen ==
Tuples output: 1
 id LongType: {java.lang.Long}
== Filter (id#0L = 4) ==
Tuples output: 0
 id LongType: {}
== Range (0, 10, step=1, splits=8) ==
Tuples output: 0
 id LongType: {}
```
Displaying Java Source Code Generated for Structured Query in Whole-Stage Code Generation ("Debugging" Codegen) — debugCodegen Method

```scala
debugCodegen(): Unit
```
debugCodegen requests the QueryExecution (of the structured query) for the optimized physical query plan.

In the end, debugCodegen simply prints out the codegenString of the query plan to the standard output.
```scala
import org.apache.spark.sql.execution.debug._

scala> spark.range(10).where('id === 4).debugCodegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*Filter (id#29L = 4)
+- *Range (0, 10, splits=8)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
...
```
Case Study: Number of Partitions for groupBy Aggregation
Important: As it fairly often happens in my life, right after I had described the discovery I found out I was wrong and the "Aha moment" was gone. Until I thought about the issue again and took the shortest path possible. See Case 4 for the definitive solution. I'm leaving the page with no changes in-between so you can read it and learn from my mistakes.
The goal of the case study is to fine-tune the number of partitions used for groupBy aggregation.

Given the following 2-partition dataset, the task is to write a structured query so there are no empty partitions (or as few as possible).
```scala
// 2-partition dataset
val ids = spark.range(start = 0, end = 4, step = 1, numPartitions = 2)

scala> ids.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+

scala> ids.rdd.toDebugString
res1: String =
(2) MapPartitionsRDD[8] at rdd at <console>:26 []
 |  MapPartitionsRDD[7] at rdd at <console>:26 []
 |  MapPartitionsRDD[6] at rdd at <console>:26 []
 |  MapPartitionsRDD[5] at rdd at <console>:26 []
 |  ParallelCollectionRDD[4] at rdd at <console>:26 []
```
Note: By default Spark SQL uses the spark.sql.shuffle.partitions number of partitions for aggregations and joins, i.e. 200 by default. That often leads to an explosion of partitions for nothing, which does impact the performance of a query since these 200 tasks (per partition) all have to start and finish before you get the result. Less is more, remember?
Case 1: Default Number of Partitions — spark.sql.shuffle.partitions Property
This is the moment when you learn that sometimes relying on defaults may lead to poor performance.
Think how many partitions the following query really requires?
```scala
val groupingExpr = 'id % 2 as "group"

val q = ids.
  groupBy(groupingExpr).
  agg(count($"id") as "count")
```
You may have expected to have at most 2 partitions given the number of groups.
Wrong!
```scala
scala> q.explain
== Physical Plan ==
*HashAggregate(keys=[(id#0L % 2)#17L], functions=[count(1)])
+- Exchange hashpartitioning((id#0L % 2)#17L, 200)
   +- *HashAggregate(keys=[(id#0L % 2) AS (id#0L % 2)#17L], functions=[partial_count(1)])
      +- *Range (0, 4, step=1, splits=2)

scala> q.rdd.toDebugString
res5: String =
(200) MapPartitionsRDD[16] at rdd at <console>:30 []
  |   MapPartitionsRDD[15] at rdd at <console>:30 []
  |   MapPartitionsRDD[14] at rdd at <console>:30 []
  |   ShuffledRowRDD[13] at rdd at <console>:30 []
  +-(2) MapPartitionsRDD[12] at rdd at <console>:30 []
     |  MapPartitionsRDD[11] at rdd at <console>:30 []
     |  MapPartitionsRDD[10] at rdd at <console>:30 []
     |  ParallelCollectionRDD[9] at rdd at <console>:30 []
```
When you execute the query you should see 200 or so partitions in use in web UI.
```scala
scala> q.show
+-----+-----+
|group|count|
+-----+-----+
|    0|    2|
|    1|    2|
+-----+-----+
```
Note: The number of Succeeded Jobs is 5.
Case 2: Using repartition Operator
Let's rewrite the query to use the repartition operator.

The repartition operator is indeed a step in the right direction when used with caution, as it may lead to an unnecessary shuffle (aka exchange in Spark SQL's parlance).
Think how many partitions the following query really requires?
```scala
val groupingExpr = 'id % 2 as "group"

val q = ids.
  repartition(groupingExpr). // <-- repartition per groupBy expression
  groupBy(groupingExpr).
  agg(count($"id") as "count")
```
You may have expected 2 partitions again?!
Wrong!
```scala
scala> q.explain
== Physical Plan ==
*HashAggregate(keys=[(id#6L % 2)#105L], functions=[count(1)])
+- Exchange hashpartitioning((id#6L % 2)#105L, 200)
   +- *HashAggregate(keys=[(id#6L % 2) AS (id#6L % 2)#105L], functions=[partial_count(1)])
      +- Exchange hashpartitioning((id#6L % 2), 200)
         +- *Range (0, 4, step=1, splits=2)

scala> q.rdd.toDebugString
res1: String =
(200) MapPartitionsRDD[57] at rdd at <console>:30 []
  |   MapPartitionsRDD[56] at rdd at <console>:30 []
  |   MapPartitionsRDD[55] at rdd at <console>:30 []
  |   ShuffledRowRDD[54] at rdd at <console>:30 []
  +-(200) MapPartitionsRDD[53] at rdd at <console>:30 []
      |   MapPartitionsRDD[52] at rdd at <console>:30 []
      |   ShuffledRowRDD[51] at rdd at <console>:30 []
      +-(2) MapPartitionsRDD[50] at rdd at <console>:30 []
         |  MapPartitionsRDD[49] at rdd at <console>:30 []
         |  MapPartitionsRDD[48] at rdd at <console>:30 []
         |  ParallelCollectionRDD[47] at rdd at <console>:30 []
```
Compare the physical plans of the two queries and you will surely regret using the repartition operator in the latter, as you did cause an extra shuffle stage (!)
Case 3: Using repartition Operator With Explicit Number of Partitions
The discovery of the day is to notice that the repartition operator accepts an additional parameter for…the number of partitions (!)

As a matter of fact, there are two variants of the repartition operator with the number of partitions, and the trick is to use the one with partition expressions (that will be used for grouping as well as…hash partitioning).
```scala
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
```
Can you think of the number of partitions the following query uses? I’m sure you have guessed correctly!
```scala
val groupingExpr = 'id % 2 as "group"

val q = ids.
  repartition(numPartitions = 2, groupingExpr). // <-- repartition per groupBy expression
  groupBy(groupingExpr).
  agg(count($"id") as "count")
```
You may have expected 2 partitions again?!
Correct!
```scala
scala> q.explain
== Physical Plan ==
*HashAggregate(keys=[(id#6L % 2)#129L], functions=[count(1)])
+- Exchange hashpartitioning((id#6L % 2)#129L, 200)
   +- *HashAggregate(keys=[(id#6L % 2) AS (id#6L % 2)#129L], functions=[partial_count(1)])
      +- Exchange hashpartitioning((id#6L % 2), 2)
         +- *Range (0, 4, step=1, splits=2)

scala> q.rdd.toDebugString
res14: String =
(200) MapPartitionsRDD[78] at rdd at <console>:30 []
  |   MapPartitionsRDD[77] at rdd at <console>:30 []
  |   MapPartitionsRDD[76] at rdd at <console>:30 []
  |   ShuffledRowRDD[75] at rdd at <console>:30 []
  +-(2) MapPartitionsRDD[74] at rdd at <console>:30 []
     |  MapPartitionsRDD[73] at rdd at <console>:30 []
     |  ShuffledRowRDD[72] at rdd at <console>:30 []
     +-(2) MapPartitionsRDD[71] at rdd at <console>:30 []
        |  MapPartitionsRDD[70] at rdd at <console>:30 []
        |  MapPartitionsRDD[69] at rdd at <console>:30 []
        |  ParallelCollectionRDD[68] at rdd at <console>:30 []
```
Congratulations! You are done.
Not quite. Read along!
Case 4: Remember spark.sql.shuffle.partitions Property? Set It Up Properly
```scala
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, 2)
// spark.conf.set(SHUFFLE_PARTITIONS.key, 2)

scala> spark.sessionState.conf.numShufflePartitions
res8: Int = 2

val q = ids.
  groupBy(groupingExpr).
  agg(count($"id") as "count")
```
```scala
scala> q.explain
== Physical Plan ==
*HashAggregate(keys=[(id#0L % 2)#40L], functions=[count(1)])
+- Exchange hashpartitioning((id#0L % 2)#40L, 2)
   +- *HashAggregate(keys=[(id#0L % 2) AS (id#0L % 2)#40L], functions=[partial_count(1)])
      +- *Range (0, 4, step=1, splits=2)

scala> q.rdd.toDebugString
res10: String =
(2) MapPartitionsRDD[31] at rdd at <console>:31 []
 |  MapPartitionsRDD[30] at rdd at <console>:31 []
 |  MapPartitionsRDD[29] at rdd at <console>:31 []
 |  ShuffledRowRDD[28] at rdd at <console>:31 []
 +-(2) MapPartitionsRDD[27] at rdd at <console>:31 []
    |  MapPartitionsRDD[26] at rdd at <console>:31 []
    |  MapPartitionsRDD[25] at rdd at <console>:31 []
    |  ParallelCollectionRDD[24] at rdd at <console>:31 []
```
Note: The number of Succeeded Jobs is 2.
Congratulations! You are done now.
Spark SQL's Performance Tuning Tips and Tricks (aka Case Studies)
From time to time I'm lucky enough to find ways to optimize structured queries in Spark SQL. These findings (or discoveries) usually fall more into a study category than a single topic, and so the goal of the Spark SQL's Performance Tuning Tips and Tricks chapter is to have a single place for the so-called tips and tricks.
Others
- Avoid ObjectType as it turns whole-stage Java code generation off.
- Keep whole-stage codegen requirements in mind, in particular avoid physical operators with the supportCodegen flag off.