spark-sql, Spark Technology Sharing (spark技术分享), page 38

ClusteredDistribution

ClusteredDistribution is a Distribution that creates a HashPartitioning for the clustering expressions and a requested number of partitions.

ClusteredDistribution requires that the clustering expressions should not be empty (i.e. Nil).

ClusteredDistribution is created when the following physical operators are requested for a required child distribution:

MapGroupsExec, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec
Spark Structured Streaming’s FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec, StreamingSymmetricHashJoinExec
SparkR’s FlatMapGroupsInRExec
PySpark’s FlatMapGroupsInPandasExec

ClusteredDistribution is used when:

DataSourcePartitioning, SinglePartition, HashPartitioning, and RangePartitioning are requested to check whether they satisfy a required distribution (`satisfies`)
EnsureRequirements is requested to add an ExchangeCoordinator for Adaptive Query Execution

`createPartitioning` Method



createPartitioning(numPartitions: Int): Partitioning


Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning creates a HashPartitioning for the clustering expressions and the input numPartitions.

createPartitioning reports an AssertionError when the required number of partitions is defined and is not the input numPartitions.



This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions].


Creating ClusteredDistribution Instance

ClusteredDistribution takes the following when created:

Clustering expressions
Required number of partitions (default: None)

Note	`None` for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property with the default of `200` partitions).
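The behaviour described above can be illustrated with a small stand-alone model. This is a sketch only, not Spark's source code: `Expr`, `HashPartitioningModel` and `ClusteredDistributionModel` are hypothetical stand-ins for Catalyst's expressions, HashPartitioning and ClusteredDistribution, the `require` message is made up for illustration, and the assertion condition (it fails only when a required number of partitions is defined and differs from the input) is an assumption based on the error message quoted earlier.

```scala
object ClusteredDistributionSketch {
  // Hypothetical stand-in for a Catalyst expression (just a column name here).
  type Expr = String

  // Hypothetical stand-in for HashPartitioning.
  final case class HashPartitioningModel(expressions: Seq[Expr], numPartitions: Int)

  // Hypothetical stand-in for ClusteredDistribution.
  final case class ClusteredDistributionModel(
      clustering: Seq[Expr],
      requiredNumPartitions: Option[Int] = None) {

    // The clustering expressions must not be empty (illustrative message).
    require(clustering.nonEmpty, "The clustering expressions should not be Nil.")

    def createPartitioning(numPartitions: Int): HashPartitioningModel = {
      // Assumption: the check passes when no number of partitions is required,
      // or when the required number equals the requested one.
      assert(
        requiredNumPartitions.forall(_ == numPartitions),
        s"This ClusteredDistribution requires ${requiredNumPartitions.get} partitions, " +
          s"but the actual number of partitions is $numPartitions.")
      HashPartitioningModel(clustering, numPartitions)
    }
  }
}
```

With `requiredNumPartitions` left as `None`, any `numPartitions` is accepted. With `Some(8)`, only `createPartitioning(8)` succeeds in this model.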

BroadcastDistribution

BroadcastDistribution is a Distribution that indicates to use one partition only and…FIXME.

BroadcastDistribution is created when:

BroadcastHashJoinExec is requested for required child output distributions (with HashedRelationBroadcastMode of the build join keys)
BroadcastNestedLoopJoinExec is requested for required child output distributions (with IdentityBroadcastMode)

BroadcastDistribution takes a BroadcastMode when created.

Note	`BroadcastDistribution` is converted to a BroadcastExchangeExec physical operator when EnsureRequirements physical query plan optimization is executed (and enforces partition requirements for data distribution and ordering).

`createPartitioning` Method



createPartitioning(numPartitions: Int): Partitioning


Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning…FIXME

AllTuples

AllTuples is a Distribution that indicates to use one partition only.

`createPartitioning` Method



createPartitioning(numPartitions: Int): Partitioning


Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning…FIXME

Distribution — Contract For Data Distribution Across Partitions

2012-07-24, admin, 1639 views

Distribution is the contract of…FIXME



package org.apache.spark.sql.catalyst.plans.physical

sealed trait Distribution {
  def requiredNumPartitions: Option[Int]
  def createPartitioning(numPartitions: Int): Partitioning
}


Note	`Distribution` is a Scala `sealed` contract, which means that all possible distributions are defined in the same compilation unit (file).

Method Description

requiredNumPartitions

Gives the required number of partitions for a distribution.

Used exclusively when EnsureRequirements physical optimization is requested to enforce partition requirements of a physical operator (when a child operator’s output partitioning does not satisfy the required child distribution, which leads to inserting a ShuffleExchangeExec operator into the physical plan).

Note	`None` for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property with the default of `200` partitions).

createPartitioning

Creates a Partitioning for a given number of partitions.

Used exclusively when EnsureRequirements physical optimization is requested to enforce partition requirements of a physical operator (and creates a ShuffleExchangeExec physical operator with a required Partitioning).
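How the two methods of the contract could work together when a shuffle is needed can be sketched with a stand-alone model. Everything here is hypothetical and simplified: `DistributionModel`, `ClusteredModel`, `HashPartitioningModel` and `shufflePartitioning` are illustrative names, and the fallback to a default number of partitions through `getOrElse` is an assumption based on the note about `None` above.

```scala
object EnsureRequirementsSketch {
  // Hypothetical stand-in for a Partitioning.
  sealed trait PartitioningModel { def numPartitions: Int }
  final case class HashPartitioningModel(keys: Seq[String], numPartitions: Int)
    extends PartitioningModel

  // Simplified copy of the two-method contract.
  sealed trait DistributionModel {
    def requiredNumPartitions: Option[Int]
    def createPartitioning(numPartitions: Int): PartitioningModel
  }

  // Hypothetical stand-in for a clustered distribution.
  final case class ClusteredModel(
      keys: Seq[String],
      requiredNumPartitions: Option[Int] = None) extends DistributionModel {
    def createPartitioning(numPartitions: Int): PartitioningModel =
      HashPartitioningModel(keys, numPartitions)
  }

  // Assumption: None means "any number", so a session default
  // (e.g. the value of spark.sql.shuffle.partitions) is used instead.
  def shufflePartitioning(
      required: DistributionModel,
      defaultNumPartitions: Int): PartitioningModel = {
    val n = required.requiredNumPartitions.getOrElse(defaultNumPartitions)
    required.createPartitioning(n)
  }
}
```

In this model, `shufflePartitioning(ClusteredModel(Seq("id")), 200)` produces a hash partitioning with 200 partitions, while a distribution created with `Some(8)` produces 8 partitions regardless of the default.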

Table 2. Distributions
Distribution	Description
AllTuples
BroadcastDistribution
ClusteredDistribution
HashClusteredDistribution
OrderedDistribution
UnspecifiedDistribution

ExchangeCoordinator

2012-07-23, admin, 1977 views

Caution

FIXME

`postShuffleRDD` Method

Caution

FIXME

Partitioning — Specification of Physical Operator’s Output Partitions

2012-07-22, admin, 1820 views

Partitioning is the contract to hint the Spark Physical Optimizer for the number of partitions the output of a physical operator should be split across.



numPartitions: Int


numPartitions is used in:

EnsureRequirements physical preparation rule to enforce partition requirements of a physical operator
SortMergeJoinExec for outputPartitioning for FullOuter join type
Partitioning.allCompatible

Table 1. Partitioning Schemes (Partitionings) and Their Properties
Partitioning	compatibleWith	guarantees	numPartitions	satisfies
`BroadcastPartitioning`	`BroadcastPartitioning` with the same `BroadcastMode`	Exactly the same `BroadcastPartitioning`	1	BroadcastDistribution with the same `BroadcastMode`
`HashPartitioning` `clustering` expressions `numPartitions`	`HashPartitioning` (when their underlying expressions are semantically equal, i.e. deterministic and canonically equal)	`HashPartitioning` (when their underlying expressions are semantically equal, i.e. deterministic and canonically equal)	Input `numPartitions`	UnspecifiedDistribution ClusteredDistribution with all the hashing expressions included in `clustering` expressions
`PartitioningCollection` `partitionings`	Any `Partitioning` that is compatible with one of the input `partitionings`	Any `Partitioning` that is guaranteed by any of the input `partitionings`	Number of partitions of the first `Partitioning` in the input `partitionings`	Any `Distribution` that is satisfied by any of the input `partitionings`
`RangePartitioning` `ordering` collection of `SortOrder` `numPartitions`	`RangePartitioning` (when semantically equal, i.e. underlying expressions are deterministic and canonically equal)	`RangePartitioning` (when semantically equal, i.e. underlying expressions are deterministic and canonically equal)	Input `numPartitions`	UnspecifiedDistribution OrderedDistribution with `requiredOrdering` that matches the input `ordering` ClusteredDistribution with all the children of the input `ordering` semantically equal to one of the `clustering` expressions
`RoundRobinPartitioning` `numPartitions`	Always negative	Always negative	Input `numPartitions`	UnspecifiedDistribution
`SinglePartition`	Any `Partitioning` with exactly one partition	Any `Partitioning` with exactly one partition	1	Any `Distribution` except BroadcastDistribution
`UnknownPartitioning` `numPartitions`	Always negative	Always negative	Input `numPartitions`	UnspecifiedDistribution
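The `satisfies` column for `HashPartitioning` can be read as a simple rule, sketched below as a stand-alone model. The types are hypothetical stand-ins, expressions are modelled as plain strings, and "semantically equal" is approximated by string equality, which is a simplification of what Catalyst does.

```scala
object SatisfiesSketch {
  // Hypothetical stand-ins for the distributions named in the table.
  sealed trait DistributionModel
  case object UnspecifiedModel extends DistributionModel
  final case class ClusteredModel(clustering: Seq[String]) extends DistributionModel
  final case class OrderedModel(ordering: Seq[String]) extends DistributionModel

  // Hypothetical stand-in for HashPartitioning.
  final case class HashPartitioningModel(expressions: Seq[String], numPartitions: Int) {
    def satisfies(required: DistributionModel): Boolean = required match {
      // Any partitioning satisfies an unspecified distribution.
      case UnspecifiedModel => true
      // Every hashing expression must be among the clustering expressions.
      case ClusteredModel(clustering) => expressions.forall(e => clustering.contains(e))
      // Nothing else is satisfied in this simplified model.
      case _ => false
    }
  }
}
```

For example, a model partitioning hashed on `a` satisfies a clustered distribution on `a, b`, but one hashed on `a, c` does not.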

ProjectEstimation

2012-07-21, admin, 1299 views

ProjectEstimation is…FIXME

Estimating Statistics and Query Hints of Project Logical Operator — `estimate` Method



estimate(project: Project): Option[Statistics]


estimate…FIXME

Note	`estimate` is used exclusively when `BasicStatsPlanVisitor` is requested to estimate statistics and query hints of a Project logical operator.

JoinEstimation

2012-07-20, admin, 1443 views

JoinEstimation is a utility that computes statistics estimates and query hints of a Join logical operator.

JoinEstimation is created exclusively for BasicStatsPlanVisitor to estimate statistics of a Join logical operator.

Note	`BasicStatsPlanVisitor` is used only when cost-based optimization is enabled.

JoinEstimation takes a Join logical operator when created.

When created, JoinEstimation immediately takes the estimated statistics and query hints of the left and right sides of the Join logical operator.



// JoinEstimation requires row count stats for join statistics estimates
// With cost-based optimization off, size in bytes is available only
// That would give no join estimates whatsoever (except size in bytes)
// Make sure that you `--conf spark.sql.cbo.enabled=true`
scala> println(spark.sessionState.conf.cboEnabled)
true

// Build a query with join operator
// From the available data sources tables seem the best...so far
val r1 = spark.range(5)
scala> println(r1.queryExecution.analyzed.stats.simpleString)
sizeInBytes=40.0 B, hints=none

// Make the demo reproducible
val db = spark.catalog.currentDatabase
spark.sharedState.externalCatalog.dropTable(db, table = "t1", ignoreIfNotExists = true, purge = true)
spark.sharedState.externalCatalog.dropTable(db, table = "t2", ignoreIfNotExists = true, purge = true)

// FIXME What relations give row count stats?

// Register tables
spark.range(5).write.saveAsTable("t1")
spark.range(10).write.saveAsTable("t2")

// Refresh internal registries
sql("REFRESH TABLE t1")
sql("REFRESH TABLE t2")

// Calculate row count stats
val tables = Seq("t1", "t2")
tables.map(t => s"ANALYZE TABLE $t COMPUTE STATISTICS").foreach(sql)

val t1 = spark.table("t1")
val t2 = spark.table("t2")

// analyzed plan is just before withCachedData and optimizedPlan plans
// where CostBasedJoinReorder kicks in and optimizes a query using statistics

val t1plan = t1.queryExecution.analyzed
scala> println(t1plan.numberedTreeString)
00 SubqueryAlias t1
01 +- Relation[id#45L] parquet

// Show the stats of every node in the analyzed query plan

val p0 = t1plan.p(0)
scala> println(s"Statistics of ${p0.simpleString}: ${p0.stats.simpleString}")
Statistics of SubqueryAlias t1: sizeInBytes=80.0 B, hints=none

val p1 = t1plan.p(1)
scala> println(s"Statistics of ${p1.simpleString}: ${p1.stats.simpleString}")
Statistics of Relation[id#45L] parquet: sizeInBytes=80.0 B, rowCount=5, hints=none

val t2plan = t2.queryExecution.analyzed

// let's get rid of the SubqueryAlias operator

import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
val t1NoAliasesPlan = EliminateSubqueryAliases(t1plan)
val t2NoAliasesPlan = EliminateSubqueryAliases(t2plan)

// Using Catalyst DSL
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans._
val plan = t1NoAliasesPlan.join(
  otherPlan = t2NoAliasesPlan,
  joinType = Inner,
  condition = Some($"id".expr))
scala> println(plan.numberedTreeString)
00 'Join Inner, 'id
01 :- Relation[id#45L] parquet
02 +- Relation[id#57L] parquet

// Take Join operator off the logical plan
// JoinEstimation works with Joins only
import org.apache.spark.sql.catalyst.plans.logical.Join
val join = plan.collect { case j: Join => j }.head

// Make sure that row count stats are defined per join side
scala> join.left.stats.rowCount.isDefined
res1: Boolean = true

scala> join.right.stats.rowCount.isDefined
res2: Boolean = true

// Make the example reproducible
// Computing stats is once-only process and the estimates are cached
join.invalidateStatsCache

import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation
val stats = JoinEstimation(join).estimate
scala> :type stats
Option[org.apache.spark.sql.catalyst.plans.logical.Statistics]

// Stats have to be available so Option.get should just work
scala> println(stats.get.simpleString)
Some(sizeInBytes=1200.0 B, rowCount=50, hints=none)


JoinEstimation can estimate statistics and query hints of a Join logical operator with the following join types:

Inner, Cross, LeftOuter, RightOuter, FullOuter, LeftSemi and LeftAnti

For the other join types (e.g. ExistenceJoin), JoinEstimation prints out a DEBUG message to the logs and returns None (to “announce” that no statistics could be computed).



// Demo: Unsupported join type, i.e. ExistenceJoin

// Some parts were copied from the earlier demo
// FIXME Make it self-contained

// Using Catalyst DSL
// Don't even know if such existence join could ever be possible in Spark SQL
// For demo purposes it's OK, isn't it?
import org.apache.spark.sql.catalyst.plans.ExistenceJoin
val left = t1NoAliasesPlan
val right = t2NoAliasesPlan
val plan = left.join(right,
  joinType = ExistenceJoin(exists = 'id.long))

// Take Join operator off the logical plan
// JoinEstimation works with Joins only
import org.apache.spark.sql.catalyst.plans.logical.Join
val join = plan.collect { case j: Join => j }.head

// Enable DEBUG logging level
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation").setLevel(Level.DEBUG)

scala> val stats = JoinEstimation(join).estimate
18/06/13 10:29:37 DEBUG JoinEstimation: [CBO] Unsupported join type: ExistenceJoin(id#35L)
stats: Option[org.apache.spark.sql.catalyst.plans.logical.Statistics] = None




// FIXME Describe the purpose of the demo

// Using Catalyst DSL
import org.apache.spark.sql.catalyst.dsl.plans._

val t1 = table(ref = "t1")

// HACK: Disable symbolToColumn implicit conversion
// It is imported automatically in spark-shell (and makes demos impossible)
// implicit def symbolToColumn(s: Symbol): org.apache.spark.sql.ColumnName
trait ThatWasABadIdea
implicit def symbolToColumn(ack: ThatWasABadIdea) = ack

import org.apache.spark.sql.catalyst.dsl.expressions._
val id = 'id.long

val t2 = table("t2")
import org.apache.spark.sql.catalyst.plans.LeftSemi
val plan = t1.join(t2, joinType = LeftSemi, condition = Some(id))
scala> println(plan.numberedTreeString)
00 'Join LeftSemi, id#2: bigint
01 :- 'UnresolvedRelation `t1`
02 +- 'UnresolvedRelation `t2`

import org.apache.spark.sql.catalyst.plans.logical.Join
val join = plan match { case j: Join => j }

import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation

// FIXME java.lang.UnsupportedOperationException
val stats = JoinEstimation(join).estimate


Tip

Enable DEBUG logging level for org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation logger to see what happens inside.

Add the following line to conf/log4j.properties:



log4j.logger.org.apache.spark.sql.catalyst.plans.logical.statsEstimation.JoinEstimation=DEBUG


Refer to Logging.

`estimateInnerOuterJoin` Internal Method



estimateInnerOuterJoin(): Option[Statistics]


estimateInnerOuterJoin destructures Join logical operator into a join type with the left and right keys.

estimateInnerOuterJoin simply returns None (i.e. nothing) when either side of the Join logical operator has no row count statistic.

Note	`estimateInnerOuterJoin` is used exclusively when `JoinEstimation` is requested to estimate statistics and query hints of a Join logical operator for `Inner`, `Cross`, `LeftOuter`, `RightOuter` and `FullOuter` joins.

`computeByNdv` Internal Method



computeByNdv(
  leftKey: AttributeReference,
  rightKey: AttributeReference,
  newMin: Option[Any],
  newMax: Option[Any]): (BigInt, ColumnStat)


computeByNdv…FIXME

Note	`computeByNdv` is used exclusively when `JoinEstimation` is requested for computeCardinalityAndStats.

`computeCardinalityAndStats` Internal Method



computeCardinalityAndStats(
  keyPairs: Seq[(AttributeReference, AttributeReference)]): (BigInt, AttributeMap[ColumnStat])


computeCardinalityAndStats…FIXME

Note	`computeCardinalityAndStats` is used exclusively when `JoinEstimation` is requested for estimateInnerOuterJoin.

Computing Join Cardinality Using Equi-Height Histograms — `computeByHistogram` Internal Method



computeByHistogram(
  leftKey: AttributeReference,
  rightKey: AttributeReference,
  leftHistogram: Histogram,
  rightHistogram: Histogram,
  newMin: Option[Any],
  newMax: Option[Any]): (BigInt, ColumnStat)


computeByHistogram…FIXME

Note	`computeByHistogram` is used exclusively when `JoinEstimation` is requested for computeCardinalityAndStats (and the histograms of both column attributes used in a join are available).

Estimating Statistics for Left Semi and Left Anti Joins — `estimateLeftSemiAntiJoin` Internal Method



estimateLeftSemiAntiJoin(): Option[Statistics]


estimateLeftSemiAntiJoin estimates statistics of the Join logical operator only when estimated row count statistic is available. Otherwise, estimateLeftSemiAntiJoin simply returns None (i.e. no statistics estimated).

Note	The row count statistic of a table is available only after the ANALYZE TABLE COMPUTE STATISTICS SQL command has been executed.

If available, estimateLeftSemiAntiJoin takes the estimated row count statistic of the left side of the Join operator.

Note	Use ANALYZE TABLE COMPUTE STATISTICS SQL command on the left logical plan to compute row count statistics.

Note	Use ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command on the left logical plan to generate column (equi-height) histograms for more accurate estimations.

In the end, estimateLeftSemiAntiJoin creates a new Statistics with the following estimates:

Total size (in bytes) is the output size computed from the output schema of the join, the row count statistic (aka output rows) and the column histograms
Row count is exactly the row count of the left side
Column histograms are exactly the column histograms of the left side

Note	`estimateLeftSemiAntiJoin` is used exclusively when `JoinEstimation` is requested to estimate statistics and query hints for `LeftSemi` and `LeftAnti` joins.
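The propagation of the left side's statistics described above can be sketched with a stand-alone model. `StatsModel` is a hypothetical, much simplified stand-in for Catalyst's Statistics, and the fixed `bytesPerRow` parameter is an assumption that replaces the real output-size computation from the join's output schema.

```scala
object LeftSemiAntiSketch {
  // Hypothetical, simplified stand-in for Catalyst's Statistics.
  final case class StatsModel(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt] = None,
      columnStats: Map[String, String] = Map.empty)

  // Returns None when the left side has no row count statistic.
  // Otherwise the row count and column statistics of the left side
  // are propagated unchanged and the size is recomputed.
  def estimateLeftSemiAnti(left: StatsModel, bytesPerRow: Int): Option[StatsModel] =
    left.rowCount.map { rows =>
      StatsModel(
        // Assumption: output size = output rows * a fixed row width.
        sizeInBytes = rows * bytesPerRow,
        rowCount = Some(rows),
        columnStats = left.columnStats)
    }
}
```

In this model a left side with 5 rows and an 8-byte row width gives an estimate of 5 rows and 40 bytes, while a left side without a row count gives `None`.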

Estimating Statistics and Query Hints of Join Logical Operator — `estimate` Method



estimate: Option[Statistics]


estimate estimates statistics and query hints of the Join logical operator per join type:

For Inner, Cross, LeftOuter, RightOuter and FullOuter join types, estimate uses estimateInnerOuterJoin
For LeftSemi and LeftAnti join types, estimate uses estimateLeftSemiAntiJoin

For other join types, estimate prints out the following DEBUG message to the logs and returns None (to “announce” that no statistics could be computed).



[CBO] Unsupported join type: [joinType]


Note	`estimate` is used exclusively when `BasicStatsPlanVisitor` is requested to estimate statistics and query hints of a Join logical operator.
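The dispatch on the join type can be sketched as a pattern match. This is a stand-alone model, not Spark's code: the `JoinTypeModel` hierarchy and `chooseEstimator` are hypothetical names, the function only reports which internal method would be picked, and `println` stands in for DEBUG logging.

```scala
object JoinEstimateDispatchSketch {
  // Hypothetical stand-ins for Catalyst's join types.
  sealed trait JoinTypeModel
  case object Inner extends JoinTypeModel
  case object Cross extends JoinTypeModel
  case object LeftOuter extends JoinTypeModel
  case object RightOuter extends JoinTypeModel
  case object FullOuter extends JoinTypeModel
  case object LeftSemi extends JoinTypeModel
  case object LeftAnti extends JoinTypeModel
  final case class ExistenceJoin(exists: String) extends JoinTypeModel

  // Returns the name of the estimator chosen for a join type,
  // or None for join types that are not supported.
  def chooseEstimator(joinType: JoinTypeModel): Option[String] = joinType match {
    case Inner | Cross | LeftOuter | RightOuter | FullOuter =>
      Some("estimateInnerOuterJoin")
    case LeftSemi | LeftAnti =>
      Some("estimateLeftSemiAntiJoin")
    case other =>
      // Stands in for the DEBUG log message.
      println(s"[CBO] Unsupported join type: $other")
      None
  }
}
```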

FilterEstimation

2012-07-19, admin, 1452 views

FilterEstimation is…FIXME

`computeEqualityPossibilityByHistogram` Internal Method



computeEqualityPossibilityByHistogram(literal: Literal, colStat: ColumnStat): Double


computeEqualityPossibilityByHistogram…FIXME

Note	`computeEqualityPossibilityByHistogram` is used when…FIXME

`computeComparisonPossibilityByHistogram` Internal Method



computeComparisonPossibilityByHistogram(op: BinaryComparison, literal: Literal, colStat: ColumnStat): Double


computeComparisonPossibilityByHistogram…FIXME

Note	`computeComparisonPossibilityByHistogram` is used when…FIXME

`update` Method



update(a: Attribute, stats: ColumnStat): Unit


update…FIXME

Note	`update` is used when…FIXME

AggregateEstimation

2012-07-18, admin, 1527 views

AggregateEstimation is…FIXME

Estimating Statistics and Query Hints of Aggregate Logical Operator — `estimate` Method



estimate(agg: Aggregate): Option[Statistics]


estimate…FIXME

Note	`estimate` is used exclusively when `BasicStatsPlanVisitor` is requested to estimate statistics and query hints of an Aggregate logical operator.
