Follow spark技术分享 — digging into the Spark source code and Spark best practices

Repartition and RepartitionByExpression


Repartition Logical Operators — Repartition and RepartitionByExpression

Repartition and RepartitionByExpression (repartition operations in short) are unary logical operators that create a new RDD that has exactly numPartitions partitions.

Note
RepartitionByExpression is also called the distribute operator.

Repartition is the result of coalesce or repartition (with no partition expressions defined) operators.

RepartitionByExpression is the result of the following operators:

  • repartition (with partition expressions defined)

  • SQL's DISTRIBUTE BY and CLUSTER BY clauses

Repartition and RepartitionByExpression logical operators are described by:

  • shuffle flag

  • target number of partitions

Note
BasicOperators execution planning strategy resolves Repartition to the ShuffleExchangeExec (with RoundRobinPartitioning partitioning scheme) or CoalesceExec physical operator, depending on whether the shuffle flag is enabled or disabled, respectively.
Note
BasicOperators strategy resolves RepartitionByExpression to ShuffleExchangeExec physical operator with HashPartitioning partitioning scheme.
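The two notes above can be checked interactively. A minimal sketch (assuming Spark 2.3+ on the classpath and a local SparkSession; the explain() output is indicative, not exact):

```scala
import org.apache.spark.sql.SparkSession

// A local SparkSession for exploration (assumption: nothing else is running)
val spark = SparkSession.builder().master("local[*]").appName("repartition-demo").getOrCreate()
import spark.implicits._

val df = (0 until 100).toDF("id")

// repartition(n) => Repartition (shuffle enabled) => ShuffleExchangeExec with RoundRobinPartitioning
df.repartition(4).explain()

// coalesce(n) => Repartition (shuffle disabled) => CoalesceExec
df.coalesce(2).explain()

// repartition(n, exprs) => RepartitionByExpression => ShuffleExchangeExec with HashPartitioning
df.repartition(4, $"id").explain()
```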

Repartition Operation Optimizations

Project


Project Unary Logical Operator

Project is a unary logical operator that takes the following when created:

Project is created to represent the following:

  • Dataset operators, e.g. joinWith, select (incl. selectUntyped) and unionByName

  • KeyValueGroupedDataset operators, e.g. keys and mapValues

  • CreateViewCommand logical command when executed (and its aliasPlan)

  • SQL’s SELECT queries with named expressions

Project can also appear in a logical plan after analysis or optimization phases.

Note
Nondeterministic expressions are allowed in the Project logical operator, which CheckAnalysis enforces.

The output schema of a Project is…​FIXME

maxRows…​FIXME

resolved…​FIXME

validConstraints…​FIXME

Tip

Use select operator from Catalyst DSL to create a Project logical operator, e.g. for testing or Spark SQL internals exploration.
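For example (a sketch using Spark SQL's internal Catalyst DSL; package names as of Spark 2.3):

```scala
import org.apache.spark.sql.catalyst.dsl.plans._        // table, select, ...
import org.apache.spark.sql.catalyst.dsl.expressions._  // Symbol-to-attribute conversions
import org.apache.spark.sql.catalyst.plans.logical.Project

// select on an (unresolved) relation creates a Project logical operator
val plan = table("t1").select('id, 'name)

assert(plan.isInstanceOf[Project])
println(plan.numberedTreeString)
```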

Pivot


Pivot Unary Logical Operator

Pivot is a unary logical operator that represents the pivot operator.

Pivot is created when RelationalGroupedDataset creates a DataFrame for an aggregate operator.

Analysis Phase

Pivot operator is resolved at analysis phase in the following logical evaluation rules:

Pivot operator “disappears” behind (i.e. is converted to) an Aggregate logical operator (possibly under a Project operator).
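A quick way to observe the conversion (a sketch, assuming a SparkSession named spark; column names are made up):

```scala
import spark.implicits._

val sales = Seq(("2023", "Q1", 10), ("2023", "Q2", 20)).toDF("year", "quarter", "amount")

// pivot initially gives a Pivot logical operator...
val pivoted = sales.groupBy("year").pivot("quarter").sum("amount")

// ...but the analyzed plan contains Aggregate (possibly under Project), no Pivot
println(pivoted.queryExecution.analyzed.numberedTreeString)
```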

Creating Pivot Instance

Pivot takes the following when created:

LogicalRelation


LogicalRelation Leaf Logical Operator — Representing BaseRelations in Logical Plan

LogicalRelation is a leaf logical operator that represents a BaseRelation in a logical query plan.

LogicalRelation is created when:

Note

LogicalRelation can be created using apply factory methods that accept BaseRelation with optional CatalogTable.

The simple text representation of a LogicalRelation (aka simpleString) is Relation[output] [relation] (that uses the output and BaseRelation).

refresh Method

Note
refresh is part of LogicalPlan Contract to refresh itself.

refresh requests the FileIndex of a HadoopFsRelation relation to refresh.

Note
refresh does the work for HadoopFsRelation relations only.

Creating LogicalRelation Instance

LogicalRelation takes the following when created:

LogicalRDD


LogicalRDD — Logical Scan Over RDD

LogicalRDD is a leaf logical operator with MultiInstanceRelation support that represents a scan over an RDD of internal binary rows.

LogicalRDD is created when:

Note
LogicalRDD is resolved to RDDScanExec when BasicOperators execution planning strategy is executed.

newInstance Method

Note
newInstance is part of MultiInstanceRelation Contract to…​FIXME.

newInstance…​FIXME

Computing Statistics — computeStats Method

Note
computeStats is part of LeafNode Contract to compute statistics for cost-based optimizer.

computeStats…​FIXME

Creating LogicalRDD Instance

LogicalRDD takes the following when created:

LocalRelation


LocalRelation Leaf Logical Operator

LocalRelation is a leaf logical operator that allows functions like collect or take to be executed locally, i.e. without using Spark executors.

LocalRelation is created when…​FIXME

Note
When Dataset operators can be executed locally, the Dataset is considered local.

LocalRelation represents Datasets that were created from local collections using SparkSession.emptyDataset or SparkSession.createDataset methods and their derivatives like toDF.

A LocalRelation can only be constructed when all of its output attributes are resolved.

The size of the objects (in statistics) is the sum of the default sizes of the attributes, multiplied by the number of records.

When executed, LocalRelation is translated to LocalTableScanExec physical operator.
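Both claims can be observed directly (a sketch, assuming a SparkSession named spark):

```scala
import spark.implicits._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

// A Dataset built from a local collection is backed by a LocalRelation...
val ds = Seq(1, 2, 3).toDS

assert(ds.queryExecution.analyzed.isInstanceOf[LocalRelation])

// ...and is planned as LocalTableScanExec (shown as LocalTableScan)
ds.explain()
```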

Creating LocalRelation Instance

LocalRelation takes the following when created:

LocalRelation initializes the internal registries and counters.

LeafNode


LeafNode — Base Logical Operator with No Child Operators and Optional Statistics

LeafNode is the base of logical operators that have no child operators.

A LeafNode that wants to survive analysis has to override computeStats, which throws an UnsupportedOperationException by default.
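A minimal sketch of such an override (the MyLeaf operator is hypothetical; internal APIs as of Spark 2.3, where computeStats takes no arguments):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// Hypothetical leaf operator that overrides the default (throwing) computeStats
case class MyLeaf(output: Seq[Attribute]) extends LeafNode {
  override def computeStats(): Statistics =
    Statistics(sizeInBytes = 0L) // a real operator would estimate its data size here
}
```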

Table 1. LeafNodes (Direct Implementations)

AnalysisBarrier

DataSourceV2Relation

ExternalRDD

HiveTableRelation

InMemoryRelation

LocalRelation

LogicalRDD

LogicalRelation

OneRowRelation

Range

UnresolvedCatalogRelation

UnresolvedInlineTable

UnresolvedRelation

UnresolvedTableValuedFunction

Computing Statistics — computeStats Method

computeStats simply throws an UnsupportedOperationException.

Note
Logical operators, e.g. ExternalRDD, LogicalRDD and DataSourceV2Relation, or relations, e.g. HadoopFsRelation or BaseRelation, use spark.sql.defaultSizeInBytes internal property for the default estimated size if the statistics could not be computed.
Note
computeStats is used exclusively when SizeInBytesOnlyStatsPlanVisitor uses the default case to compute the size statistic (in bytes) for a logical operator.
