
CreateDataSourceTableAsSelectCommand


CreateDataSourceTableAsSelectCommand Logical Command

CreateDataSourceTableAsSelectCommand is a logical command that creates a data source table and inserts the result of a structured query into it (i.e. CREATE TABLE … USING … AS SELECT).
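For context, a CTAS statement over a data source table is what gets planned as this command; a minimal sketch (table and column names are illustrative only):

```sql
-- CREATE TABLE ... USING ... AS SELECT over a data source (here parquet)
-- is planned as CreateDataSourceTableAsSelectCommand
CREATE TABLE sales_summary
USING parquet
AS SELECT region, sum(amount) AS total FROM sales GROUP BY region
```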

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

AnalyzeTableCommand


AnalyzeTableCommand Logical Command — Computing Table-Level Statistics (Total Size and Row Count)

AnalyzeTableCommand is a logical command that computes statistics (i.e. total size and row count) for a table and stores the stats in a metastore.

AnalyzeTableCommand is created exclusively for ANALYZE TABLE with neither a PARTITION specification nor a FOR COLUMNS clause.

Executing Logical Command (Computing Table-Level Statistics and Altering Metastore) — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run requests the session-specific SessionCatalog for the metadata of the table and makes sure that it is not a view.

Note
run uses the input SparkSession to access the session-specific SessionState that in turn gives access to the current SessionCatalog.


run computes the total size and, unless the NOSCAN flag is enabled, the row count statistic of the table.

Note
run uses SparkSession to find the table in a metastore.

run throws an AnalysisException when executed on a view.

Note

Computing the row count statistic triggers a Spark job that counts the rows in the table (which happens for ANALYZE TABLE without the NOSCAN flag).
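The two statement forms handled by AnalyzeTableCommand can be sketched as follows (the table name is illustrative only):

```sql
ANALYZE TABLE sales COMPUTE STATISTICS         -- total size and row count (runs a Spark job)
ANALYZE TABLE sales COMPUTE STATISTICS NOSCAN  -- total size only (noscan flag enabled)
```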

Creating AnalyzeTableCommand Instance

AnalyzeTableCommand takes the following when created:

  • TableIdentifier

  • noscan flag (enabled by default) that indicates whether the NOSCAN option was used

AnalyzePartitionCommand


AnalyzePartitionCommand Logical Command — Computing Partition-Level Statistics (Total Size and Row Count)

AnalyzePartitionCommand is a logical command that computes statistics (i.e. total size and row count) for table partitions and stores the stats in a metastore.

AnalyzePartitionCommand is created exclusively for ANALYZE TABLE with a PARTITION specification and no FOR COLUMNS clause.
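A sketch of the statement forms that are planned as AnalyzePartitionCommand (table and partition column names are illustrative only):

```sql
-- a PARTITION specification with a value analyzes the matching partition
ANALYZE TABLE logs PARTITION (ds = '2018-01-01') COMPUTE STATISTICS
-- a partition column without a value covers all its partitions
ANALYZE TABLE logs PARTITION (ds) COMPUTE STATISTICS NOSCAN
```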

Executing Logical Command (Computing Partition-Level Statistics and Altering Metastore) — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run requests the session-specific SessionCatalog for the metadata of the table and makes sure that it is not a view.

Note
run uses the input SparkSession to access the session-specific SessionState that in turn is used to access the current SessionCatalog.

run requests the session-specific SessionCatalog for the partitions per the partition specification.

run finishes when the table has no partitions defined in a metastore.

run calculates the total size (in bytes) of every table partition (aka partition location size) and creates a CatalogStatistics (with the row count statistic computed earlier) for each partition whose current statistics differ from those recorded in the metastore.

In the end, run alters table partition metadata for partitions with the statistics changed.

run reports a NoSuchPartitionException when the partition specification does not match partitions in the metastore.

run reports an AnalysisException when executed on a view.

Computing Row Count Statistics Per Partition — calculateRowCountsPerPartition Internal Method

calculateRowCountsPerPartition…​FIXME

Note
calculateRowCountsPerPartition is used exclusively when AnalyzePartitionCommand is executed.

getPartitionSpec Internal Method

getPartitionSpec…​FIXME

Note
getPartitionSpec is used exclusively when AnalyzePartitionCommand is executed.

Creating AnalyzePartitionCommand Instance

AnalyzePartitionCommand takes the following when created:

  • TableIdentifier

  • Partition specification

  • noscan flag (enabled by default) that indicates whether the NOSCAN option was used

AnalyzeColumnCommand


AnalyzeColumnCommand Logical Command for ANALYZE TABLE…COMPUTE STATISTICS FOR COLUMNS SQL Command

AnalyzeColumnCommand is a logical command for ANALYZE TABLE with FOR COLUMNS clause (and no PARTITION specification).

AnalyzeColumnCommand can generate column histograms when spark.sql.statistics.histogram.enabled configuration property is turned on (which is disabled by default). AnalyzeColumnCommand supports column histograms for the following data types:

  • IntegralType

  • DecimalType

  • DoubleType

  • FloatType

  • DateType

  • TimestampType

Note
Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.
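A sketch of enabling histograms and computing column statistics (table and column names are illustrative only):

```sql
SET spark.sql.statistics.histogram.enabled=true;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount;
```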

Note
AnalyzeColumnCommand is described by analyze labeled alternative in statement expression in SqlBase.g4 and parsed using SparkSqlAstBuilder.

Note
AnalyzeColumnCommand is not supported on views.

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run calculates the following statistics:

  • sizeInBytes

  • stats for each column

Caution
FIXME

Computing Statistics for Specified Columns — computeColumnStats Internal Method

computeColumnStats…​FIXME

Note
computeColumnStats is used exclusively when AnalyzeColumnCommand is executed.

computePercentiles Internal Method

computePercentiles…​FIXME

Note
computePercentiles is used exclusively when AnalyzeColumnCommand is executed (and computes column statistics).

Creating AnalyzeColumnCommand Instance

AnalyzeColumnCommand takes the following when created:

  • TableIdentifier

  • Column names

AnalysisBarrier


AnalysisBarrier Leaf Logical Operator — Hiding Child Query Plan in Analysis

AnalysisBarrier is a leaf logical operator that is a wrapper of an analyzed logical plan to hide it from the Spark Analyzer. The purpose of AnalysisBarrier is to prevent the child logical plan from being analyzed again (and increasing the time spent on query analysis).

AnalysisBarrier is created when:

AnalysisBarrier takes a single child logical query plan when created.

AnalysisBarrier returns the child logical query plan when requested for the inner nodes (that should be shown as an inner nested tree of this node).

AnalysisBarrier simply requests the child logical query plan for the output schema attributes.

AnalysisBarrier simply requests the child logical query plan for the isStreaming flag.

AnalysisBarrier simply requests the child logical operator for the canonicalized version.

AlterViewAsCommand


AlterViewAsCommand Logical Command

AlterViewAsCommand is a logical command for ALTER VIEW SQL statement to alter a view.

AlterViewAsCommand works with a table identifier (as TableIdentifier), the original SQL text, and a LogicalPlan for the SQL query.

Note
AlterViewAsCommand is described by alterViewQuery labeled alternative in statement expression in SqlBase.g4 and parsed using SparkSqlParser.

When executed, AlterViewAsCommand attempts to alter a temporary view in the current SessionCatalog first, and if that “fails”, alters the permanent view.
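A minimal sketch of the statement this command handles (view, table, and column names are illustrative only):

```sql
ALTER VIEW sales_v AS SELECT region, amount FROM sales WHERE amount > 0
```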

Executing Logical Command — run Method

Note
run is part of RunnableCommand Contract to execute (run) a logical command.

run…​FIXME

alterPermanentView Internal Method

alterPermanentView…​FIXME

Note
alterPermanentView is used when…​FIXME

Aggregate


Aggregate Unary Logical Operator

Aggregate is a unary logical operator that holds the following:

Aggregate is created to represent the following (after a logical plan is analyzed):

Note
Aggregate logical operator is translated to one of HashAggregateExec, ObjectHashAggregateExec or SortAggregateExec physical operators in Aggregation execution planning strategy.
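The translation can be observed in spark-shell (this sketch assumes a running SparkSession named spark):

```scala
// groupBy/count is analyzed into an Aggregate logical operator and planned
// here as HashAggregateExec by the Aggregation execution planning strategy
val q = spark.range(10).groupBy("id").count
q.explain(extended = true)
// the logical plans show Aggregate; the physical plan shows HashAggregateExec
```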

Table 1. Aggregate’s Properties
Name Description

maxRows

Child logical plan‘s maxRows

Note
Part of LogicalPlan contract.

output

Note
Part of QueryPlan contract.

resolved

Enabled when:

Note
Part of LogicalPlan contract.

validConstraints

The (expression) constraints of child logical plan and non-aggregate aggregate named expressions.

Note
Part of QueryPlan contract.

Rule-Based Logical Query Optimization Phase

PushDownPredicate logical plan optimization applies so-called filter pushdown to an Aggregate operator when under a Filter operator and with all expressions deterministic.
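The optimization can be sketched in spark-shell (a running SparkSession named spark is assumed): a deterministic predicate on a grouping expression is pushed below the Aggregate.

```scala
val q = spark.range(10)
  .groupBy("id").count
  .where($"id" > 5)  // deterministic predicate on a grouping expression
q.queryExecution.optimizedPlan
// the Filter operator appears below Aggregate in the optimized plan
```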

SaveAsHiveFile


SaveAsHiveFile Contract — DataWritingCommands That Write Query Result As Hive Files

SaveAsHiveFile is the extension of the DataWritingCommand contract for logical commands that can write a query result out as Hive files (using saveAsHiveFile).

Table 1. SaveAsHiveFiles
SaveAsHiveFile Description

InsertIntoHiveDirCommand

InsertIntoHiveTable

saveAsHiveFile Method

saveAsHiveFile…​FIXME

Note

saveAsHiveFile is used when:

DataWritingCommand


DataWritingCommand Contract — Logical Commands That Write Data

DataWritingCommand is an extension of the Command contract for logical commands that write query data to a relation when executed.

DataWritingCommand is resolved to a DataWritingCommandExec physical operator when BasicOperators execution planning strategy is executed (i.e. when a logical plan is planned into a physical plan).

Table 1. DataWritingCommand Contract
Property Description

outputColumnNames

The output column names of the analyzed input query plan

Used when DataWritingCommand is requested for the outputColumns

query

The logical query plan representing the data to write (i.e. whose result will be inserted into a relation)

Used when DataWritingCommand is requested for the child nodes and outputColumns.

run

Executes the command

Used when:

When requested for the child nodes, DataWritingCommand simply returns the logical query plan.

DataWritingCommand defines custom performance metrics.

Table 2. DataWritingCommand’s Performance Metrics
Key Name (in web UI) Description

numFiles

number of written files

numOutputBytes

bytes of written output

numOutputRows

number of output rows

numParts

number of dynamic part

The performance metrics are used when:

Table 3. DataWritingCommands (Direct Implementations)
DataWritingCommand Description

CreateDataSourceTableAsSelectCommand

CreateHiveTableAsSelectCommand

InsertIntoHadoopFsRelationCommand

SaveAsHiveFile

Contract for commands that write query result as Hive files

basicWriteJobStatsTracker Method

basicWriteJobStatsTracker simply creates and returns a new BasicWriteJobStatsTracker (with the given Hadoop Configuration and the metrics).

Note

basicWriteJobStatsTracker is used when:

Output Columns — outputColumns Method

outputColumns…​FIXME

Note

outputColumns is used when:
