
TextBasedFileFormat — Base for Text Splitable FileFormats

TextBasedFileFormat is an extension of the FileFormat contract for text-based formats that can be split.

Table 1. TextBasedFileFormats
  CSVFileFormat
  JsonFileFormat
  LibSVMFileFormat (used in Spark MLlib)
  TextFileFormat

isSplitable Method

Note
isSplitable is part of FileFormat Contract to know whether a given file is splitable or not.

isSplitable requests the CompressionCodecFactory to find the compression codec for the given file (as the input path) based on its filename suffix.

isSplitable returns true when the compression codec is not used (i.e. null) or is a Hadoop SplittableCompressionCodec (e.g. BZip2Codec).

If the CompressionCodecFactory is not defined, isSplitable creates a CompressionCodecFactory (with a Hadoop Configuration by requesting the SessionState for a new Hadoop Configuration with extra options).

Note
isSplitable uses the input sparkSession to access SessionState.
Note

The SplittableCompressionCodec interface is for compression codecs that are capable of compressing and decompressing a stream starting at any arbitrary position.

Such codecs are highly valuable, especially in the context of Hadoop, because an input compressed file can be split and hence can be worked on by multiple machines in parallel.

One such compression codec is BZip2Codec that provides output and input streams for bzip2 compression and decompression.
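
The logic boils down to a codec lookup, roughly as in the following sketch (the isSplitableLike helper is hypothetical and only mirrors the behaviour described above):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

  // Hypothetical helper mirroring the isSplitable behaviour described above
  def isSplitableLike(hadoopConf: Configuration, path: Path): Boolean = {
    val codec = new CompressionCodecFactory(hadoopConf).getCodec(path)  // codec found by filename suffix
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]     // uncompressed or splittable (e.g. bzip2)
  }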


ParquetFileFormat

ParquetFileFormat is a FileFormat for data sources in parquet format (i.e. registers itself to handle files in parquet format and converts them to Spark SQL rows).

Note
parquet is the default data source format in Spark SQL.
Note
Apache Parquet is a columnar storage format for the Apache Hadoop ecosystem with support for efficient storage and encoding of data.

ParquetFileFormat is splitable, i.e. FIXME

ParquetFileFormat supports batch when all of the following hold:

  1. spark.sql.parquet.enableVectorizedReader configuration property is enabled

  2. spark.sql.codegen.wholeStage internal configuration property is enabled

  3. The number of fields in the schema is at most spark.sql.codegen.maxFields internal configuration property

  4. All the fields in the output schema are of AtomicType
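
The first three conditions are driven by configuration properties that you can inspect or flip from spark-shell. The snippet below is only a sketch (spark is the SparkSession; verify defaults against your Spark version):

  spark.conf.set("spark.sql.parquet.enableVectorizedReader", true)  // vectorized parquet decoding
  spark.conf.set("spark.sql.codegen.wholeStage", true)              // whole-stage code generation
  spark.conf.get("spark.sql.codegen.maxFields")                     // upper bound on the number of schema fields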

Tip

Enable DEBUG logging level for org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger to see what happens inside.

Add the following line to conf/log4j.properties:
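
  log4j.logger.org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat=DEBUG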

Refer to Logging.

Preparing Write Job — prepareWrite Method

Note
prepareWrite is part of the FileFormat Contract to prepare a write job.

prepareWrite…​FIXME

inferSchema Method

Note
inferSchema is part of FileFormat Contract to…​FIXME.

inferSchema…​FIXME

vectorTypes Method

Note
vectorTypes is part of ColumnarBatchScan Contract to…​FIXME.

vectorTypes…​FIXME

Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

Note
buildReaderWithPartitionValues is part of FileFormat Contract to build a data reader with the partition column values appended.

buildReaderWithPartitionValues sets the configuration options in the input hadoopConf.

Table 1. Configuration Options
  parquet.read.support.class: org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport
  org.apache.spark.sql.parquet.row.requested_schema: JSON representation of requiredSchema
  org.apache.spark.sql.parquet.row.attributes: JSON representation of requiredSchema
  spark.sql.session.timeZone: the value of the spark.sql.session.timeZone configuration property
  spark.sql.parquet.binaryAsString: the value of the spark.sql.parquet.binaryAsString configuration property
  spark.sql.parquet.int96AsTimestamp: the value of the spark.sql.parquet.int96AsTimestamp configuration property

buildReaderWithPartitionValues requests ParquetWriteSupport to setSchema.

buildReaderWithPartitionValues tries to push filters down to create a Parquet FilterPredicate (aka pushed).

Note
Filter predicate push-down optimization for parquet data sources uses spark.sql.parquet.filterPushdown configuration property which is enabled by default.

With spark.sql.parquet.filterPushdown configuration property enabled, buildReaderWithPartitionValues takes the input Spark data source filters and converts them to Parquet filter predicates if possible (as described in the table). Otherwise, the Parquet filter predicate is not specified.

Note
buildReaderWithPartitionValues creates filter predicates for the following types: BooleanType, IntegerType, LongType, FloatType, DoubleType, StringType, BinaryType.
Table 2. Spark Data Source Filters to Parquet Filter Predicates Conversions (aka ParquetFilters.createFilter)
  IsNull -> FilterApi.eq
  IsNotNull -> FilterApi.notEq
  EqualTo -> FilterApi.eq
  Not(EqualTo) -> FilterApi.notEq
  EqualNullSafe -> FilterApi.eq
  Not(EqualNullSafe) -> FilterApi.notEq
  LessThan -> FilterApi.lt
  LessThanOrEqual -> FilterApi.ltEq
  GreaterThan -> FilterApi.gt
  GreaterThanOrEqual -> FilterApi.gtEq
  And -> FilterApi.and
  Or -> FilterApi.or
  Not -> FilterApi.not
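
To see the pushdown in action, inspect the physical plan of a filtered parquet query. The dataset path below is hypothetical and spark is the spark-shell SparkSession:

  import spark.implicits._

  spark.conf.set("spark.sql.parquet.filterPushdown", true)      // enabled by default
  val q = spark.read.parquet("/tmp/users").where($"age" > 21)   // hypothetical dataset
  q.explain()  // the FileScan parquet node lists PushedFilters, e.g. [IsNotNull(age), GreaterThan(age,21)]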

buildReaderWithPartitionValues broadcasts the input hadoopConf Hadoop Configuration.

In the end, buildReaderWithPartitionValues gives a function that takes a PartitionedFile and does the following:

  1. Creates a Hadoop FileSplit for the input PartitionedFile

  2. Creates a Parquet ParquetInputSplit for the Hadoop FileSplit created

  3. Gets the broadcast Hadoop Configuration

  4. Creates a flag that says whether to apply timezone conversions to int96 timestamps or not (aka convertTz)

  5. Creates a Hadoop TaskAttemptContextImpl (with the broadcast Hadoop Configuration and a Hadoop TaskAttemptID for a map task)

  6. Sets the Parquet FilterPredicate (only when the spark.sql.parquet.filterPushdown configuration property is enabled, which it is by default)

The function then branches off on whether Parquet vectorized reader is enabled or not.

Note
Parquet vectorized reader is enabled by default.

With Parquet vectorized reader enabled, the function does the following:

  1. Creates a VectorizedParquetRecordReader and a RecordReaderIterator

  2. Requests VectorizedParquetRecordReader to initialize (with the Parquet ParquetInputSplit and the Hadoop TaskAttemptContextImpl)

  3. Prints out the following DEBUG message to the logs:

  4. Requests VectorizedParquetRecordReader to initBatch

  5. (only with supportBatch enabled) Requests VectorizedParquetRecordReader to enableReturningBatches

  6. In the end, the function gives the RecordReaderIterator (over the VectorizedParquetRecordReader) as the Iterator[InternalRow]

With Parquet vectorized reader disabled, the function does the following:

  1. FIXME (since Parquet vectorized reader is enabled by default it’s of less interest currently)

mergeSchemasInParallel Method

mergeSchemasInParallel…​FIXME

Note
mergeSchemasInParallel is used when…​FIXME


OrcFileFormat

OrcFileFormat is a FileFormat that…​FIXME

buildReaderWithPartitionValues Method

Note
buildReaderWithPartitionValues is part of FileFormat Contract to build a data reader with partition column values appended.

buildReaderWithPartitionValues…​FIXME

inferSchema Method

Note
inferSchema is part of FileFormat Contract to…​FIXME.

inferSchema…​FIXME

Building Partitioned Data Reader — buildReader Method

Note
buildReader is part of FileFormat Contract to…​FIXME

buildReader…​FIXME


FileFormat — Data Sources to Read and Write Data In Files

FileFormat is the contract for data sources that read and write data stored in files.

Table 1. FileFormat Contract
Method Description

buildReader

Builds a Catalyst data reader, i.e. a function that reads a PartitionedFile file as InternalRows.

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work).

Used exclusively when FileFormat is requested to buildReaderWithPartitionValues

buildReaderWithPartitionValues

buildReaderWithPartitionValues builds a data reader with partition column values appended, i.e. a function that is used to read a single file in (as a PartitionedFile) as an Iterator of InternalRows (like buildReader) with the partition values appended.

Used exclusively when FileSourceScanExec physical operator is requested for the inputRDD (when requested for the inputRDDs and execution)

inferSchema

Infers (returns) the schema of the given files (as Hadoop’s FileStatuses) if supported. Otherwise, None should be returned.

Used when:

isSplitable

Controls whether the format (under the given path as Hadoop Path) can be split or not.

isSplitable is disabled (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested to create an RDD for non-bucketed reads (when requested for the inputRDD and neither the optional bucketing specification of the HadoopFsRelation is defined nor bucketing is enabled)

prepareWrite

Prepares a write job and returns an OutputWriterFactory

Used exclusively when FileFormatWriter is requested to write query result

supportBatch

Flag that says whether the format supports columnar batch (i.e. vectorized decoding) or not.

supportBatch is off (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the supportsBatch

vectorTypes

vectorTypes defines the concrete column vector class names for each column to be used in a columnar batch (when columnar batch support is enabled)

vectorTypes is undefined (None) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the vectorTypes
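
For orientation, here is an abridged Scala sketch of the contract. Parameter lists are simplified and the name FileFormatSketch is made up for illustration; consult org.apache.spark.sql.execution.datasources.FileFormat for the authoritative signatures:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileStatus, Path}
  import org.apache.hadoop.mapreduce.Job
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.execution.datasources.{OutputWriterFactory, PartitionedFile}
  import org.apache.spark.sql.internal.SQLConf
  import org.apache.spark.sql.sources.Filter
  import org.apache.spark.sql.types.StructType

  // Abridged, illustrative shape of the FileFormat contract
  trait FileFormatSketch {
    // Infers the schema of the given files, or None if not supported
    def inferSchema(
        sparkSession: SparkSession,
        options: Map[String, String],
        files: Seq[FileStatus]): Option[StructType]

    // Prepares a write job and returns the OutputWriterFactory used to create writers
    def prepareWrite(
        sparkSession: SparkSession,
        job: Job,
        options: Map[String, String],
        dataSchema: StructType): OutputWriterFactory

    // Disabled (false) by default: files cannot be split at arbitrary offsets
    def isSplitable(
        sparkSession: SparkSession,
        options: Map[String, String],
        path: Path): Boolean = false

    // Off (false) by default: no columnar (vectorized) decoding unless a format opts in
    def supportBatch(sparkSession: SparkSession, dataSchema: StructType): Boolean = false

    // Undefined (None) by default: concrete ColumnVector class names per column
    def vectorTypes(
        requiredSchema: StructType,
        partitionSchema: StructType,
        sqlConf: SQLConf): Option[Seq[String]] = None

    // Reads a single PartitionedFile as InternalRows; unsupported unless overridden
    def buildReader(
        sparkSession: SparkSession,
        dataSchema: StructType,
        partitionSchema: StructType,
        requiredSchema: StructType,
        filters: Seq[Filter],
        options: Map[String, String],
        hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] =
      throw new UnsupportedOperationException(s"buildReader is not supported for $this")
  }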

Table 2. FileFormats (Direct Implementations and Extensions)
  AvroFileFormat: Avro data source
  HiveFileFormat: Writes Hive tables
  OrcFileFormat: ORC data source
  ParquetFileFormat: Parquet data source
  TextBasedFileFormat: Base for text splitable FileFormats

Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.

Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended.

Note
buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.


Catalyst DSL — Implicit Conversions for Catalyst Data Structures

Catalyst DSL is a collection of Scala implicit conversions for constructing Catalyst data structures, i.e. expressions and logical plans, more easily.

The goal of Catalyst DSL is to make working with Spark SQL’s building blocks easier (e.g. for testing or Spark SQL internals exploration).

Table 1. Catalyst DSL’s Implicit Conversions
  ExpressionConversions: Creates expressions (Literals, UnresolvedAttribute and UnresolvedReference, ...)
  ImplicitOperators: Adds operators to expressions for building complex expressions
  plans: Creates logical plans

Catalyst DSL is part of org.apache.spark.sql.catalyst.dsl package object.

Important

Some implicit conversions from the Catalyst DSL interfere with the implicit conversions from SQLImplicits that are imported automatically in spark-shell (through spark.implicits._).

Use sbt console with Spark libraries defined (in build.sbt) instead.


You can also disable an implicit conversion using a trick described in How can an implicit be unimported from the Scala repl?
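
As a minimal sketch of what the DSL enables (run e.g. from sbt console with Spark SQL on the classpath, as suggested above):

  import org.apache.spark.sql.catalyst.dsl.expressions._
  import org.apache.spark.sql.catalyst.dsl.plans._
  import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

  // ExpressionConversions: Scala symbols become attributes, Scala literals become Literals
  val id = 'id.long                            // AttributeReference "id" of LongType
  val pred = 'id > 10 && 'name.string.isNull   // a complex expression built with ImplicitOperators

  // plans: build a small logical plan and resolve it with the SimpleAnalyzer
  val relation = LocalRelation('id.long, 'name.string)
  val plan = relation.where('id > 10).select('id)
  plan.analyze                                 // attribute references resolved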

ImplicitOperators Implicit Conversions

Operators for expressions, e.g. the in operator.

ExpressionConversions Implicit Conversions

ExpressionConversions implicit conversions add ImplicitOperators operators to Catalyst expressions.

Type Conversions to Literal Expressions

ExpressionConversions adds conversions of Scala native types (e.g. Boolean, Long, String, Date, Timestamp) and Spark SQL types (i.e. Decimal) to Literal expressions.

Converting Symbols to UnresolvedAttribute and AttributeReference Expressions

ExpressionConversions adds conversions of Scala’s Symbol to UnresolvedAttribute and AttributeReference expressions.

Converting $-Prefixed String Literals to UnresolvedAttribute Expressions

ExpressionConversions adds conversions of $"col name" to an UnresolvedAttribute expression.

Adding Aggregate And Non-Aggregate Functions to Expressions

ExpressionConversions adds the aggregate and non-aggregate functions to Catalyst expressions (e.g. sum, count, upper, star, callFunction, windowSpec, windowExpr)

Creating UnresolvedFunction Expressions — function and distinctFunction Methods

ExpressionConversions allows creating UnresolvedFunction expressions with function and distinctFunction operators.

Creating AttributeReference Expressions With nullability On or Off — notNull and canBeNull Methods

ExpressionConversions adds canBeNull and notNull operators to create an AttributeReference with nullability turned on or off, respectively.

Creating BoundReference — at Method

ExpressionConversions adds at method to AttributeReferences to create BoundReference expressions.

plans Implicit Conversions for Logical Plans

Creating UnresolvedHint Logical Operator — hint Method

plans adds hint method to create a UnresolvedHint logical operator.

Creating Join Logical Operator — join Method

join creates a Join logical operator.

Creating UnresolvedRelation Logical Operator — table Method

table creates a UnresolvedRelation logical operator.

DslLogicalPlan Implicit Class

DslLogicalPlan implicit class is part of plans implicit conversions with extension methods (of logical operators) to build entire logical plans.

Analyzing Logical Plan — analyze Method

analyze resolves attribute references.

analyze method is part of DslLogicalPlan implicit class.

Internally, analyze uses EliminateSubqueryAliases logical optimization and SimpleAnalyzer logical analyzer.


CommandUtils — Utilities for Table Statistics

CommandUtils is a helper class that logical commands, e.g. InsertInto*, AlterTable*Command, LoadDataCommand, and CBO’s Analyze*, use to manage table statistics.

CommandUtils defines the following utilities:

Tip

Enable INFO logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j.properties:
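
  log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=INFO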

Refer to Logging.

Updating Existing Table Statistics — updateTableStats Method

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).

updateTableStats requests SessionCatalog to alterTableStats with the current total size (when spark.sql.statistics.size.autoUpdate.enabled property is turned on) or empty statistics (that effectively removes the recorded statistics completely).

Important
updateTableStats uses spark.sql.statistics.size.autoUpdate.enabled property to auto-update table statistics and can be expensive (and slow down data change commands) if the total number of files of a table is very large.
Note
updateTableStats uses SparkSession to access the current SessionState that it then uses to access the session-scoped SessionCatalog.
Note
updateTableStats is used when InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand, AlterTableSetLocationCommand and LoadDataCommand commands are executed.
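
A hedged example of turning the auto-update on from spark-shell (the property is off by default, precisely because of the cost noted above; spark is the SparkSession):

  spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)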

Calculating Total Size of Table (with Partitions) — calculateTotalSize Method

calculateTotalSize calculates total file size for the entire input CatalogTable (when it has no partitions defined) or all its partitions (through the session-scoped SessionCatalog).

Note
calculateTotalSize uses the input SessionState to access the SessionCatalog.
Note

calculateTotalSize is used when:

Calculating Total File Size Under Path — calculateLocationSize Method

calculateLocationSize reads hive.exec.stagingdir configuration property for the staging directory (with .hive-staging being the default).

You should see the following INFO message in the logs:

calculateLocationSize calculates the sum of the length of all the files under the input locationUri.

Note
calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and FileStatus.getLen to access a file and the length of the file (in bytes), respectively.

In the end, you should see the following INFO message in the logs:

Note

calculateLocationSize is used when:
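
The calculation can be approximated with plain Hadoop FileSystem calls. The locationSize helper below is hypothetical and only illustrates the recursive sum of file lengths while skipping staging directories:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Hypothetical helper: total size (in bytes) of all files under a location,
  // skipping staging directories (e.g. names starting with ".hive-staging")
  def locationSize(hadoopConf: Configuration, location: Path, stagingDir: String = ".hive-staging"): Long = {
    val fs = location.getFileSystem(hadoopConf)
    def sizeOf(path: Path): Long = {
      val status = fs.getFileStatus(path)
      if (status.isDirectory) {
        fs.listStatus(path)
          .filterNot(_.getPath.getName.startsWith(stagingDir))
          .map(s => sizeOf(s.getPath))
          .sum
      } else {
        status.getLen  // file length in bytes
      }
    }
    if (fs.exists(location)) sizeOf(location) else 0L
  }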

Creating CatalogStatistics with Current Statistics — compareAndGetNewStats Method

compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and newRowCount only when they are different from the oldStats.

Note
compareAndGetNewStats is used when AnalyzePartitionCommand and AnalyzeTableCommand are executed.


EstimationUtils

EstimationUtils is…​FIXME

getOutputSize Method

getOutputSize…​FIXME

Note
getOutputSize is used when…​FIXME

nullColumnStat Method

nullColumnStat…​FIXME

Note
nullColumnStat is used exclusively when JoinEstimation is requested to estimateInnerOuterJoin for LeftOuter and RightOuter joins.

Checking Availability of Row Count Statistic — rowCountsExist Method

rowCountsExist is positive (i.e. true) when every logical plan (in the input plans) has estimated number of rows (aka row count) statistic computed.

Otherwise, rowCountsExist is negative (i.e. false).

Note
rowCountsExist uses LogicalPlanStats to access the estimated statistics and query hints of a logical plan.
Note

rowCountsExist is used when:


ColumnStat — Column Statistics

ColumnStat holds the statistics of a table column (as part of the table statistics in a metastore).

Table 1. Column Statistics
  distinctCount: Number of distinct values
  min: Minimum value
  max: Maximum value
  nullCount: Number of null values
  avgLen: Average length of the values
  maxLen: Maximum length of the values
  histogram: Histogram of values (as Histogram, empty by default)

ColumnStat is computed (and created from the result row) using ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command (that SparkSqlAstBuilder translates to AnalyzeColumnCommand logical command).

ColumnStat may optionally hold the histogram of values, which is empty by default. With the spark.sql.statistics.histogram.enabled configuration property turned on, the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.

Note
spark.sql.statistics.histogram.enabled is off by default.

You can inspect the column statistics using DESCRIBE EXTENDED SQL command.
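
For example (t1 and its id column are hypothetical; spark is the SparkSession):

  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")
  spark.sql("DESC EXTENDED t1 id").show(truncate = false)
  // info_name / info_value rows include min, max, num_nulls, distinct_count, avg_col_len, max_col_len, histogram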

ColumnStat is part of the statistics of a table.

ColumnStat is converted to properties (serialized) while persisting the table (statistics) to a metastore.

ColumnStat is re-created from properties (deserialized) when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).

ColumnStat is also created when JoinEstimation is requested to estimateInnerOuterJoin for Inner, Cross, LeftOuter, RightOuter and FullOuter joins.

Note
ColumnStat does not support minimum and maximum metrics for binary (i.e. Array[Byte]) and string types.

Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method

toExternalString…​FIXME

Note
toExternalString is used exclusively when ColumnStat is requested for statistic properties.

supportsHistogram Method

supportsHistogram…​FIXME

Note
supportsHistogram is used when…​FIXME

Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method

toMap converts ColumnStat to the properties.

Table 2. ColumnStat.toMap’s Properties
  version: 1
  distinctCount: distinctCount
  nullCount: nullCount
  avgLen: avgLen
  maxLen: maxLen
  min: External/Java representation of min
  max: External/Java representation of max
  histogram: Serialized version of Histogram (using HistogramSerializer.serialize)

Note
toMap adds min, max, histogram entries only if they are available.
Note
Interestingly, colName and dataType input parameters bring no value to toMap itself, but merely allow for a more user-friendly error reporting when converting min and max column statistics.
Note
toMap is used exclusively when HiveExternalCatalog is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).

Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method

fromMap creates a ColumnStat by fetching properties of every column statistic from the input map.

fromMap returns None when recovering column statistics fails for whatever reason.

Note
Interestingly, table input parameter brings no value to fromMap itself, but merely allows for a more user-friendly error reporting when parsing column statistics fails.
Note
fromMap is used exclusively when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).

Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method

rowToColumnStat creates a ColumnStat from the input row and the following positions:

If the 6th field is not empty, rowToColumnStat uses it to create a histogram.

Note
rowToColumnStat is used exclusively when AnalyzeColumnCommand is executed (to compute the statistics for specified columns).

statExprs Method

statExprs…​FIXME

Note
statExprs is used when…​FIXME


CatalogStatistics — Table Statistics From External Catalog (Metastore)


CatalogStatistics are table statistics that are stored in an external catalog (aka metastore):

  • Physical total size (in bytes)

  • Estimated number of rows (aka row count)

  • Column statistics (i.e. column names and their statistics)

Note

CatalogStatistics is a “subset” of the statistics in Statistics (as there are no concepts of attributes and broadcast hint in metastore).

CatalogStatistics are often stored in a Hive metastore and are referred to as Hive statistics, while Statistics are the Spark statistics.

CatalogStatistics can be converted to Spark statistics using toPlanStats method.

CatalogStatistics is created when:

CatalogStatistics has a text representation.

Converting Metastore Statistics to Spark Statistics — toPlanStats Method

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.

Note
Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true, and is disabled by default.

Otherwise, when cost-based optimization is disabled, toPlanStats creates a Statistics with just the mandatory sizeInBytes.

Caution
FIXME Why does toPlanStats compute sizeInBytes differently per CBO?
Note

toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

Note
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.


Cost-Based Optimization (CBO) of Logical Query Plan

Cost-Based Optimization (aka Cost-Based Query Optimization or CBO Optimizer) is an optimization technique in Spark SQL that uses table statistics to determine the most efficient query execution plan of a structured query (given the logical query plan).

Cost-based optimization is disabled by default. Spark SQL uses spark.sql.cbo.enabled configuration property to control whether the CBO should be enabled and used for query optimization or not.

Cost-Based Optimization uses logical optimization rules (e.g. CostBasedJoinReorder) to optimize the logical plan of a structured query based on statistics.

You first use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute table statistics. Use DESCRIBE EXTENDED SQL command to inspect the statistics.

Logical operators have statistics support that is used for query planning.

There is also support for equi-height column histograms.
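
A minimal end-to-end sketch from spark-shell (t1 is a hypothetical table with a column id; spark is the SparkSession):

  spark.conf.set("spark.sql.cbo.enabled", true)                    // CBO is off by default
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")                 // table-level statistics
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")  // column-level statistics
  spark.sql("DESC EXTENDED t1").show(truncate = false)             // look for the Statistics row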

Table Statistics

The table statistics can be computed for tables, partitions and columns and are as follows:

  1. Total size (in bytes) of a table or table partitions

  2. Row count of a table or table partitions

  3. Column statistics, i.e. min, max, num_nulls, distinct_count, avg_col_len, max_col_len, histogram

spark.sql.cbo.enabled Spark SQL Configuration Property

Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true.

Note
spark.sql.cbo.enabled configuration property is turned off, i.e. false, by default.
Tip
Use SQLConf.cboEnabled to access the current value of spark.sql.cbo.enabled property.

Note
CBO is disabled explicitly in Spark Structured Streaming.

ANALYZE TABLE COMPUTE STATISTICS SQL Command

Cost-Based Optimization uses the statistics stored in a metastore (aka external catalog) using ANALYZE TABLE SQL command.

Depending on the variant, ANALYZE TABLE computes different statistics, i.e. of a table, partitions or columns.

  1. ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause

  2. ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause)

  3. ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification)

Tip

Use spark.sql.statistics.histogram.enabled configuration property to enable column (equi-height) histograms that can provide better estimation accuracy (at the cost of an extra table scan).

spark.sql.statistics.histogram.enabled is off by default.

Note

ANALYZE TABLE with PARTITION specification and FOR COLUMNS clause is incorrect.

In such a case, SparkSqlAstBuilder reports a WARN message to the logs and simply ignores the partition specification.

When executed, the above ANALYZE TABLE variants are translated to the AnalyzeTableCommand, AnalyzePartitionCommand and AnalyzeColumnCommand logical commands (in a logical query plan), respectively.
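
For instance (t1 is a hypothetical table partitioned by p, with a column id; spark is the SparkSession):

  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")                    // variant 1: AnalyzeTableCommand
  spark.sql("ANALYZE TABLE t1 PARTITION (p = 1) COMPUTE STATISTICS")  // variant 2: AnalyzePartitionCommand
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")     // variant 3: AnalyzeColumnCommand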

DESCRIBE EXTENDED SQL Command

You can view the statistics of a table, partitions or a column (stored in a metastore) using DESCRIBE EXTENDED SQL command.

Table-level statistics are in Statistics row while partition-level statistics are in Partition Statistics row.

Tip
Use DESC EXTENDED tableName for table-level statistics and DESC EXTENDED tableName PARTITION (p1, p2, …​) for partition-level statistics only.

You can view the statistics of a single column using DESC EXTENDED tableName columnName that are in a Dataset with two columns, i.e. info_name and info_value.
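
For example (same hypothetical table t1 partitioned by p, with a column id):

  spark.sql("DESC EXTENDED t1").show(truncate = false)                    // table-level Statistics row
  spark.sql("DESC EXTENDED t1 PARTITION (p = 1)").show(truncate = false)  // Partition Statistics row
  spark.sql("DESC EXTENDED t1 id").show(truncate = false)                 // info_name / info_value for column id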

Cost-Based Optimizations

The Spark Optimizer uses heuristics (rules) that are applied to a logical query plan for cost-based optimization.

Among the optimization rules are the following:

  1. CostBasedJoinReorder logical optimization rule for join reordering with 2 or more consecutive inner or cross joins (possibly separated by Project operators) when spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled configuration properties are both enabled.

Logical Commands for Altering Table Statistics

The following are the logical commands that alter table statistics in a metastore (aka external catalog):

  1. AnalyzeTableCommand

  2. AnalyzeColumnCommand

  3. AlterTableAddPartitionCommand

  4. AlterTableDropPartitionCommand

  5. AlterTableSetLocationCommand

  6. TruncateTableCommand

  7. InsertIntoHiveTable

  8. InsertIntoHadoopFsRelationCommand

  9. LoadDataCommand

EXPLAIN COST SQL Command

Caution
FIXME See LogicalPlanStats

LogicalPlanStats — Statistics Estimates of Logical Operator

LogicalPlanStats adds statistics support to logical operators and is used for query planning (with or without cost-based optimization, e.g. CostBasedJoinReorder or JoinSelection, respectively).

Equi-Height Histograms for Columns

Equi-height histograms are effective in cardinality estimation for skewed data distributions and are more accurate than the basic column statistics (min, max, ndv, etc.) in such cases.

In an equi-height histogram, all bins (intervals) have the same height (frequency). The default number of bins is 254.

Spark SQL generates an equi-height histogram in two steps:

  1. Use the ApproximatePercentile aggregate function to get the percentiles p(0), p(1/n), p(2/n) …​ p((n-1)/n), p(1), which become the end points of the bin intervals.

  2. Construct the range values of the bins, e.g. [p(0), p(1/n)], [p(1/n), p(2/n)] …​ [p((n-1)/n), p(1)], and use the ApproxCountDistinctForIntervals aggregate function to count the number of distinct values (ndv) in each bin. Each bin is of the form (lowerBound, higherBound, ndv).

Note that this method takes two table scans. Other algorithms that need only one table scan may be provided in the future.

Spark SQL uses column statistics that may optionally hold the histogram of values (which is empty by default). With the spark.sql.statistics.histogram.enabled configuration property turned on, the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.

Note
spark.sql.statistics.histogram.enabled is off by default.

You can inspect the column statistics using DESCRIBE EXTENDED SQL command.
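
A sketch of turning histograms on (the numBins property is listed here as an assumption; verify the property name and its default against your Spark version):

  spark.conf.set("spark.sql.statistics.histogram.enabled", true)   // off by default
  spark.conf.get("spark.sql.statistics.histogram.numBins")         // assumed property; default 254 bins
  spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")  // now also computes equi-height histograms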
