
BoundReference Leaf Expression — Reference to Value in Internal Binary Row

BoundReference is a leaf expression that evaluates to a value in an internal binary row at a specified position and of a given data type.

BoundReference takes the following when created:

Ordinal, i.e. the position
Data type of the value
nullable flag that controls whether the value can be null or not



import org.apache.spark.sql.catalyst.expressions.BoundReference
import org.apache.spark.sql.types.LongType
val boundRef = BoundReference(ordinal = 0, dataType = LongType, nullable = true)

scala> println(boundRef.toString)
input[0, bigint, true]

import org.apache.spark.sql.catalyst.InternalRow
val row = InternalRow(1L, "hello")

val value = boundRef.eval(row).asInstanceOf[Long]

You can also create a BoundReference using Catalyst DSL’s at method.



import org.apache.spark.sql.catalyst.dsl.expressions._
val boundRef = 'hello.string.at(4)
scala> println(boundRef)
input[4, string, true]


Evaluating Expression — `eval` Method



eval(input: InternalRow): Any


Note	`eval` is part of Expression Contract for the interpreted (non-code-generated) expression evaluation, i.e. evaluating a Catalyst expression to a JVM object for a given internal binary row.

eval returns the value at the ordinal position of the input internal binary row, read as the declared data type.

Internally, eval returns null if the value at the position is null.

Otherwise, eval uses the methods of InternalRow per the defined data type to access the value.

Table 1. eval’s DataType to InternalRow’s Methods Mapping (in execution order)
DataType	InternalRow’s Method
BooleanType	`getBoolean`
ByteType	`getByte`
ShortType	`getShort`
IntegerType or DateType	`getInt`
LongType or TimestampType	`getLong`
FloatType	`getFloat`
DoubleType	`getDouble`
StringType	`getUTF8String`
BinaryType	`getBinary`
CalendarIntervalType	`getInterval`
DecimalType	`getDecimal`
StructType	`getStruct`
ArrayType	`getArray`
MapType	`getMap`
others	`get(ordinal, dataType)`
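
The dispatch in the table above can be tried out in spark-shell. The following is a minimal sketch, assuming the Spark Catalyst classes are on the classpath; the row contents and the names longRef, stringRef and nullRef are illustrative only. String values in an InternalRow are stored as UTF8String, not java.lang.String, which is why the second value is converted first.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.BoundReference
import org.apache.spark.sql.types.{IntegerType, LongType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// An internal row with a long, a string (as UTF8String) and a null value
val row = InternalRow(1L, UTF8String.fromString("hello"), null)

// Illustrative references, one per position in the row
val longRef = BoundReference(ordinal = 0, dataType = LongType, nullable = false)
val stringRef = BoundReference(ordinal = 1, dataType = StringType, nullable = true)
val nullRef = BoundReference(ordinal = 2, dataType = IntegerType, nullable = true)

val longValue = longRef.eval(row)      // LongType, so the value is read with getLong
val stringValue = stringRef.eval(row)  // StringType, so the value is read with getUTF8String
val nullValue = nullRef.eval(row)      // null, since the value at position 2 is null
```

All three results are typed as Any, so a caller usually casts them, e.g. with asInstanceOf[Long].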

Generating Java Source Code (ExprCode) For Code-Generated Expression Evaluation — `doGenCode` Method



doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode


Note	`doGenCode` is part of Expression Contract to generate Java source code (ExprCode) for code-generated expression evaluation.

doGenCode…FIXME

`BindReferences.bindReference` Method



bindReference[A <: Expression](
  expression: A,
  input: AttributeSeq,
  allowFailures: Boolean = false): A


bindReference…FIXME

Note	`bindReference` is used when…FIXME
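
As an illustration of the signature above, the following sketch binds a named attribute to its position in an input schema. It assumes spark-shell with Catalyst on the classpath and relies on the implicit conversion from Seq[Attribute] to AttributeSeq in the expressions package object; the attribute names id and name are made up. Treat it as an observation of the behaviour, not a description of the internals.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.{LongType, StringType}

// Hypothetical input schema with two attributes
val id = AttributeReference("id", LongType)()
val name = AttributeReference("name", StringType)()
val input: Seq[Attribute] = Seq(id, name)

// The type ascription keeps the type parameter A at Expression,
// since the result is no longer an AttributeReference
val bound = BindReferences.bindReference(name: Expression, input)

println(bound)
```

With the schema above, name sits at position 1, so the result should be a BoundReference with ordinal 1 and the string data type.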

Attribute — Base of Leaf Named Expressions

Attribute is the base of leaf named expressions.

Note	QueryPlan uses Attributes to build the schema of the query (it represents).



package org.apache.spark.sql.catalyst.expressions

abstract class Attribute extends ... {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def withMetadata(newMetadata: Metadata): Attribute
  def withName(newName: String): Attribute
  def withNullability(newNullability: Boolean): Attribute
  def withQualifier(newQualifier: Option[String]): Attribute
  def newInstance(): Attribute
}


Table 1. Attribute Contract
Property	Description
`withMetadata`
`withName`
`withNullability`
`withQualifier`
`newInstance`

When requested for references, Attribute returns a reference to itself only.

As a NamedExpression, Attribute returns itself when requested for toAttribute.
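
The contract methods and the two properties above can be checked on AttributeReference. This is a minimal sketch for spark-shell, assuming Catalyst is on the classpath; the attribute name id and the value names are illustrative.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.LongType

// AttributeReference is a concrete Attribute
val id = AttributeReference("id", LongType)()

val refs = id.references       // AttributeSet with id only
val same = id.toAttribute      // the very same instance

// The with* methods of the contract return modified copies
val renamed = id.withName("key")
val required = id.withNullability(false)
val fresh = id.newInstance()   // same name and type, new ExprId
```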

Table 2. Attributes (Direct Implementations)
Attribute	Description
AttributeReference
PrettyAttribute
UnresolvedAttribute

As an optimization, Attribute is marked as null-intolerant, i.e. when given a null input it produces a null output.

Alias Unary Expression

Alias is a unary expression and a named expression.

Alias is created when…FIXME

Creating Alias Instance

Alias takes the following when created:

Child expression
Name

AttributeReference

AttributeReference is…FIXME

AggregateWindowFunction Contract — Declarative Window Aggregate Function Expressions

2012-08-02, admin, 3452 views

AggregateWindowFunction is the extension of the DeclarativeAggregate Contract for declarative aggregate function expressions that are also WindowFunction expressions.



package org.apache.spark.sql.catalyst.expressions

abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction {
  self: Product =>
  // No required properties (vals and methods) that have no implementation
}


AggregateWindowFunction uses IntegerType as the data type of the result of evaluating itself.

AggregateWindowFunction is nullable by default.

As a WindowFunction expression, AggregateWindowFunction uses a SpecifiedWindowFrame (with the RowFrame frame type, the UnboundedPreceding lower and the CurrentRow upper frame boundaries) as the frame.

AggregateWindowFunction is a DeclarativeAggregate expression that does not support merging (two aggregation buffers together) and throws an UnsupportedOperationException whenever requested for it.



Window Functions do not support merging.
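
The defaults above can be observed on a concrete implementation. The sketch below uses RowNumber (a RowNumberLike expression) and assumes spark-shell with Catalyst on the classpath; scala.util.Try is only used to capture the exception.

```scala
import scala.util.Try
import org.apache.spark.sql.catalyst.expressions.{CurrentRow, RowFrame, RowNumber, SpecifiedWindowFrame, UnboundedPreceding}
import org.apache.spark.sql.types.IntegerType

val rowNumber = RowNumber()

// Defaults inherited from AggregateWindowFunction
val resultType = rowNumber.dataType
val frame = rowNumber.frame

// Requesting the merge expressions is expected to fail
val merging = Try(rowNumber.mergeExpressions)
```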


Table 1. AggregateWindowFunctions (Direct Implementations)
AggregateWindowFunction	Description
RankLike
RowNumberLike
SizeBasedWindowFunction	Window functions that require the size of the current window for calculation

AggregateFunction Contract — Aggregate Function Expressions

2012-08-01, admin, 1889 views

AggregateFunction is the contract for Catalyst expressions that represent aggregate functions.

AggregateFunction is wrapped inside an AggregateExpression (using the toAggregateExpression method) when:

Analyzer resolves functions (for SQL mode)
…FIXME: Anywhere else?



import org.apache.spark.sql.functions.collect_list
scala> val fn = collect_list("gid")
fn: org.apache.spark.sql.Column = collect_list(gid)

import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
scala> val aggFn = fn.expr.asInstanceOf[AggregateExpression].aggregateFunction
aggFn: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction = collect_list('gid, 0, 0)

scala> println(aggFn.numberedTreeString)
00 collect_list('gid, 0, 0)
01 +- 'gid


Note	Aggregate functions are not foldable, i.e. FIXME

Table 1. AggregateFunction Top-Level Catalyst Expressions
Name	Behaviour	Examples
DeclarativeAggregate
ImperativeAggregate
TypedAggregateExpression

AggregateFunction Contract



abstract class AggregateFunction extends Expression {
  def aggBufferSchema: StructType
  def aggBufferAttributes: Seq[AttributeReference]
  def inputAggBufferAttributes: Seq[AttributeReference]
  def defaultResult: Option[Literal] = None
}


Table 2. AggregateFunction Contract
Method	Description
`aggBufferSchema`	Schema of an aggregation buffer to hold partial aggregate results. Used mostly in ScalaUDAF and AggregationIterator
`aggBufferAttributes`	AttributeReferences of an aggregation buffer to hold partial aggregate results. Used in: `AggregateExpression` (for `references`); `Expression`-based aggregate’s `bufferSchema` in DeclarativeAggregate; …
`inputAggBufferAttributes`
`defaultResult`	Defaults to `None`.

Creating AggregateExpression for AggregateFunction — `toAggregateExpression` Method



toAggregateExpression(): AggregateExpression  (1)
toAggregateExpression(isDistinct: Boolean): AggregateExpression


(1) Calls the other toAggregateExpression with isDistinct disabled (i.e. false)

toAggregateExpression creates an AggregateExpression for the current AggregateFunction with Complete aggregate mode.
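
Both variants can be exercised on a built-in aggregate function. The following is a minimal sketch for spark-shell, assuming Catalyst is on the classpath; Count over the literal 1 is used only as a convenient example and the value names are illustrative.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.aggregate.{Complete, Count}

// count(1) as an AggregateFunction
val countFn = Count(Literal(1))

// No-argument variant: isDistinct is disabled
val agg = countFn.toAggregateExpression()

// Explicit variant with isDistinct enabled
val distinctAgg = countFn.toAggregateExpression(isDistinct = true)

println(agg.mode)
```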

Note	`toAggregateExpression` is used in: the `functions` object’s `withAggregateFunction` block (to create a Column with an AggregateExpression for an `AggregateFunction`); FIXME

AggregateExpression

2012-07-31, admin, 1836 views

AggregateExpression — Unevaluable Expression Container for AggregateFunction

AggregateExpression is an unevaluable expression (i.e. with no support for eval and doGenCode methods) that acts as a container for an AggregateFunction.

AggregateExpression contains the following:

AggregateFunction
AggregateMode
isDistinct flag indicating whether this aggregation is distinct or not (e.g. whether SQL’s DISTINCT keyword was used for the aggregate function)
ExprId

AggregateExpression is created when:

Analyzer resolves AggregateFunctions (and creates an AggregateExpression with Complete aggregate mode for the functions)
UserDefinedAggregateFunction is created with isDistinct flag disabled or enabled
AggUtils is requested to planAggregateWithOneDistinct (and creates AggregateExpressions with Partial and Final aggregate modes for the functions)
Aggregator is requested for a TypedColumn (using Aggregator.toColumn)
AggregateFunction is wrapped in an AggregateExpression

Table 1. toString’s Prefixes per AggregateMode
Prefix	AggregateMode
`partial_`	`Partial`
`merge_`	`PartialMerge`
(empty)	`Final` or `Complete`
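
The prefixes in the table above can be printed for one aggregate function in all four modes. This is a sketch for spark-shell, assuming Catalyst is on the classpath; it builds the other modes by copying a Complete-mode expression, which is a shortcut for illustration and not how the planner creates them.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.aggregate.{Count, Final, Partial, PartialMerge}

val complete = Count(Literal(1)).toAggregateExpression()

// Same aggregate function in the other aggregate modes
val partial = complete.copy(mode = Partial)
val partialMerge = complete.copy(mode = PartialMerge)
val finalAgg = complete.copy(mode = Final)

// Prints one textual representation per mode
Seq(complete, partial, partialMerge, finalAgg).foreach(println)
```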

Table 2. AggregateExpression’s Properties
Name	Description
`canonicalized`	AggregateExpression whose AggregateFunction expression is `canonicalized` and whose `ExprId` is the special value `0`.
`children`	AggregateFunction expression (for which `AggregateExpression` was created).
`dataType`	DataType of AggregateFunction expression
`foldable`	Disabled (i.e. `false`)
`nullable`	Whether or not AggregateFunction expression is nullable.
`references`	`AttributeSet` with the following: (1) `references` of AggregateFunction when AggregateMode is `Partial` or `Complete`; (2) aggBufferAttributes of AggregateFunction when `PartialMerge` or `Final`
`resultAttribute`	Attribute that is: `AttributeReference` when AggregateFunction is itself resolved; `UnresolvedAttribute` otherwise
`sql`	Requests AggregateFunction to generate SQL output (with isDistinct flag).
`toString`	Prefix per AggregateMode followed by AggregateFunction’s `toAggString` (with isDistinct flag).

UnspecifiedDistribution

2012-07-30, admin, 1733 views

UnspecifiedDistribution is a Distribution that…FIXME

UnspecifiedDistribution specifies None for the required number of partitions.

Note	`None` for the required number of partitions means that any number of partitions can be used (possibly the value of the spark.sql.shuffle.partitions configuration property, `200` by default).

`createPartitioning` Method



createPartitioning(numPartitions: Int): Partitioning


Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning…FIXME

OrderedDistribution

2012-07-29, admin, 1613 views

OrderedDistribution is a Distribution that…FIXME

OrderedDistribution specifies None for the required number of partitions.

Note	`None` for the required number of partitions means that any number of partitions can be used (possibly the value of the spark.sql.shuffle.partitions configuration property, `200` by default).

OrderedDistribution is created when…FIXME

OrderedDistribution takes SortOrder expressions for ordering when created.

OrderedDistribution requires that the ordering expressions should not be empty (i.e. Nil).

`createPartitioning` Method



createPartitioning(numPartitions: Int): Partitioning


Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning…FIXME

HashClusteredDistribution

2012-07-28, admin, 1778 views

HashClusteredDistribution is a Distribution that creates a HashPartitioning for the hash expressions and a requested number of partitions.

HashClusteredDistribution specifies None for the required number of partitions.

Note	`None` for the required number of partitions means that any number of partitions can be used (possibly the value of the spark.sql.shuffle.partitions configuration property, `200` by default).

HashClusteredDistribution is created when the following physical operators are requested for a required child distribution:

CoGroupExec, ShuffledHashJoinExec, SortMergeJoinExec

HashClusteredDistribution takes hash expressions when created.

HashClusteredDistribution requires that the hash expressions should not be empty (i.e. Nil).

HashClusteredDistribution is used when:

EnsureRequirements is requested to add an ExchangeCoordinator for Adaptive Query Execution
HashPartitioning is requested whether it `satisfies` a required distribution

`createPartitioning` Method



createPartitioning(numPartitions: Int): Partitioning


Note	`createPartitioning` is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning creates a HashPartitioning for the hash expressions and the input numPartitions.
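
A minimal sketch of the behaviour described above, assuming spark-shell with a Spark version that still ships HashClusteredDistribution (the class this page describes). The attribute id and the partition count 8 are made up for the example.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.physical.{HashClusteredDistribution, HashPartitioning}
import org.apache.spark.sql.types.LongType

// Hypothetical hash expression (must not be empty)
val id = AttributeReference("id", LongType)()
val hashExprs: Seq[Expression] = Seq(id)

val distribution = HashClusteredDistribution(hashExprs)

// Expected to be a HashPartitioning over the same expressions and 8 partitions
val partitioning = distribution.createPartitioning(8)
```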
