ColumnarBatch

ColumnarBatch is…​FIXME

ColumnarBatch is created when:

  • InMemoryTableScanExec is requested to createAndDecompressColumn

  • VectorizedParquetRecordReader is requested to initBatch

  • OrcColumnarBatchReader is requested to initBatch

  • ColumnVectorUtils is requested to toBatch

  • ArrowPythonRunner is requested for an Iterator[ColumnarBatch] (i.e. newReaderIterator)

  • ArrowConverters is requested for an ArrowRowIterator (i.e. fromPayloadIterator)

Note

ColumnarBatch is an Evolving contract: it is moving towards a stable API, but it is not one yet and can change from one feature release to the next.

In other words, using the contract is like treading on thin ice.

ColumnarBatch takes an array of ColumnVectors when created. ColumnarBatch initializes the internal MutableColumnarRow.

Table 1. ColumnarBatch’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

numRows

Number of rows

row

MutableColumnarRow
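
The layout described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Spark's implementation: IntColumn, RowView and MiniBatch are hypothetical stand-ins for ColumnVector, MutableColumnarRow and ColumnarBatch, and only int columns are modelled.

```java
// Hypothetical stand-in for a column vector that holds int values.
final class IntColumn {
    private final int[] values;
    IntColumn(int... values) { this.values = values; }
    int getInt(int rowId) { return values[rowId]; }
}

// Hypothetical stand-in for a mutable row: one object that is re-pointed at each row.
final class RowView {
    private final IntColumn[] columns;
    int rowId;
    RowView(IntColumn[] columns) { this.columns = columns; }
    int getInt(int ordinal) { return columns[ordinal].getInt(rowId); }
}

// Hypothetical stand-in for a columnar batch: an array of columns plus a row count.
final class MiniBatch {
    private final IntColumn[] columns;
    private final RowView row;   // created once, together with the batch
    private int numRows;

    MiniBatch(IntColumn[] columns) {
        this.columns = columns;
        this.row = new RowView(columns);
    }

    void setNumRows(int numRows) { this.numRows = numRows; }
    int numRows() { return numRows; }
    int numCols() { return columns.length; }

    // Returns the shared row view positioned at rowId (no per-row allocation).
    RowView getRow(int rowId) {
        row.rowId = rowId;
        return row;
    }

    // Sums one column over all rows by walking the shared row view.
    long sumColumn(int ordinal) {
        long sum = 0;
        for (int i = 0; i < numRows; i++) {
            sum += getRow(i).getInt(ordinal);
        }
        return sum;
    }
}
```

Because the row view is shared in this sketch, a caller that needs to keep a row beyond the next getRow call has to copy the values out first.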

BytesToBytesMap Append-Only Hash Map

BytesToBytesMap is…​FIXME

  • Low space overhead

  • Good memory locality, especially for scans

lookup Method

Caution
FIXME

safeLookup Method

safeLookup…​FIXME

Note
safeLookup is used when BytesToBytesMap is requested to lookup and when UnsafeHashedRelation looks up a single value or multiple values by key.

GenerateSafeProjection

GenerateSafeProjection is…​FIXME

Creating Projection — create Method

Note
create is part of CodeGenerator Contract to…​FIXME.

create…​FIXME

GeneratePredicate

GeneratePredicate is…​FIXME

Creating Predicate — create Method

Note
create is part of CodeGenerator Contract to…​FIXME.

create…​FIXME

GenerateOrdering

GenerateOrdering is…​FIXME

Creating BaseOrdering — create Method

Note
create is part of CodeGenerator Contract to…​FIXME.

create…​FIXME

genComparisons Method

genComparisons…​FIXME

Note
genComparisons is used when…​FIXME

GenerateColumnAccessor

GenerateColumnAccessor is a CodeGenerator for…​FIXME

Creating ColumnarIterator — create Method

Note
create is part of CodeGenerator Contract to…​FIXME.

create…​FIXME

CodeGenerator

CodeGenerator is a base class for generators of JVM bytecode for expression evaluation.

Table 1. CodeGenerator’s Internal Properties
Name Description

cache

Guava’s LoadingCache with at most 100 pairs of CodeAndComment and GeneratedClass.

genericMutableRowType
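
The idea behind the cache property (compile a given piece of generated source once, and keep only a bounded number of results) can be sketched with the JDK alone. This is a hedged illustration, not Spark's code: CompileCache and its loader function are hypothetical, and a LinkedHashMap with least-recently-used eviction stands in for Guava's LoadingCache, whose eviction policy is not identical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical bounded cache: key = source code, value = compiled result.
final class CompileCache<K, V> {
    private final int maxEntries;
    private final Function<K, V> loader;   // plays the role of the compile step
    private final LinkedHashMap<K, V> entries;
    int loads = 0;                          // how many times the loader actually ran

    CompileCache(int maxEntries, Function<K, V> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder = true keeps the least-recently-used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > CompileCache.this.maxEntries;
            }
        };
    }

    // Returns the cached value, loading (compiling) it on a miss.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() { return entries.size(); }
}
```

With new CompileCache<>(100, compileFn), where compileFn is whatever turns source text into a class, repeated requests for identical source text hit the cache instead of recompiling.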

Tip

Enable INFO or DEBUG logging level for org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator logger to see what happens inside.

Add a line to conf/log4j.properties that sets this logger to INFO or DEBUG.

Refer to Logging.

CodeGenerator Contract

Table 2. CodeGenerator Contract
Method Description

generate

Generates an evaluator for expression(s) that may (optionally) have expression(s) bound to a schema (i.e. a collection of Attribute).

Used in:

Compiling Java Source Code using Janino — doCompile Internal Method

Caution
FIXME

Finding or Compiling Java Source Code — compile Method

Caution
FIXME

create Method

Caution
FIXME
Note

create is used when:

Creating CodegenContext — newCodeGenContext Method

newCodeGenContext simply creates a new CodegenContext.

Note

newCodeGenContext is used when:

CodegenContext

CodegenContext is…​FIXME

CodegenContext takes no input parameters.

CodegenContext is created when:

CodegenContext stores expressions that don’t support codegen.

Table 1. CodegenContext’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

classFunctions

Mutable Scala Map with function names, their Java source code and a class name

New entries are added when CodegenContext is requested to addClass and addNewFunctionToClass

Used when CodegenContext is requested to declareAddedFunctions

equivalentExpressions

Expressions are added and then fetched as equivalent sets when CodegenContext is requested to subexpressionElimination (for generateExpressions with subexpression elimination enabled)

currentVars

The list of generated columns as input of current operator

INPUT_ROW

The variable name of the input row of the current operator

placeHolderToComments

Placeholders and their comments

Used when…​FIXME

references

References that are used to generate classes in the following code generators:

subExprEliminationExprs

SubExprEliminationStates by Expression

Used when…​FIXME

subexprFunctions

Names of the functions that…​FIXME

Generating Java Source Code For Code-Generated Evaluation of Multiple Expressions (With Optional Subexpression Elimination) — generateExpressions Method

(only with subexpression elimination enabled) generateExpressions does subexpressionElimination of the input expressions.

In the end, generateExpressions requests every expression to generate the Java source code for code-generated (non-interpreted) expression evaluation.

Note

generateExpressions is used when:

addReferenceObj Method

addReferenceObj…​FIXME

Note
addReferenceObj is used when…​FIXME

subexpressionEliminationForWholeStageCodegen Method

subexpressionEliminationForWholeStageCodegen…​FIXME

Note
subexpressionEliminationForWholeStageCodegen is used exclusively when HashAggregateExec is requested to generate Java source code for the whole-stage consume path (with grouping keys or not).

Adding Function to Generated Class — addNewFunction Method

addNewFunction…​FIXME

Note
addNewFunction is used when…​FIXME

subexpressionElimination Internal Method

subexpressionElimination requests EquivalentExpressions to addExprTree for every expression (in the input expressions).

subexpressionElimination requests EquivalentExpressions for the equivalent sets of expressions with at least two equivalent expressions (aka common expressions).

For every equivalent expression set, subexpressionElimination does the following:

  1. Takes the first expression and requests it to generate Java source code for the expression tree

  2. Registers the generated code as a new function (addNewFunction) and adds the result to subexprFunctions

  3. Creates a SubExprEliminationState and adds it, together with every common expression in the equivalent expression set, to subExprEliminationExprs

Note
subexpressionElimination is used exclusively when CodegenContext is requested to generateExpressions (with subexpression elimination enabled).
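
The steps of subexpressionElimination can be sketched in plain Java. This is a simplified model under stated assumptions, not Spark's code: Expr and SubexprEliminator are hypothetical, an expression's printed form doubles as both its "generated code" and its equivalence key, and children are visited only the first time an expression is seen so that a repeated tree is shared as a whole.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical expression node: a name plus child expressions.
final class Expr {
    final String name;
    final List<Expr> children;

    Expr(String name, Expr... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }

    // Printed form, used here as both the "generated code" and the equivalence key.
    String code() {
        if (children.isEmpty()) return name;
        List<String> parts = new ArrayList<>();
        for (Expr c : children) parts.add(c.code());
        return name + "(" + String.join(", ", parts) + ")";
    }
}

final class SubexprEliminator {
    // Equivalence key -> expressions sharing it (models equivalentExpressions).
    private final Map<String, List<Expr>> equivalent = new LinkedHashMap<>();
    // Generated function sources (models subexprFunctions).
    final List<String> functions = new ArrayList<>();
    // Expression key -> variable holding its precomputed value (models subExprEliminationExprs).
    final Map<String, String> states = new LinkedHashMap<>();

    private void addExprTree(Expr e) {
        if (e.children.isEmpty()) return;   // leaves are not worth sharing
        List<Expr> set = equivalent.computeIfAbsent(e.code(), k -> new ArrayList<>());
        set.add(e);
        if (set.size() == 1) {              // first sighting: descend into children
            for (Expr c : e.children) addExprTree(c);
        }
    }

    void eliminate(List<Expr> expressions) {
        for (Expr e : expressions) addExprTree(e);
        int id = 0;
        for (List<Expr> set : equivalent.values()) {
            if (set.size() < 2) continue;   // keep only common expressions
            Expr first = set.get(0);
            String var = "subExprValue" + id;
            functions.add("void evalExpr" + id + "() { " + var + " = " + first.code() + "; }");
            states.put(first.code(), var);
            id++;
        }
    }
}
```

For the two expressions add(mul(a, b), c) and sub(mul(a, b), d), the sketch emits one function for mul(a, b) and records that both occurrences should read its result from a shared variable.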

Adding Mutable State — addMutableState Method

addMutableState…​FIXME

Note
addMutableState is used when…​FIXME

Adding Immutable State (Unless Exists Already) — addImmutableStateIfNotExists Method

addImmutableStateIfNotExists…​FIXME

Note
addImmutableStateIfNotExists is used when…​FIXME

freshName Method

freshName…​FIXME

Note
freshName is used when…​FIXME

addNewFunctionToClass Internal Method

addNewFunctionToClass…​FIXME

Note
addNewFunctionToClass is used when…​FIXME

addClass Internal Method

addClass…​FIXME

Note
addClass is used when…​FIXME

declareAddedFunctions Method

declareAddedFunctions…​FIXME

Note
declareAddedFunctions is used when…​FIXME

declareMutableStates Method

declareMutableStates…​FIXME

Note
declareMutableStates is used when…​FIXME

initMutableStates Method

initMutableStates…​FIXME

Note
initMutableStates is used when…​FIXME

initPartition Method

initPartition…​FIXME

Note
initPartition is used when…​FIXME

emitExtraCode Method

emitExtraCode…​FIXME

Note
emitExtraCode is used when…​FIXME

addPartitionInitializationStatement Method

addPartitionInitializationStatement…​FIXME

Note
addPartitionInitializationStatement is used when…​FIXME

Whole-Stage Java Code Generation (Whole-Stage CodeGen)

Whole-Stage Java Code Generation (aka Whole-Stage CodeGen) is a physical query optimization in Spark SQL that fuses multiple physical operators (as a subtree of plans that support code generation) together into a single Java function.

Whole-Stage Java Code Generation improves the execution performance of a query by collapsing a query tree into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.
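
To make the difference concrete, here is a hand-written sketch of the same query (range, filter even values, multiply by 3, sum) executed two ways. It only illustrates the idea and is not code that Spark generates: RowIterator, RangeOp, FilterOp, ProjectOp and Pipelines are hypothetical names, and rows are reduced to single long values.

```java
import java.util.NoSuchElementException;
import java.util.function.LongPredicate;
import java.util.function.LongUnaryOperator;

// Operator-at-a-time interface: every row crosses one virtual call per operator.
interface RowIterator {
    boolean hasNext();
    long next();
}

final class RangeOp implements RowIterator {
    private long current;
    private final long end;
    RangeOp(long start, long end) { this.current = start; this.end = end; }
    public boolean hasNext() { return current < end; }
    public long next() { return current++; }
}

final class FilterOp implements RowIterator {
    private final RowIterator child;
    private final LongPredicate predicate;
    private long buffered;
    private boolean ready;
    FilterOp(RowIterator child, LongPredicate predicate) {
        this.child = child;
        this.predicate = predicate;
    }
    public boolean hasNext() {
        // Pull from the child until a row passes the predicate.
        while (!ready && child.hasNext()) {
            long v = child.next();
            if (predicate.test(v)) { buffered = v; ready = true; }
        }
        return ready;
    }
    public long next() {
        if (!hasNext()) throw new NoSuchElementException();
        ready = false;
        return buffered;
    }
}

final class ProjectOp implements RowIterator {
    private final RowIterator child;
    private final LongUnaryOperator fn;
    ProjectOp(RowIterator child, LongUnaryOperator fn) { this.child = child; this.fn = fn; }
    public boolean hasNext() { return child.hasNext(); }
    public long next() { return fn.applyAsLong(child.next()); }
}

final class Pipelines {
    // Non-fused path: a chain of operators pulling rows from each other.
    static long interpreted(long n) {
        RowIterator plan =
            new ProjectOp(new FilterOp(new RangeOp(0, n), v -> v % 2 == 0), v -> v * 3);
        long sum = 0;
        while (plan.hasNext()) sum += plan.next();
        return sum;
    }

    // Fused path: the same three operators collapsed into one loop,
    // with intermediate values kept in local variables.
    static long fused(long n) {
        long sum = 0;
        for (long v = 0; v < n; v++) {
            if (v % 2 != 0) continue;   // filter
            sum += v * 3;               // project + aggregate
        }
        return sum;
    }
}
```

Both methods return the same result; the fused version simply has no iterator objects or per-row interface calls between the operators.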

Note

Whole-Stage Code Generation is controlled by the spark.sql.codegen.wholeStage internal property.

Whole-Stage Code Generation is enabled by default.

Use SQLConf.wholeStageEnabled method to access the current value.

Note

Whole-Stage Code Generation is used by some modern massively parallel processing (MPP) databases to achieve a better query execution performance.

Note
Janino is used to compile Java source code into a Java class at runtime.

Before a query is executed, CollapseCodegenStages physical preparation rule finds the physical query plans that support codegen and collapses them together as WholeStageCodegen (possibly with InputAdapter in-between for physical operators with no support for Java code generation).

Note
CollapseCodegenStages is part of the sequence of physical preparation rules QueryExecution.preparations that will be applied in order to the physical plan before execution.
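
The collapsing step can be modelled with a small tree rewrite. The sketch below is an approximation under stated assumptions, not the rule's source: PlanNode and CollapseSketch are hypothetical, "NoCodegenOp" is a made-up operator without codegen support, and the two mutually recursive methods only mirror the description above (wrap a codegen-capable subtree in WholeStageCodegen, and put an InputAdapter in front of any operator that cannot take part).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical physical plan node.
final class PlanNode {
    final String name;
    final boolean supportsCodegen;
    final List<PlanNode> children;

    PlanNode(String name, boolean supportsCodegen, PlanNode... children) {
        this.name = name;
        this.supportsCodegen = supportsCodegen;
        this.children = Arrays.asList(children);
    }

    // Copy of this node with every child replaced by f(child).
    PlanNode mapChildren(UnaryOperator<PlanNode> f) {
        PlanNode[] mapped = new PlanNode[children.size()];
        for (int i = 0; i < mapped.length; i++) mapped[i] = f.apply(children.get(i));
        return new PlanNode(name, supportsCodegen, mapped);
    }

    String render() {
        if (children.isEmpty()) return name;
        List<String> parts = new ArrayList<>();
        for (PlanNode c : children) parts.add(c.render());
        return name + "(" + String.join(", ", parts) + ")";
    }
}

final class CollapseSketch {
    // Outside a codegen stage: start a new stage at the first operator that supports codegen.
    static PlanNode insertWholeStage(PlanNode plan) {
        if (plan.supportsCodegen) {
            return new PlanNode("WholeStageCodegen", false, insertInputAdapter(plan));
        }
        return plan.mapChildren(CollapseSketch::insertWholeStage);
    }

    // Inside a codegen stage: keep going while operators support codegen,
    // otherwise cut the stage with an InputAdapter.
    static PlanNode insertInputAdapter(PlanNode plan) {
        if (plan.supportsCodegen) {
            return plan.mapChildren(CollapseSketch::insertInputAdapter);
        }
        return new PlanNode("InputAdapter", false, insertWholeStage(plan));
    }
}
```

Calling CollapseSketch.insertWholeStage on a Project over a Filter over a NoCodegenOp over a Range yields two stages: Project and Filter in one, Range in another, with the InputAdapter marking the boundary between them.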

There are the following code generation paths (as coined in this commit):

  1. Non-whole-stage-codegen path

  2. Whole-stage-codegen “produce” path

  3. Whole-stage-codegen “consume” path

Tip
Review SPARK-12795 Whole stage codegen to learn about the work to support it.

BenchmarkWholeStageCodegen — Performance Benchmark

BenchmarkWholeStageCodegen class provides a benchmark to measure whole stage codegen performance.

You can execute it using the command:

Note
You need to un-ignore tests in BenchmarkWholeStageCodegen by replacing ignore with test.
