Whole-Stage Java Code Generation (Whole-Stage CodeGen)
Whole-Stage Java Code Generation (aka Whole-Stage CodeGen) is a physical query optimization in Spark SQL that fuses multiple physical operators (as a subtree of plans that support code generation) together into a single Java function.
Whole-Stage Java Code Generation improves the execution performance of a query by collapsing a query tree into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.
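The effect of fusing operators can be illustrated outside Spark. The following is a minimal sketch, not Spark's generated code: it evaluates the same toy query (range, filter, project, sum) once through a chain of iterator-based operators, where every row crosses several interface calls and is boxed, and once as a single hand-fused loop that keeps intermediate values in local variables. All class and method names here (`FusionSketch`, `sumInterpreted`, `sumFused`) are hypothetical and chosen for illustration only.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.LongPredicate;
import java.util.function.LongUnaryOperator;

/** Toy comparison of operator-at-a-time iteration and a fused loop (hypothetical names). */
final class FusionSketch {

  /** Leaf operator: produces 0, 1, ..., n - 1. */
  static Iterator<Long> range(long n) {
    return new Iterator<Long>() {
      private long current = 0;
      public boolean hasNext() { return current < n; }
      public Long next() {
        if (current >= n) throw new NoSuchElementException();
        return current++;
      }
    };
  }

  /** Filter operator: pulls rows from its child through the Iterator interface. */
  static Iterator<Long> filter(Iterator<Long> child, LongPredicate predicate) {
    return new Iterator<Long>() {
      private Long buffered = null;
      public boolean hasNext() {
        // Keep pulling from the child until a row passes the predicate.
        while (buffered == null && child.hasNext()) {
          Long candidate = child.next();
          if (predicate.test(candidate)) buffered = candidate;
        }
        return buffered != null;
      }
      public Long next() {
        if (!hasNext()) throw new NoSuchElementException();
        Long result = buffered;
        buffered = null;
        return result;
      }
    };
  }

  /** Project operator: applies a function to every row of its child. */
  static Iterator<Long> project(Iterator<Long> child, LongUnaryOperator f) {
    return new Iterator<Long>() {
      public boolean hasNext() { return child.hasNext(); }
      public Long next() { return f.applyAsLong(child.next()); }
    };
  }

  /** Operator-at-a-time evaluation: one interface call per operator per row. */
  static long sumInterpreted(long n) {
    Iterator<Long> plan = project(filter(range(n), x -> x % 2 == 0), x -> x * 3);
    long sum = 0;
    while (plan.hasNext()) sum += plan.next();
    return sum;
  }

  /** The same query fused by hand into one loop over primitive locals. */
  static long sumFused(long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
      if (i % 2 == 0) sum += i * 3;
    }
    return sum;
  }

  public static void main(String[] args) {
    long n = 1_000_000L;
    System.out.println("interpreted = " + sumInterpreted(n));
    System.out.println("fused       = " + sumFused(n));
  }
}
```

Both methods compute the same result; the fused variant is the shape of code that whole-stage code generation aims to produce automatically for a supported subtree of operators.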
Note: Whole-Stage Code Generation is controlled by the spark.sql.codegen.wholeStage Spark internal property and is enabled by default. Use the SQLConf.wholeStageEnabled method to access the current value.
Note: Whole-Stage Code Generation is used by some modern massively parallel processing (MPP) databases to achieve better query execution performance.
Note: Janino is used to compile Java source code into a Java class at runtime.
Before a query is executed, the CollapseCodegenStages physical preparation rule finds the physical query plans that support codegen and collapses them together as WholeStageCodegen (possibly with InputAdapter in-between for physical operators with no support for Java code generation).
Note: CollapseCodegenStages is part of the sequence of physical preparation rules (QueryExecution.preparations) that are applied in order to the physical plan before execution.
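The collapsing step described above can be sketched as two mutually recursive tree rewrites. This is a simplified illustration under stated assumptions, not Spark's implementation: the `Node` class, the `supportsCodegen` flag, and the operator names in the example plan are hypothetical stand-ins, and the real rule handles additional cases that are omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of collapsing codegen-capable operators into stages (hypothetical names). */
final class CollapseSketch {

  /** A toy physical operator: a name, a codegen flag, and child operators. */
  static final class Node {
    final String name;
    final boolean supportsCodegen;
    final List<Node> children;

    Node(String name, boolean supportsCodegen, Node... children) {
      this.name = name;
      this.supportsCodegen = supportsCodegen;
      this.children = Arrays.asList(children);
    }

    Node withChildren(List<Node> newChildren) {
      return new Node(name, supportsCodegen, newChildren.toArray(new Node[0]));
    }

    @Override public String toString() {
      if (children.isEmpty()) return name;
      List<String> parts = new ArrayList<>();
      for (Node c : children) parts.add(c.toString());
      return name + "(" + String.join(", ", parts) + ")";
    }
  }

  /** Starts a new fused stage at the topmost operator that supports codegen. */
  static Node insertWholeStageCodegen(Node plan) {
    if (plan.supportsCodegen) {
      // The flag on the wrapper node is never consulted in this sketch.
      return new Node("WholeStageCodegen", false, insertInputAdapter(plan));
    }
    List<Node> rewritten = new ArrayList<>();
    for (Node c : plan.children) rewritten.add(insertWholeStageCodegen(c));
    return plan.withChildren(rewritten);
  }

  /** Inside a stage: cut the stage at the first operator without codegen support. */
  static Node insertInputAdapter(Node plan) {
    if (!plan.supportsCodegen) {
      // Below the adapter, look for further subtrees that can be fused.
      return new Node("InputAdapter", false, insertWholeStageCodegen(plan));
    }
    List<Node> rewritten = new ArrayList<>();
    for (Node c : plan.children) rewritten.add(insertInputAdapter(c));
    return plan.withChildren(rewritten);
  }

  public static void main(String[] args) {
    // Illustrative plan: Exchange is assumed not to support codegen.
    Node plan = new Node("Project", true,
        new Node("Filter", true,
            new Node("Exchange", false,
                new Node("Range", true))));
    System.out.println(insertWholeStageCodegen(plan));
    // prints WholeStageCodegen(Project(Filter(InputAdapter(Exchange(WholeStageCodegen(Range))))))
  }
}
```

In this toy plan, Project and Filter end up in one stage, the operator without codegen support is wrapped in an InputAdapter, and the subtree below it gets its own stage.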
There are the following code generation paths (as coined in this commit):
- Non-whole-stage-codegen path
- Whole-stage-codegen “produce” path
- Whole-stage-codegen “consume” path
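The two whole-stage paths can be pictured with a toy code generator. In this sketch, which is an illustration under assumptions rather than Spark's code, the "produce" path walks down the operator tree until a leaf emits the loop that generates rows, and the "consume" path walks back up, with each operator emitting the code that handles one row before handing it to its parent. The method names `produce`, `doProduce`, `consume`, and `doConsume` mirror the naming used by Spark's code generation support, but the classes, signatures, and generated text below are hypothetical.

```java
import java.util.function.UnaryOperator;

/** Toy produce/consume code generator that emits Java source text (hypothetical names). */
final class ProduceConsumeSketch {

  abstract static class Op {
    private Op parent;

    /** "Produce" path: a parent asks this operator for code that generates its rows. */
    final String produce(Op parent) {
      this.parent = parent;
      return doProduce();
    }

    /** "Consume" path: hand the variable holding the current row to the parent. */
    final String consume(String rowVar) {
      return parent.doConsume(rowVar);
    }

    abstract String doProduce();
    abstract String doConsume(String rowVar);
  }

  /** Leaf: emits the driving loop and lets its parent consume each row. */
  static final class Range extends Op {
    private final long n;
    Range(long n) { this.n = n; }
    String doProduce() {
      return "for (long i = 0; i < " + n + "; i++) {\n" + consume("i") + "}\n";
    }
    String doConsume(String rowVar) {
      throw new UnsupportedOperationException("a leaf operator has no child to consume from");
    }
  }

  /** Filter: produces nothing itself; wraps the parent's code in an if. */
  static final class Filter extends Op {
    private final Op child;
    private final UnaryOperator<String> condition;
    Filter(Op child, UnaryOperator<String> condition) {
      this.child = child;
      this.condition = condition;
    }
    String doProduce() { return child.produce(this); }
    String doConsume(String rowVar) {
      return "if (" + condition.apply(rowVar) + ") {\n" + consume(rowVar) + "}\n";
    }
  }

  /** Project: binds a new local variable and passes it upward. */
  static final class Project extends Op {
    private final Op child;
    private final UnaryOperator<String> expression;
    Project(Op child, UnaryOperator<String> expression) {
      this.child = child;
      this.expression = expression;
    }
    String doProduce() { return child.produce(this); }
    String doConsume(String rowVar) {
      return "long v = " + expression.apply(rowVar) + ";\n" + consume("v");
    }
  }

  /** Root of the stage: declares the accumulator and consumes the final rows. */
  static final class SumSink extends Op {
    private final Op child;
    SumSink(Op child) { this.child = child; }
    String doProduce() {
      return "long sum = 0;\n" + child.produce(this) + "return sum;\n";
    }
    String doConsume(String rowVar) { return "sum += " + rowVar + ";\n"; }
  }

  /** Generates the fused source for sum(project(filter(range(10)))). */
  static String generate() {
    Op plan = new SumSink(
        new Project(
            new Filter(new Range(10), row -> row + " % 2 == 0"),
            row -> row + " * 3"));
    return plan.doProduce();
  }

  public static void main(String[] args) {
    System.out.print(generate());
  }
}
```

The generated text is a single loop with the filter and projection inlined. The non-whole-stage path, by contrast, corresponds to operators that are executed without being fused into such a function.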
Tip: Review SPARK-12795 Whole stage codegen to learn about the work to support it.
BenchmarkWholeStageCodegen — Performance Benchmark
The BenchmarkWholeStageCodegen class provides a benchmark to measure whole-stage codegen performance.
You can execute it using the command:
    build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'
Note: You need to un-ignore the tests in BenchmarkWholeStageCodegen by replacing ignore with test.
    $ build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'
    ...
    Running benchmark: range/limit/sum
      Running case: range/limit/sum codegen=false
    22:55:23.028 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Running case: range/limit/sum codegen=true

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03 on Mac OS X 10.10.5
    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz

    range/limit/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    range/limit/sum codegen=false             376 /  433       1394.5           0.7       1.0X
    range/limit/sum codegen=true              332 /  388       1581.3           0.6       1.1X

    [info] - range/limit/sum (10 seconds, 74 milliseconds)