WholeStageCodegenExec

WholeStageCodegenExec Unary Physical Operator for Java Code Generation

WholeStageCodegenExec is a unary physical operator and one of the two physical operators that lay the foundation for Whole-Stage Java Code Generation for a Codegened Execution Pipeline of a structured query.

Note
InputAdapter is the other physical operator for Codegened Execution Pipeline of a structured query.

WholeStageCodegenExec itself supports Java code generation and so, when executed, triggers code generation for the entire child physical plan subtree of a structured query.

Tip

Consider using Debugging Query Execution facility to deep dive into the whole-stage code generation.

Tip

Set the spark.sql.codegen.comments configuration property to true to enable comments in generated code.

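WholeStageCodegenExec is created when the CollapseCodegenStages physical query optimization rule is executed (with the spark.sql.codegen.wholeStage configuration property enabled) and when FileSourceScanExec leaf physical operator is executed (with the supportsBatch flag enabled).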

Note
spark.sql.codegen.wholeStage property is enabled by default.

WholeStageCodegenExec takes a single child physical operator (a physical subquery tree) and codegen stage ID when created.

Note
WholeStageCodegenExec requires that the single child physical operator supports Java code generation.

WholeStageCodegenExec marks the child physical operator with * (star) prefix and per-query codegen stage ID (in round brackets) in the text representation of a physical plan tree.
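To make the text representation concrete, here is a minimal sketch of the labelling scheme described above. StagePrefixSketch and withStagePrefix are hypothetical names introduced for illustration only; they are not part of Spark, and the operator descriptions are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Spark class: mimics how operators inside a
// whole-stage codegen stage are labelled in a plan's text representation.
final class StagePrefixSketch {

  // Prefixes an operator description with "*(<codegen stage id>) ".
  static String withStagePrefix(int codegenStageId, String operator) {
    return "*(" + codegenStageId + ") " + operator;
  }

  public static void main(String[] args) {
    // Made-up operator descriptions that share codegen stage 1.
    List<String> operators = Arrays.asList("Filter (id > 5)", "Range (0, 10)");
    for (String op : operators) {
      System.out.println(withStagePrefix(1, op));
    }
  }
}
```

Operators that print without the star prefix are not part of any codegen stage.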

Note
As WholeStageCodegenExec is created by the CollapseCodegenStages physical query optimization rule, it only appears in the executedPlan phase of a query execution (which you can recognize by the * (star) prefix in the plan output).

When executed, WholeStageCodegenExec reports the pipelineTime performance metric.

Table 1. WholeStageCodegenExec’s Performance Metrics

Key | Name (in web UI) | Description
pipelineTime | (empty) | Time of how long the whole-stage codegened pipeline has been running (i.e. the elapsed time since the underlying BufferedRowIterator had been created and the internal rows were all consumed).
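
The way such a metric can be measured is sketched below: the clock starts when the iterator is created and the elapsed time is recorded the first time the iterator reports that it has no more rows. PipelineTimeSketch is a hypothetical wrapper over a plain java.util.Iterator, written under that assumption; it is not how Spark's classes are named or structured.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical wrapper, not a Spark class: records how long an iterator was
// alive, from construction until the first time it reports exhaustion.
final class PipelineTimeSketch<T> implements Iterator<T> {
  private final Iterator<T> underlying;
  private final long startTimeNs = System.nanoTime(); // clock starts at creation
  private long pipelineTimeMs = -1;                    // -1 means "not finished yet"

  PipelineTimeSketch(Iterator<T> underlying) {
    this.underlying = underlying;
  }

  @Override
  public boolean hasNext() {
    boolean more = underlying.hasNext();
    if (!more && pipelineTimeMs < 0) {
      // All rows consumed: record the elapsed time exactly once.
      pipelineTimeMs = (System.nanoTime() - startTimeNs) / 1_000_000L;
    }
    return more;
  }

  @Override
  public T next() {
    return underlying.next();
  }

  long pipelineTimeMs() {
    return pipelineTimeMs;
  }

  public static void main(String[] args) {
    PipelineTimeSketch<Integer> rows =
        new PipelineTimeSketch<>(Arrays.asList(1, 2, 3).iterator());
    while (rows.hasNext()) {
      rows.next();
    }
    System.out.println("pipelineTime (ms): " + rows.pipelineTimeMs());
  }
}
```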

Figure 1. WholeStageCodegenExec in web UI (Details for Query)
Tip
Use explain operator to know the physical plan of a query and find out whether or not WholeStageCodegen is in use.

Note
Physical plans that support code generation extend CodegenSupport.
Tip

Enable DEBUG logging level for org.apache.spark.sql.execution.WholeStageCodegenExec logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.WholeStageCodegenExec=DEBUG

Refer to Logging.

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute generates the Java source code for the child physical plan subtree first and uses CodeGenerator to compile it right afterwards.

If compilation goes well, doExecute branches off per the number of input RDDs.

Note
doExecute only supports up to two input RDDs.
Caution
FIXME Finish the “success” path

If the maximum bytecode size of a compiled method is greater than spark.sql.codegen.hugeMethodLimit (which defaults to 65535), doExecute prints out an INFO message to the logs (that the generated code is too long and whole-stage code generation is disabled for the plan).

In that case, doExecute requests the child physical operator to execute (that triggers physical query planning and generates an RDD[InternalRow]) and returns it.

Note
doExecute skips requesting the child physical operator to execute for FileSourceScanExec leaf physical operator with supportsBatch flag enabled (as FileSourceScanExec itself uses a WholeStageCodegenExec operator when executed with supportsBatch enabled).

If compilation fails and the spark.sql.codegen.fallback configuration property is enabled, doExecute prints out a WARN message to the logs, requests the child physical operator to execute and returns the result.
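
The branching described in this section (compile, fall back to the child when compilation fails, fall back again when a compiled method is too large) can be modelled in a few lines. The sketch below is an assumption-laden simplification: DoExecuteFlowSketch, its Compiler interface and the log texts are invented for illustration, lists of strings stand in for RDD[InternalRow], and the FileSourceScanExec special case is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of the control flow only; none of these names are Spark APIs.
final class DoExecuteFlowSketch {

  // Stand-in for a compiler: returns the largest method bytecode size,
  // or throws when the generated source does not compile.
  interface Compiler {
    int compile(String source) throws Exception;
  }

  static List<String> execute(
      String source,
      Compiler compiler,
      boolean fallbackEnabled,   // models spark.sql.codegen.fallback
      int hugeMethodLimit,       // models spark.sql.codegen.hugeMethodLimit
      Supplier<List<String>> childExecute,
      Supplier<List<String>> codegenExecute) {
    int maxCodeSize;
    try {
      maxCodeSize = compiler.compile(source);
    } catch (Exception e) {
      if (!fallbackEnabled) {
        throw new IllegalStateException("compilation failed and fallback is disabled", e);
      }
      // Illustrative text, not Spark's actual log message.
      System.err.println("WARN (illustrative): codegen failed, executing the child directly");
      return childExecute.get();
    }
    if (maxCodeSize > hugeMethodLimit) {
      // Illustrative text, not Spark's actual log message.
      System.err.println("INFO (illustrative): generated method too large, executing the child directly");
      return childExecute.get();
    }
    return codegenExecute.get();
  }

  public static void main(String[] args) {
    List<String> child = Arrays.asList("rows from child");
    List<String> generated = Arrays.asList("rows from generated code");
    System.out.println(execute("src", s -> 100, true, 65535, () -> child, () -> generated));
    System.out.println(execute("src", s -> 70000, true, 65535, () -> child, () -> generated));
  }
}
```

In both fallback branches the query still produces its rows, only without the fused generated code.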

Generating Java Source Code for Child Physical Plan Subtree — doCodeGen Method

doCodeGen creates a new CodegenContext and requests the single child physical operator to generate the Java source code for the produce code path (with the new CodegenContext and the WholeStageCodegenExec physical operator itself).

doCodeGen adds the new function under the name of processNext.

doCodeGen then generates the final Java source code: a class (named per generatedClassName) that extends BufferedRowIterator.

Note
doCodeGen requires that the single child physical operator supports Java code generation.

doCodeGen cleans up the generated code (using CodeFormatter to stripExtraNewLines, stripOverlappingComments).
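
As an illustration of what a cleanup step of this kind can look like, the sketch below collapses runs of blank lines into a single blank line. This is only one plausible reading of stripExtraNewLines; CodeCleanupSketch is a hypothetical class, and the real CodeFormatter may apply additional rules that are not modelled here.

```java
// Hypothetical approximation of a blank-line cleanup; not Spark's CodeFormatter.
final class CodeCleanupSketch {

  // Keeps every non-blank line and at most one blank line in a row.
  static String stripExtraNewLines(String input) {
    StringBuilder code = new StringBuilder();
    boolean lastLineBlank = false;
    for (String line : input.split("\n")) {
      boolean blank = line.trim().isEmpty();
      if (!(blank && lastLineBlank)) {
        code.append(line).append("\n");
      }
      lastLineBlank = blank;
    }
    return code.toString();
  }

  public static void main(String[] args) {
    String generated = "int a = 1;\n\n\n\nint b = 2;\n";
    System.out.print(stripExtraNewLines(generated));
  }
}
```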

doCodeGen prints out the generated code as a DEBUG message to the logs.

In the end, doCodeGen returns the CodegenContext and the Java source code (as a CodeAndComment).
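
To give a feel for the kind of class that comes out of this process, here is a hand-written analogue of a generated iterator. It is a sketch under stated assumptions: SketchBufferedRowIterator is a simplified stand-in for BufferedRowIterator (rows are plain Long values rather than internal rows), and GeneratedIteratorSketch fuses a made-up filter and projection into a single processNext loop. Real generated code is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Simplified stand-in for BufferedRowIterator (an assumption, not Spark's class):
// rows are plain Long values, buffered in a list until the caller reads them.
abstract class SketchBufferedRowIterator {
  private final LinkedList<Long> currentRows = new LinkedList<>();

  // Asks the subclass for more rows only when the buffer is empty.
  public boolean hasNext() {
    if (currentRows.isEmpty()) {
      processNext();
    }
    return !currentRows.isEmpty();
  }

  public Long next() {
    return currentRows.remove();
  }

  // Called by the "generated" code to emit an output row.
  protected void append(Long row) {
    currentRows.add(row);
  }

  // True once at least one row is buffered, so processNext can yield.
  protected boolean shouldStop() {
    return !currentRows.isEmpty();
  }

  public abstract void init(int index, List<Iterator<Long>> inputs);

  protected abstract void processNext();
}

// Hand-written analogue of a generated class: a filter (id > 5) and a
// projection (id * 2) fused into one loop inside processNext.
final class GeneratedIteratorSketch extends SketchBufferedRowIterator {
  private Iterator<Long> input;

  @Override
  public void init(int index, List<Iterator<Long>> inputs) {
    this.input = inputs.get(0);
  }

  @Override
  protected void processNext() {
    while (input.hasNext()) {
      long id = input.next();
      if (id <= 5) continue;      // filter
      append(id * 2);             // projection, then hand the row to the buffer
      if (shouldStop()) return;   // yield so the caller can read the row
    }
  }
}

final class GeneratedPipelineSketch {
  static List<Long> run(List<Long> ids) {
    GeneratedIteratorSketch it = new GeneratedIteratorSketch();
    it.init(0, Arrays.asList(ids.iterator()));
    List<Long> out = new ArrayList<>();
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run(Arrays.asList(4L, 5L, 6L, 7L))); // prints [12, 14]
  }
}
```

The point of the fused loop is that no intermediate row is materialized between the filter and the projection.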

Note
doCodeGen is used when WholeStageCodegenExec is executed (to generate the Java source code before it is compiled).

Generating Java Source Code for Consume Path in Whole-Stage Code Generation — doConsume Method

Note
doConsume is part of CodegenSupport Contract to generate the Java source code for consume path in Whole-Stage Code Generation.

doConsume generates Java source code that:

  1. Takes (from the input row) the code to evaluate a Catalyst expression on an input InternalRow

  2. Takes (from the input row) the term for a value of the result of the evaluation

    1. Adds .copy() to the term if needCopyResult is turned on

  3. Wraps the term inside append() code block
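
The three steps above amount to a small piece of string templating. The sketch below models them with plain strings: rowCode and rowTerm stand for the code and the result term of the input row, and DoConsumeSketch is a hypothetical class, not the Spark method itself.

```java
// Hypothetical string-templating model of the consume path; not a Spark API.
final class DoConsumeSketch {

  // rowCode: code that evaluates the expression on the input row
  // rowTerm: name of the variable that holds the result
  // needCopyResult: whether the result has to be copied before it is buffered
  static String doConsume(String rowCode, String rowTerm, boolean needCopyResult) {
    String doCopy = needCopyResult ? ".copy()" : "";
    return (rowCode + "\n" + "append(" + rowTerm + doCopy + ");").trim();
  }

  public static void main(String[] args) {
    // The row code and the term are made-up examples.
    System.out.println(doConsume("UnsafeRow result = writer.getRow();", "result", true));
  }
}
```

With needCopyResult turned off, the same call would emit append(result); without the copy.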

Generating Class Name — generatedClassName Method

generatedClassName gives a class name per spark.sql.codegen.useIdInClassName configuration property:

  • GeneratedIteratorForCodegenStage with the codegen stage ID when enabled (true)

  • GeneratedIterator when disabled (false)
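
The naming rule can be written down directly. ClassNameSketch is a hypothetical class used for illustration, and the boolean parameter stands for the value of spark.sql.codegen.useIdInClassName.

```java
// Hypothetical model of the naming rule; not a Spark API.
final class ClassNameSketch {

  // useIdInClassName models spark.sql.codegen.useIdInClassName.
  static String generatedClassName(boolean useIdInClassName, int codegenStageId) {
    return useIdInClassName
        ? "GeneratedIteratorForCodegenStage" + codegenStageId
        : "GeneratedIterator";
  }

  public static void main(String[] args) {
    System.out.println(generatedClassName(true, 2));  // prints GeneratedIteratorForCodegenStage2
    System.out.println(generatedClassName(false, 2)); // prints GeneratedIterator
  }
}
```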

Note
generatedClassName is used exclusively when WholeStageCodegenExec unary physical operator is requested to generate the Java source code for the child physical plan subtree.

isTooManyFields Object Method

isTooManyFields…​FIXME

Note
isTooManyFields is used when…​FIXME