Debugging Query Execution
The debug package object contains tools for debugging query execution, i.e. a full analysis of structured queries (as Datasets).
| Method | Description |
|---|---|
| debug | Debugging a structured query |
| debugCodegen | Displays the Java source code generated for a structured query in whole-stage code generation (i.e. the output of each WholeStageCodegen subtree in a query plan) |
The debug package object is part of the org.apache.spark.sql.execution.debug package, which you have to import before you can use the debug and debugCodegen methods.
```scala
// Import the package object
import org.apache.spark.sql.execution.debug._

// Every Dataset (incl. DataFrame) now has the debug and debugCodegen methods
val q: DataFrame = ...
q.debug
q.debugCodegen
```
> Tip: Read up on Package Objects in the Scala programming language.
Internally, the debug package object uses the DebugQuery implicit class that "extends" the Dataset[_] Scala type with the debug methods.
```scala
implicit class DebugQuery(query: Dataset[_]) {
  def debug(): Unit = ...
  def debugCodegen(): Unit = ...
}
```
> Tip: Read up on Implicit Classes in the official documentation of the Scala programming language.
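The implicit-class pattern above can be sketched in plain Scala, without Spark. The object and method names below are hypothetical, used only to show how importing an object's members "extends" an existing type with new methods:

```scala
// A minimal sketch of the implicit-class pattern (hypothetical names,
// not part of Spark). Importing debug._ "extends" String with a
// wordCount method, the way DebugQuery adds debug() to Dataset[_].
object debug {
  implicit class DebugString(s: String) {
    // Counts whitespace-separated words in the string
    def wordCount(): Int = s.split("\\s+").count(_.nonEmpty)
  }
}

object Demo extends App {
  import debug._ // brings the DebugString extension into scope

  println("spark sql debug".wordCount()) // prints 3
}
```

Without the import, `"...".wordCount()` does not compile; the import is what makes the implicit conversion available, exactly as with `import org.apache.spark.sql.execution.debug._`.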
Debugging Dataset — debug Method
```scala
debug(): Unit
```
debug requests the QueryExecution (of the structured query) for the optimized physical query plan.
debug transforms the optimized physical query plan to add a new DebugExec physical operator for every physical operator.
debug requests the query plan to execute and then counts the number of rows in the result. It prints out the following message:
```text
Results returned: [count]
```
In the end, debug requests every DebugExec physical operator (in the query plan) to dumpStats.
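The "wrap every operator" transformation that debug performs can be sketched on a hypothetical miniature plan tree (this is not Spark's actual SparkPlan API, only an illustration of the recursive rewrite):

```scala
// Hypothetical mini plan tree (not Spark's SparkPlan) illustrating how
// debug inserts a DebugExec wrapper above every physical operator.
sealed trait Plan
case class Range(n: Long) extends Plan
case class Filter(cond: String, child: Plan) extends Plan
case class DebugExec(child: Plan) extends Plan

// Recursively wrap each operator with DebugExec, rewriting children first
def addDebug(plan: Plan): Plan = plan match {
  case Filter(cond, child) => DebugExec(Filter(cond, addDebug(child)))
  case other               => DebugExec(other)
}

// addDebug(Filter("id = 4", Range(10))) yields
// DebugExec(Filter("id = 4", DebugExec(Range(10))))
```

Each DebugExec wrapper is what later reports per-operator statistics via dumpStats.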
```scala
val q = spark.range(10).where('id === 4)

scala> :type q
org.apache.spark.sql.Dataset[Long]

// Extend Dataset[Long] with debug and debugCodegen methods
import org.apache.spark.sql.execution.debug._

scala> q.debug
Results returned: 1
== WholeStageCodegen ==
Tuples output: 1
 id LongType: {java.lang.Long}
== Filter (id#0L = 4) ==
Tuples output: 0
 id LongType: {}
== Range (0, 10, step=1, splits=8) ==
Tuples output: 0
 id LongType: {}
```
Displaying Java Source Code Generated for Structured Query in Whole-Stage Code Generation (“Debugging” Codegen) — debugCodegen Method
```scala
debugCodegen(): Unit
```
debugCodegen requests the QueryExecution (of the structured query) for the optimized physical query plan.
In the end, debugCodegen simply requests the codegenString of the query plan and prints it out to the standard output.
```scala
import org.apache.spark.sql.execution.debug._

scala> spark.range(10).where('id === 4).debugCodegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*Filter (id#29L = 4)
+- *Range (0, 10, splits=8)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
...
```