关注 spark技术分享,
撸spark源码 玩spark最佳实践

FilterExec

FilterExec Unary Physical Operator

FilterExec is a unary physical operator (i.e. with one child physical operator) that represents Filter and TypedFilter unary logical operators at execution.

FilterExec supports Java code generation (aka codegen) as follows:

FilterExec is created when:

Table 1. FilterExec’s Performance Metrics
Key Name (in web UI) Description

numOutputRows

number of output rows

spark sql FilterExec webui details for query.png
Figure 1. FilterExec in web UI (Details for Query)

FilterExec uses whatever the child physical operator uses for the input RDDs, the outputOrdering and the outputPartitioning.

FilterExec uses the PredicateHelper for…​FIXME

Table 2. FilterExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

notNullAttributes

FIXME

Used when…​FIXME

notNullPreds

FIXME

Used when…​FIXME

otherPreds

FIXME

Used when…​FIXME

Creating FilterExec Instance

FilterExec takes the following when created:

FilterExec initializes the internal registries and counters.

isNullIntolerant Internal Method

isNullIntolerant…​FIXME

Note
isNullIntolerant is used when…​FIXME

usedInputs Method

Note
usedInputs is part of CodegenSupport Contract to…​FIXME.

usedInputs…​FIXME

output Method

Note
output is part of QueryPlan Contract to…​FIXME.

output…​FIXME

Generating Java Source Code for Produce Path in Whole-Stage Code Generation — doProduce Method

Note
doProduce is part of CodegenSupport Contract to generate the Java source code for produce path in Whole-Stage Code Generation.

doProduce…​FIXME

Generating Java Source Code for Consume Path in Whole-Stage Code Generation — doConsume Method

Note
doConsume is part of CodegenSupport Contract to generate the Java source code for consume path in Whole-Stage Code Generation.

doConsume creates a new metric term for the numOutputRows metric.

doConsume…​FIXME

In the end, doConsume uses consume and FIXME to generate a Java source code (as a plain text) inside a do {…​} while(false); code block.

genPredicate Internal Method

Note
genPredicate is an internal method of doConsume.

genPredicate…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

Note
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute executes the child physical operator and creates a new MapPartitionsRDD that does the filtering.

Internally, doExecute takes the numOutputRows metric.

In the end, doExecute requests the child physical operator to execute (that triggers physical query planning and generates an RDD[InternalRow]) and transforms it by executing the following function on internal rows per partition with index (using RDD.mapPartitionsWithIndexInternal that creates another RDD):

  1. Creates a partition filter as a new GenPredicate (for the filter condition expression and the output schema of the child physical operator)

  2. Requests the generated partition filter Predicate to initialize (with 0 partition index)

  3. Filters out elements from the partition iterator (Iterator[InternalRow]) by requesting the generated partition filter Predicate to evaluate for every InternalRow

    1. Increments the numOutputRows metric for positive evaluations (i.e. that returned true)

Note
doExecute (by RDD.mapPartitionsWithIndexInternal) adds a new MapPartitionsRDD to the RDD lineage. Use RDD.toDebugString to see the additional MapPartitionsRDD.
赞(0) 打赏
未经允许不得转载:spark技术分享 » FilterExec
分享到: 更多 (0)

关注公众号:spark技术分享

联系我们联系我们

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏