CollapseCodegenStages

Each RDD call can be thought of as composing an additional iterator on top of the previous RDD's row iterator. Turning all of the operations between two shuffles into a single code block is known as WholeStageCodeGen.
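To make the contrast concrete, here is a minimal sketch in plain Python. It is not Spark code, and every function name in it is hypothetical. It writes the same filter-then-project pipeline twice: once as iterators stacked on top of each other, and once as the single fused loop that whole-stage code generation aims to produce.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int]) -> Iterator[int]:
    """Leaf operator: yields rows one at a time."""
    for row in rows:
        yield row


def filter_even(child: Iterator[int]) -> Iterator[int]:
    """Wraps the child iterator and keeps only even rows."""
    for row in child:
        if row % 2 == 0:
            yield row


def project_double(child: Iterator[int]) -> Iterator[int]:
    """Wraps the child iterator and doubles each row."""
    for row in child:
        yield row * 2


def iterator_pipeline(rows: Iterable[int]) -> List[int]:
    """One iterator per operator; every row passes through three next() calls."""
    return list(project_double(filter_even(scan(rows))))


def fused_pipeline(rows: Iterable[int]) -> List[int]:
    """The same three operators collapsed into one loop body."""
    out = []
    for row in rows:
        if row % 2 == 0:
            out.append(row * 2)
    return out
```

Both functions return the same rows. The fused version simply avoids handing each row from one iterator to the next.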

The way this works is as follows:

1. CollapseCodegenStages.apply() is invoked on a processed SparkPlan.
2. We check whether the SparkPlan supports CodeGen. If it does, we transform the plan into a WholeStageCodegenExec object; otherwise we try to break up the plan recursively by checking whether any of its children support CodeGen.
3. The collapsed SparkPlan is returned. It is now either a WholeStageCodegenExec or the same plan with some of its children recursively collapsed into WholeStageCodegenExec plans.
4. When doExecute() is called on the WholeStageCodegenExec object, Spark generates Java code for all of the operations that were collapsed into it (as above), compiles that code using Janino, and then calls mapPartitions() with the generated class, which executes the operations as compiled bytecode.

These WholeStageCodeGen plans sit within the plan tree as a normal stage would. Each one is just a much denser set of operations, so execution otherwise proceeds as usual.
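The recursive shape of the collapsing rule can be sketched with a toy plan tree in plain Python. This is an illustration only, not Spark's implementation: the `Plan` class, the `supports_codegen` flag, and the `collapse`, `fuse` and `render` helpers are all hypothetical names, and the real rule applies additional conditions that are omitted here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Toy stand-in for a physical plan node (hypothetical, not SparkPlan)."""
    name: str
    supports_codegen: bool
    children: List["Plan"] = field(default_factory=list)


def collapse(plan: Plan) -> Plan:
    """Wrap codegen-capable subtrees in a 'WholeStageCodegen' node."""
    if plan.supports_codegen:
        return Plan("WholeStageCodegen", False, [fuse(plan)])
    # This node cannot be fused, so try each child independently.
    return Plan(plan.name, False, [collapse(c) for c in plan.children])


def fuse(plan: Plan) -> Plan:
    """Keep capable children in the current stage; restart below the rest."""
    children = [
        fuse(c) if c.supports_codegen else collapse(c)
        for c in plan.children
    ]
    return Plan(plan.name, plan.supports_codegen, children)


def render(plan: Plan) -> str:
    """Print the tree as nested calls, e.g. Project(Filter(Scan))."""
    if not plan.children:
        return plan.name
    return f"{plan.name}({', '.join(render(c) for c in plan.children)})"


# Project -> Filter -> Exchange (a shuffle, no codegen) -> Scan
example_plan = Plan("Project", True, [
    Plan("Filter", True, [
        Plan("Exchange", False, [
            Plan("Scan", True),
        ]),
    ]),
])
```

Calling `render(collapse(example_plan))` gives `WholeStageCodegen(Project(Filter(Exchange(WholeStageCodegen(Scan)))))`. The shuffle splits the plan into two collapsed stages, one above it and one below it.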

For details of WholeStageCodeGen, please refer to the section in Tungsten.

