How to dump generated Java code to stdout?
Using DataFrames on Apache Spark 2.x, is there a way to get at the underlying RDDs and dump the generated Java code to the console?
Solution 1:[1]
This can be done with QueryExecution.debug.codegen, which is accessible on a DataFrame/Dataset via .queryExecution. Note that .queryExecution is a "Developer API", i.e. unstable and subject to breakage, so it should only be used for debugging. This works on Spark 2.4.0 and, judging from the code, appears to work since 2.0.0 (possibly earlier):
scala> val df = spark.range(1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.queryExecution.debug.codegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*(1) Range (0, 1000, step=1, splits=12)
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */ private Object[] references;
/* 008 */ private scala.collection.Iterator[] inputs;
/* 009 */ private boolean range_initRange_0;
/* 010 */ private long range_number_0;
/* 011 */ private TaskContext range_taskContext_0;
/* 012 */ private InputMetrics range_inputMetrics_0;
/* 013 */ private long range_batchEnd_0;
/* 014 */ private long range_numElementsTodo_0;
/* 015 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
...
/* 104 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_nextBatchTodo_0);
/* 105 */ range_inputMetrics_0.incRecordsRead(range_nextBatchTodo_0);
/* 106 */
/* 107 */ range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 108 */ }
/* 109 */ }
/* 110 */
/* 111 */ }
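A shorthand for the same output is the implicit debugCodegen() method from the org.apache.spark.sql.execution.debug package. The sketch below assumes a local Spark session (the master setting and app name are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.SparkSession
// Brings in implicit debug helpers (debug(), debugCodegen()) for Datasets
import org.apache.spark.sql.execution.debug._

object CodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")        // local session just for this demo
      .appName("codegen-demo")
      .getOrCreate()

    val df = spark.range(1000)
    // Prints each WholeStageCodegen subtree and its generated Java source,
    // equivalent to df.queryExecution.debug.codegen above
    df.debugCodegen()

    spark.stop()
  }
}
```

Like .queryExecution, this lives in a developer/debugging package, so it carries the same stability caveats.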
Solution 2:[2]
Here is one way to output the generated code; there are probably others:
import org.apache.spark.sql.execution.command.ExplainCommand
val explain = ExplainCommand(df.queryExecution.logical, codegen=true)
spark.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
r => println(r.getString(0))
}
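On Spark 3.0 and later, the same result is available without constructing an ExplainCommand by hand: Dataset.explain accepts an explain-mode string, including "codegen". A minimal sketch (the master setting and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object ExplainCodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")        // local session just for this demo
      .appName("explain-codegen")
      .getOrCreate()

    val df = spark.range(1000)
    // Spark 3.0+: prints the physical plan along with the generated code,
    // matching the output of the ExplainCommand approach above
    df.explain("codegen")

    spark.stop()
  }
}
```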
Solution 3:[3]
Another option is to enable INFO-level logging for Spark; the generated Java code then appears in the logs.
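If plain INFO logging does not surface the code, a narrower alternative (an assumption, depending on your Spark version) is to raise the level of the codegen compiler's own logger, which emits the formatted source at DEBUG. For a log4j.properties-based setup that could look like:

```properties
# Log the generated Java source from the whole-stage codegen compiler
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator=DEBUG
```

This keeps the rest of the log at its usual level while still capturing the generated code.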
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | huon |
| Solution 2 | Raphael Roth |
| Solution 3 | babkamen |
