java.lang.ArrayIndexOutOfBoundsException: 1 while saving a DataFrame in Spark (Scala)
In EMR, we use a Salesforce Bulk API call to fetch records from Salesforce objects. For one object (Task), saving the DataFrame to Parquet fails with the error below.
```
java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
    at org.apache.spark.sql.Row$class.apply(Row.scala:163)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
    at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
    at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema of nullable string columns, one per non-compound Salesforce field.
val sfdcObjectSchema = StructType(
  nonCompoundMetas.map(_.Name).map(fieldName => StructField(fieldName, StringType, nullable = true))
)

// Bulk-load the object via the springml Salesforce connector.
val sfdcObjectDF = spark.read.format("com.springml.spark.salesforce")
  .option("username", userName)
  .option("password", s"$sfdcPassword$sfdcToken")
  .option("soql", retrievingSOQL)
  .option("version", JavaUtils.getConfigProps(runtimeEnvironment).getProperty("sfdc.api.version"))
  .option("sfObject", sfdcObject)
  .option("bulk", "true").option("pkChunking", pkChunking).option("chunkSize", checkingSize)
  .option("timeout", bulkTimeoutMillis.toString)
  .option("maxCharsPerColumn", "-1").option("maxColumns", nonCompoundMetas.size.toString)
  .schema(sfdcObjectSchema)
  .load()

// Drop rows that are entirely null, then write to Parquet.
sfdcObjectDF.na.drop("all").write.mode(SaveMode.Overwrite)
  .parquet(s"${JavaUtils.getConfigProps(runtimeEnvironment).getProperty("etl.dataset.root")}/$accountName/$sfdcObject")
```
Please help us debug this issue further.
Solution 1:[1]
This issue is caused by your Salesforce SOQL query returning an empty result set, which triggers this runtime error.
I believe the root cause is that the https://github.com/springml/spark-salesforce connector's implementation of the Spark DataSource API does not handle the empty-DataFrame case, hence this bug. You may want to open an issue in that repository to raise it.
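As a hypothetical illustration of the mechanism (not the connector's actual code path): the top stack frame is `GenericRow.get`, and a row carrying fewer values than the declared schema throws exactly this exception when Spark reads field index 1.

```scala
import org.apache.spark.sql.Row

// Hypothetical repro: a row with a single value, while the schema expects more.
// Reading index 1 fails the same way as the top frame of the stack trace.
val shortRow = Row("only one field")
shortRow.get(1) // java.lang.ArrayIndexOutOfBoundsException: 1
```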
As a temporary workaround, you could run a `SELECT COUNT(Id) ...` SOQL query first and make sure the result is greater than 0 before you generate the DataFrame and use it in Spark.
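A minimal sketch of that guard, reusing the variable names from the question. The `cnt` alias, the `outputPath` variable, and the assumption that the connector returns the aggregate as a single readable column are all illustrative, not confirmed connector behavior:

```scala
// Run the COUNT query first; only do the bulk extract when records exist.
val countDF = spark.read.format("com.springml.spark.salesforce")
  .option("username", userName)
  .option("password", s"$sfdcPassword$sfdcToken")
  .option("soql", s"SELECT COUNT(Id) cnt FROM $sfdcObject")
  .option("version", JavaUtils.getConfigProps(runtimeEnvironment).getProperty("sfdc.api.version"))
  .load()

// The connector surfaces values as strings, so convert defensively.
val recordCount = countDF.head().get(0).toString.trim.toLong
if (recordCount > 0) {
  // Safe to run the bulk SOQL and write Parquet as in the question.
  // `outputPath` stands in for the full etl.dataset.root path expression.
  sfdcObjectDF.na.drop("all").write.mode(SaveMode.Overwrite).parquet(outputPath)
}
```

Because `load()` is lazy, the failure only surfaces at write time, so gating the write on the count is enough to avoid the crash.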
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Yong Zhang |
