org.apache.spark.SparkException: Failed to execute user defined function($anonfun$last3daysMean$1: (string) => double)

val columns =
  Seq("userid", "event_first_login_time", "event_last_login_time", "event_retention_str")

val data = Seq(
  ("780224", "2020-12-16", "2021-08-05", "1,5,1,180,1,44"),
  ("780225", "2020-12-16", "2021-05-06", ",1,2,1,3,14,1,13,2,5,1,28,1,29,4,1,8,1,18,2,1"))
val df = data.toDF(columns: _*)
df.show(false)

def last3daysMean: (String) => (Double) = (str: String) => {
  val list = str.split(",")
  var total = 0.0
  for (i <- str.length - 3 to str.length - 1) {
    total += list(i).toInt
  }
  val avg = total / 3
  avg
}

val convertUDF = udf(last3daysMean)

var df4 = df.withColumn("last3days_mean", convertUDF(col("event_retention_str")))

df4.show(10, false)


Solution 1:[1]

The error is caused by an indexing bug. In your for loop, the range is computed from the length of the input string (str.length), not from the length of the array produced by splitting on ','. If you read through the stack trace, there will be an index-out-of-bounds exception in there somewhere. A more idiomatic version of your function would look like:

def last3daysMean: (String) => (Double) = (str: String) => {
  str.split(",")
    .takeRight(3)
    .map(_.toInt)
    .sum / 3.0
}
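
To make the indexing problem concrete, here is a minimal plain-Scala sketch (no Spark needed) that uses the first sample value from the question. The names `firstIndex`, `outcome`, and `fixedMean` are only illustrative and do not appear in the original code. It assumes a context where top-level definitions are allowed, such as the REPL, spark-shell, or a Scala 3 script.

```scala
// Sample value taken from the question's first row.
val str = "1,5,1,180,1,44"
val list = str.split(",")

// The string has 14 characters, but the split array has only 6 elements.
println(s"str.length = ${str.length}, list.length = ${list.length}")

// The original loop starts at str.length - 3, i.e. index 11,
// which is past the end of the 6-element array.
val firstIndex = str.length - 3
val outcome = scala.util.Try(list(firstIndex))
println(s"lookup at index $firstIndex failed: ${outcome.isFailure}")

// Ranging over the array's length instead picks the last three elements.
val fixedMean =
  (list.length - 3 until list.length).map(i => list(i).toInt).sum / 3.0
println(fixedMean) // (180 + 1 + 44) / 3.0 = 75.0
```

Like `takeRight(3)` followed by a division by 3.0, this assumes every row has at least three entries; shorter inputs would need separate handling.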

When using UDFs, be careful with null values. Your df4 expression should include a when...otherwise check so that no nulls are passed into the UDF.
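
One way such a check might look is sketched below. It assumes a local SparkSession and a small hypothetical DataFrame whose second row (userid "780226") has a null `event_retention_str`; that row is not part of the question's data and is there only to exercise the guard.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf, when}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-guard-sketch")
  .getOrCreate()
import spark.implicits._

// Same logic as the answer's function: mean of the last three entries.
val last3daysMean: String => Double = (str: String) =>
  str.split(",").takeRight(3).map(_.toInt).sum / 3.0

val convertUDF = udf(last3daysMean)

// Hypothetical sample data; the second row carries a null string.
val df = Seq[(String, String)](
  ("780224", "1,5,1,180,1,44"),
  ("780226", null)
).toDF("userid", "event_retention_str")

// Invoke the UDF only when the input is non-null; otherwise emit a null double.
val df4 = df.withColumn(
  "last3days_mean",
  when(col("event_retention_str").isNotNull, convertUDF(col("event_retention_str")))
    .otherwise(lit(null).cast("double"))
)

df4.show(10, false)
```

An alternative is to handle null inside the function itself (for example by wrapping the input in `Option`), which keeps the column expression shorter at the cost of hiding the null handling inside the UDF.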

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jarrod Baker