'Spark - Replace first occurrence in a string

I want to use the replaceFirst() function in spark scala sql. or Is it possible to use the replaceFirst() function in spark scala dataframe?

Is this possible without using a UDF?

The function I want to do is:

println("abcdefgbchijkl".replaceFirst("bc","**BC**"))
// a**BC**defgbchijkl

However, the Column Type of DataFrame cannot be applied with Function:

var test0 = Seq("abcdefgbchijkl").toDF("col0")

test0
.select(col("col0").replaceFirst("bc","**BC**"))
.show(false)
/*
<console>:230: error: value replaceFirst is not a member of org.apache.spark.sql.Column
       .select(col("col0").replaceFirst("bc","**BC**"))
*/

Also, I don't know how to use it in SQL form:

%sql
-- How to use replaceFirst()
select replaceFirst()


Solution 1:[1]

Replacing the first occurrence isn't something I can see supported out of the box by Spark, but it is possible by combining a few functions:

Spark >= 3.0.0

import org.apache.spark.sql.functions.{array_join, col, split}

val test0 = Seq("abcdefgbchijkl").toDF("col0") // replaced `var` with `val`

val stringToReplace = "bc"
val replacement = "**BC**"

test0
  // create a temporary column, splitting the string by the first occurrence of `bc`
  .withColumn("temp", split(col("col0"), stringToReplace, 2))
  // recombine the strings before and after `bc` with the desired replacement
  .withColumn("col0", array_join(col("temp"), replacement))
  // we no longer need this `temp` column
  .drop(col("temp"))
  .show(false)

gives:

+------------------+
|col0              |
+------------------+
|a**BC**defgbchijkl|
+------------------+

For (spark) SQL:

-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
  -- create a temporary column, splitting the string by the first occurrence of `bc`
  SELECT split(col0, "bc", 2) AS tempr
  FROM (
    SELECT 'abcdefgbchijkl' AS col0
  )
)

Spark < 3.0.0 (pre 2020, tested using Spark 2.4.5)

val test0 = Seq("abcdefgbchijkl").toDF("col0")

val stringToReplace = "bc"
val replacement = "**BC**"

val splitFirst = udf { (s: String) => s.split(stringToReplace, 2) }

spark.udf.register("splitFirst", splitFirst) // if you're using Spark SQL

test0
  // create a temporary column, splitting the string by the first occurrence of `bc`
  .withColumn("temp", splitFirst(col("col0")))
  // recombine the strings before and after `bc` with the desired replacement
  .withColumn("col0", array_join(col("temp"), replacement))
  // we no longer need this `temp` column
  .drop(col("temp"))
  .show(false)

gives:

+------------------+
|col0              |
+------------------+
|a**BC**defgbchijkl|
+------------------+

For (spark) SQL:

-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
  -- create a temporary column, splitting the string by the first occurrence of `bc`
  SELECT splitFirst(col0) AS tempr -- `splitFirst` was registered above
  FROM (
    SELECT 'abcdefgbchijkl' AS col0
  )
)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1