'PySpark replace multiple words in string column based on values in array column

I have a dataframe that contains a string column with text of varied lengths, then I have an array column where each element is a struct with specified word, index, start position and end position in the text column. I want to replace words in the text column, that is in the array.

It looks like this:

- id:integer
- text:string
- text_entity:array
  - element:struct
    - word:string
    - index:integer
    - start:integer
    - end:integer

text example could be:

"I talked with Christian today at Cafe Heimdal last Wednesday"

text_entity example could be:

[{"word": "Christian", "index":4, "start":14, "end":23}, {"word": "Heimdal", "index":8, "start":38, "end":45}]

I then want to change the text to have the words at the above indexes replaced to:

"I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday"

My initial approach was to explode the array and then do a regex_replace, but then there is the problem of collecting the text and merging them. And it seems like it would take a lot of operations. And I would like to not use UDFs, as performance is quite important. regex_replace also has the problem that it might match sub-strings, and that would not be okay. Therefore ideally the index, start, or end is used.



Solution 1:[1]

I came up with this answer using regexp_replace. Problem with using regex_replace however is that it will replace all occurrences, which is not the intention as a word could appear multiple time in the text, and only some of the occurrences should be bleeped

df = df.withColumn("temp_entities", F.expr(f"transform(text_entity, (x, i) -> x.word)")) \
    .withColumn("temp_entities", F.array_distinct("temp_entities")) \
    .withColumn("regex_expression", F.concat_ws("|", "temp_entities")) \
    .withColumn("regex_expression", F.concat(F.lit("\\b("), F.col("regex_expression"), F.lit(")\\b"))) \
    .withColumn("text", F.when(F.size("text_entity") > 0, F.expr("regexp_replace(text, regex_expression, '(BLEEP)')")).otherwise(F.col(text)))

It removes duplicates, and only applies regexp_replace if there are at least 1 entity. Probably not the most elegant solution, and will bleep all occurrences of the word. Ideally the position should be used.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 blackbishop