'How to find position of substring column in a another column using PySpark?

If I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext in text column?

Input data:

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:

+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

Note: I can do this using static text/regex without issue, I have not been able to find any resources on doing this with a row-specific text/regex.



Solution 1:[1]

You can use locate. You need to subtract 1 because string index starts from 1, not 0.

import pyspark.sql.functions as F

df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))

df2.show(truncate=False)
+-------------------------+-------+--------+
|text                     |subtext|position|
+-------------------------+-------+--------+
|Where is my string?      |is     |6       |
|Hm, this one is different|on     |9       |
+-------------------------+-------+--------+

Solution 2:[2]

Another way using position SQL function :

from pyspark.sql.functions import expr

df1 = df.withColumn('position', expr("position(subtext in text) -1"))

df1.show(truncate=False)

#+-------------------------+-------+--------+
#|text                     |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string?      |is     |6       |
#|Hm, this one is different|on     |9       |
#+-------------------------+-------+--------+

Solution 3:[3]

pyspark.sql.functions.instr(str, substr)

Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.

import pyspark.sql.functions as F
df.withColumn('pos',F.instr(df["text"], df["subtext"]))

Solution 4:[4]

You can use locate itself. The problem is first parameter of locate (substr) should be string.

So you can use expr function to convert column to string

Please find the correct code as below:

df=input_df.withColumn("poss", F.expr("locate(subtext,text,1)"))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mck
Solution 2 blackbishop
Solution 3 nobody
Solution 4 saranya rasamani