'How to add a new column to a PySpark DF based upon matching rows of a column to a list
I have a PySpark DF, which contains millions of instances of road names. Therefore I will not be able to convert to Pandas.
I also have a list that contains a small subset (around 400) of road names that are important.
I want to create a new column in the DF that flags whether the road name contains any of the string in the subset. For example, if the list contained an instance called "Portobello", then I want all rows in the DF that contain "Portobello" to flag with a number "1" in a new column. If the row does not contain then I want a "0" in the column. As long as it contains "Portobello", then the positioning does not matter i.e. "Blue Portobello Road" or "Portobello Green Lane" are both matches.
Once the column has been set as 1, then it cannot change based upon the run of other items in the list. But if it has been flagged as 0 previously, then it can changed based upon a match with other items in the list.
I am struggling to create this. If anyone be able to help - I will be very grateful!
Thank you
Solution 1:[1]
One way is to combine every elements in the subset separated by pipes | and use rlike inside a case clause
subset = ["Portobello", "Ladbroke", "Kensington Park"]
spark.sql("""
SELECT *,
case when address rlike '{}' then 1 else 0 end as flag
FROM values
(1, 'Portobello Green Lane'),
(2, 'Blue Portobello Road'),
(3, 'Clare Gardens Childrens Centre'),
(4, '6 Kensington Park Gardens') as tab(id, address)
""".format('|'.join(subset))
).show(truncate=False)
# +---+------------------------------+----+
# |id |address |flag|
# +---+------------------------------+----+
# |1 |Portobello Green Lane |1 |
# |2 |Blue Portobello Road |1 |
# |3 |Clare Gardens Childrens Centre|0 |
# |4 |6 Kensington Park Gardens |1 |
# +---+------------------------------+----+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AdibP |
