PySpark UDF to detect "actors"
I have a matrix (DataFrame) and I want to find all the rows where the row and column intersect with a '1' — that is, where the row's 'character' value matches the name of a column holding a '1' in that row.
Example: Sam is an actor. He has a '1' in the 'actor' column, and his 'character' value is 'actor'. This is a row I would want returned.
df = spark.createDataFrame(
    [
        ("actor", "sam", "1", "0", "0", "0", "0"),
        ("villan", "jack", "0", "0", "0", "0", "0"),
        ("actress", "rose", "0", "0", "0", "1", "0"),
        ("comedian", "mike", "0", "1", "1", "0", "1"),
        ("musician", "young", "1", "1", "1", "1", "0")
    ],
    ["character", "name", "actor", "villan", "comedian", "actress", "musician"]
)
+---------+-----+-----+------+--------+-------+--------+
|character| name|actor|villan|comedian|actress|musician|
+---------+-----+-----+------+--------+-------+--------+
| actor| sam| 1| 0| 0| 0| 0|
| villan| jack| 0| 0| 0| 0| 0|
| actress| rose| 0| 0| 0| 1| 0|
| comedian| mike| 0| 1| 1| 0| 1|
| musician|young| 1| 1| 1| 1| 0|
+---------+-----+-----+------+--------+-------+--------+
Solution 1:[1]
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# create the matching function: look up the field named by
# 'needle' inside the struct (Row) 'haystack'
def myMatch(needle, haystack):
    return haystack[needle]

# create the udf; your existing data is strings, so return StringType
matched = udf(myMatch, StringType())

# apply the udf, packing all columns into a struct so the whole
# row can be passed to the udf and indexed by column name
df.select(
    df.name,
    matched(
        df.character,
        f.struct(*[df[col] for col in df.columns])
    ).alias("IsPlayingCharacter")
).show()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Andruff |
