PySpark UDF to detect "Actors"

I have a matrix (DataFrame) and I want to find all the rows where the row and column intersect with a '1' — that is, where the row's 'character' value matches a column name and that column holds '1'.

Example: Sam is an actor. He has a '1' in the 'actor' column, and his row's 'character' value is 'actor'. This is a row I would want returned.

df = spark.createDataFrame(
    [
        ("actor", "sam", "1", "0", "0", "0", "0"),  
        ("villan", "jack", "0", "0", "0", "0", "0"),
        ("actress", "rose", "0", "0", "0", "1", "0"),
        ("comedian", "mike", "0", "1", "1", "0", "1"),
        ("musician", "young", "1", "1", "1", "1", "0")
    ],
    ["character", "name", "actor", "villan", "comedian", "actress", "musician"]  
)
+---------+-----+-----+------+--------+-------+--------+
|character| name|actor|villan|comedian|actress|musician|
+---------+-----+-----+------+--------+-------+--------+
|    actor|  sam|    1|     0|       0|      0|       0|
|   villan| jack|    0|     0|       0|      0|       0|
|  actress| rose|    0|     0|       0|      1|       0|
| comedian| mike|    0|     1|       1|      0|       1|
| musician|young|    1|     1|       1|      1|       0|
+---------+-----+-----+------+--------+-------+--------+
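The intersection test can be illustrated in plain Python before bringing Spark into it (a hypothetical sketch using two of the rows above, not part of the original question): for each row, look up the value under the column named by that row's 'character' field.

```python
# Each dict is one row of the matrix; keys are the column names.
rows = [
    {"character": "actor", "name": "sam", "actor": "1", "villan": "0",
     "comedian": "0", "actress": "0", "musician": "0"},
    {"character": "villan", "name": "jack", "actor": "0", "villan": "0",
     "comedian": "0", "actress": "0", "musician": "0"},
]

# Keep the names whose row has '1' in the column named by its 'character' value.
matches = [r["name"] for r in rows if r[r["character"]] == "1"]
print(matches)  # ['sam']
```

This row-wise lookup is exactly what the UDF below does, with a struct standing in for the dict.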


Solution 1:[1]

from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

#create function
def myMatch(needle, haystack):
    # look up the struct field whose name matches the row's 'character' value
    return haystack[needle]

#create udf
matched = udf(myMatch, StringType())  # your existing data is strings

#apply udf
df.select(
    df.name,
    # shortcut: pack all columns into a struct so they can be passed to the udf
    matched(df.character, f.struct(*[df[col] for col in df.columns]))
        .alias("IsPlayingCharacter")
).show()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matt Andruff