PySpark DataFrame CASE expression
I am trying to write a CASE expression for one column based on whether the value of a different column appears in a list. First, I made a list that looks like this when I print it:
partner = [u'AFLK', u'BLCK', u'MACH']
Then I have a DataFrame named 'joined_df' with columns a, b, c, bike_cd, and geo, all strings.
Now I want to apply a transformation to this DataFrame: populate the column 'geo' based on the value of the column 'bike_cd'. If a record has a bike_cd that appears in the list 'partner', I want the record's geo to be 'GLOBAL'; otherwise it should keep whatever value is already in the 'geo' column. This is what I have so far:
final_df = joined_df.select(joined_df.a,
joined_df.b,
joined_df.c,
joined_df.bike_cd,
("joined_df.geo", expr ("CASE WHEN joined_df.{0} IN {1} THEN 'GLOBAL' ELSE joined_df.geo END".format(bike_cd,partner))),)
I tried a few things, but nothing worked. The code above now fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
NameError: name 'bike_cd' is not defined
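The NameError arises because `.format(bike_cd, partner)` passes `bike_cd` as a Python variable, and no such variable exists; only the column name 'bike_cd' does. Formatting the Python list directly would also render it with square brackets, which is not a valid SQL IN list, and the `joined_df.` prefix inside the SQL string is a Python variable name that Spark would likely not resolve unless the DataFrame is aliased that way. A minimal sketch of building the CASE string follows. The helper name `build_case_sql` is hypothetical, and it assumes the codes contain no quote characters.

```python
partner = [u'AFLK', u'BLCK', u'MACH']


def build_case_sql(codes, code_col="bike_cd", target_col="geo"):
    """Build a SQL CASE expression string (hypothetical helper).

    Assumes the codes are plain strings without quote characters,
    so no escaping is attempted here.
    """
    # Render the Python list as a SQL list body: 'AFLK', 'BLCK', 'MACH'
    quoted = ", ".join("'{0}'".format(code) for code in codes)
    return "CASE WHEN {0} IN ({1}) THEN 'GLOBAL' ELSE {2} END".format(
        code_col, quoted, target_col
    )


case_sql = build_case_sql(partner)
print(case_sql)
# CASE WHEN bike_cd IN ('AFLK', 'BLCK', 'MACH') THEN 'GLOBAL' ELSE geo END

# With PySpark available, the string could then be applied as:
#   from pyspark.sql.functions import expr
#   final_df = joined_df.withColumn("geo", expr(case_sql))
```

Building the string separately makes it easy to print and inspect the generated SQL before handing it to `expr`.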
Alternatively, I tried this. It works, but I get an extra column instead of changing the existing column geo.
final_df = joined_df.select('*', when(joined_df.bike_cd.isin(partner), 'GLOBAL').otherwise(joined_df.geo))
I am not sure whether I have the right idea or whether this is totally wrong. I appreciate your help.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
