PySpark DataFrame CASE expression

I am trying to write a CASE expression for a column based on the values of a different column, comparing them against a list. First I made a list, which looks like this when I print it:

partner = [u'AFLK', u'BLCK', u'MACH']

Then I have a DataFrame named 'joined_df' with columns a, b, c, bike_cd, and geo, all strings.

Now I want to transform this DataFrame. I am trying to write a CASE statement that populates the column 'geo' based on the value of column bike_cd: if a record's bike_cd appears in the list 'partner', its geo should be 'GLOBAL'; otherwise it should keep whatever value is already in the 'geo' column.

final_df = joined_df.select(joined_df.a,
joined_df.b,
joined_df.c,
joined_df.bike_cd,
("joined_df.geo", expr ("CASE WHEN joined_df.{0} IN {1} THEN 'GLOBAL' ELSE joined_df.geo END".format(bike_cd,partner))),)

I tried a few things, but nothing worked. With the code above I now get the following error:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
NameError: name 'bike_cd' is not defined
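
The NameError comes from passing bike_cd to format() as a bare Python name rather than as a string. Formatting the Python list directly would also render as [u'AFLK', ...], which is not a valid SQL IN list, and the joined_df. prefix is a Python variable name that the SQL parser probably does not know unless the DataFrame has been aliased. A minimal sketch of building the expression string in plain Python, assuming the partner codes contain no single quotes (the names in_list and case_sql are illustrative, not from the original code):

```python
partner = ["AFLK", "BLCK", "MACH"]

# Render the Python list as a SQL IN list body: 'AFLK', 'BLCK', 'MACH'
# Assumption: the codes contain no single quotes, so no escaping is done here.
in_list = ", ".join("'{}'".format(code) for code in partner)

# The column name is passed as a string, not as a bare Python name.
case_sql = "CASE WHEN {0} IN ({1}) THEN 'GLOBAL' ELSE geo END".format("bike_cd", in_list)

print(case_sql)
# CASE WHEN bike_cd IN ('AFLK', 'BLCK', 'MACH') THEN 'GLOBAL' ELSE geo END
```

A string like this could then be handed to pyspark.sql.functions.expr and given the name geo with .alias("geo") inside the select.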

Alternatively, I tried this. It works, but I get an extra column instead of changing the existing column geo.

final_df = joined_df.select('*', when(joined_df.bike_cd.isin(partner), 'GLOBAL').otherwise(joined_df.geo))

I am not sure whether I have the right idea or whether this is totally wrong. I appreciate your help.

apache-spark pyspark

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
