'How to use WHEN clause to check Null condition on a String Column of a Pyspark dataframe?
I am trying to check NULL or empty string on a string column of a data frame and 0 for an integer column as given below.
emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header').isNull()) | (F.col('emp_header') == '0'), 'UNKNOWN')) \
.withColumn('emp_item', when((F.col('emp_item').isNull()) | (F.col('emp_item') == 0), -1)) \
.withColumn('emp_lease', when((F.col('emp_header').isNull() | F.col('emp_header') == '0') & (F.col('emp_item').isNull() | F.col('emp_item') == 0), -1)))
The column emp_header is a String column, emp_item is an Integer column and emp_lease is an Integer column.
When I run the above piece of code, I get an error saying there is a data type mismatch in the column emp_header between NULL & STRING as given below.
2022-03-03 07:17:41,931 - src.emp_load - 76 - ERROR - Failed to load history data into emp_data table with the exception: cannot resolve '((`emp_header` IS NULL) OR `emp_header`)' due to data type mismatch: differing types in '((`emp_header` IS NULL) OR `emp_header`)' (boolean and string).;;
So I tried to make NULL to a different syntax as given below.
emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header') == '') | (F.col('emp_header') == '0'), 'UNKOWN')) \
.withColumn('emp_item', when((F.col('emp_item') == '') | (F.col('emp_item') == 0), -1)) \
.withColumn('emp_lease', when((F.col('emp_header') == '' | F.col('emp_header') == '0') & (F.col('emp_item') == '' | F.col('emp_item') == 0), -1))
This time the compiler on my Pycharm says: Expected type 'Column', got 'str' instead at the line marked in the screenshot below.

If I go ahead and run the code with the above changes, the code fails with a different exception
2022-03-03 07:46:08,227 - src.emp_load - 76 - ERROR - Failed to load history data into emp_table with the exception: An error occurred while calling o336.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/local_disk0/tmp/spark-a2890f41-1167-481a-85a9-6984c04d05c2/template_python-1.0.0-py3-none-any.whl/src/emp_load.py", line 71, in main
so_lookup_insert(spark=spark, df=df, years=years, columns=columns)
File "/local_disk0/tmp/spark-a2890f41-1167-481a-85a9-6984c04d05c2/template_python-1.0.0-py3-none-any.whl/src/emp_load.py", line 100, in so_lookup_insert
.withColumn('emp_lease', when((F.col('emp_header') == '' | F.col('emp_header') == '0') & (F.col('emp_item') == '' | F.col('emp_item') == 0), -1))
File "/databricks/spark/python/pyspark/sql/column.py", line 118, in _
njc = getattr(self._jc, name)(jc)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/databricks/spark/python/pyspark/sql/utils.py", line 127, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
format(target_id, ".", name, value))
Update 1:
I added parenthesis to the when condition on the third line as suggested in the comment and I am not facing the second exception anymore.
And I made isNull() to '' as below.
emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header') == '') | (F.col('emp_header') == '0'), 'UNKNOWN')) \
.withColumn('emp_item', when((F.col('emp_item') == '') | (F.col('emp_item') == 0), -1)) \
.withColumn('emp_lease', when(((F.col('emp_header') == '') | F.col('emp_header') == '0') & ((F.col('emp_item') == '') | F.col('emp_item') == 0), -1)))
But I still see the exception:
2022-03-03 08:41:31,295 - src.emp_load - 76 - ERROR - Failed to load history data into emp_table with the exception: cannot resolve '((`emp_header` = '') OR `emp_header`)' due to data type mismatch: differing types in '((`emp_header` = '') OR `emp_header`)' (boolean and string).;;
Can anyone let me know what is the mistake I am doing here and how can I fix it?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
