'How to use WHEN clause to check Null condition on a String Column of a Pyspark dataframe?

I am trying to check NULL or empty string on a string column of a data frame and 0 for an integer column as given below.

emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header').isNull()) | (F.col('emp_header') == '0'), 'UNKNOWN')) \
    .withColumn('emp_item', when((F.col('emp_item').isNull()) | (F.col('emp_item') == 0), -1)) \
    .withColumn('emp_lease', when((F.col('emp_header').isNull() | F.col('emp_header') == '0') & (F.col('emp_item').isNull() | F.col('emp_item') == 0), -1)))

The column emp_header is a String column, emp_item is an Integer column and emp_lease is an Integer column.

When I run the above piece of code, I get an error saying there is a data type mismatch in the column emp_header between NULL & STRING as given below.

2022-03-03 07:17:41,931 - src.emp_load -   76 - ERROR - Failed to load history data into emp_data table with the exception: cannot resolve '((`emp_header` IS NULL) OR `emp_header`)' due to data type mismatch: differing types in '((`emp_header` IS NULL) OR `emp_header`)' (boolean and string).;;

So I tried to make NULL to a different syntax as given below.

emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header') == '') | (F.col('emp_header') == '0'), 'UNKOWN')) \
    .withColumn('emp_item', when((F.col('emp_item') == '') | (F.col('emp_item') == 0), -1)) \
    .withColumn('emp_lease', when((F.col('emp_header') == '' | F.col('emp_header') == '0') & (F.col('emp_item') == '' | F.col('emp_item') == 0), -1))

This time the compiler on my Pycharm says: Expected type 'Column', got 'str' instead at the line marked in the screenshot below. enter image description here

If I go ahead and run the code with the above changes, the code fails with a different exception

2022-03-03 07:46:08,227 - src.emp_load - 76 - ERROR - Failed to load history data into emp_table with the exception: An error occurred while calling o336.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/local_disk0/tmp/spark-a2890f41-1167-481a-85a9-6984c04d05c2/template_python-1.0.0-py3-none-any.whl/src/emp_load.py", line 71, in main
    so_lookup_insert(spark=spark, df=df, years=years, columns=columns)
  File "/local_disk0/tmp/spark-a2890f41-1167-481a-85a9-6984c04d05c2/template_python-1.0.0-py3-none-any.whl/src/emp_load.py", line 100, in so_lookup_insert
    .withColumn('emp_lease', when((F.col('emp_header') == '' | F.col('emp_header') == '0') & (F.col('emp_item') == '' | F.col('emp_item') == 0), -1))
  File "/databricks/spark/python/pyspark/sql/column.py", line 118, in _
    njc = getattr(self._jc, name)(jc)
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/databricks/spark/python/pyspark/sql/utils.py", line 127, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))

Update 1: I added parenthesis to the when condition on the third line as suggested in the comment and I am not facing the second exception anymore. And I made isNull() to '' as below.

emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header') == '') | (F.col('emp_header') == '0'), 'UNKNOWN')) \
    .withColumn('emp_item', when((F.col('emp_item') == '') | (F.col('emp_item') == 0), -1)) \
    .withColumn('emp_lease', when(((F.col('emp_header') == '') | F.col('emp_header') == '0') & ((F.col('emp_item') == '') | F.col('emp_item') == 0), -1)))

But I still see the exception:

2022-03-03 08:41:31,295 - src.emp_load - 76 - ERROR - Failed to load history data into emp_table with the exception: cannot resolve '((`emp_header` = '') OR `emp_header`)' due to data type mismatch: differing types in '((`emp_header` = '') OR `emp_header`)' (boolean and string).;;

Can anyone let me know what is the mistake I am doing here and how can I fix it?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source