PySpark "due to data type mismatch: differing types" error in Boolean column creation
I am creating boolean columns and filtering downstream if any of them is false.
I created the boolean columns below in my PySpark code, and it is working:
```python
df = spark.read.parquet(data_url)
df = df\
    .withColumn('d1d3_filter', (df.d1_submit_date.isNotNull() &
                                (df.d1_review_date.isNull())))\
    .withColumn('d41_filter', (df.d4_submit_date.isNotNull() &
                               (df.d4_review_date.isNull())))\
    .withColumn('d42_filter', (df.d42_submit_date.isNotNull() &
                               (df.d42_review_date.isNull())))\
    .withColumn('d45_filter', (df.d5_submit_date.isNotNull() &
                               (df.d5_review_date.isNull())))\
    .withColumn('d6_filter', (df.d8_submit_date.isNotNull() &
                              (df.d8_review_date.isNull())))
```
But when I add another condition, it throws an error: "due to data type mismatch: differing types in d1_review_date IS NULL) OR d1_status)' (boolean and string)"
```python
df = spark.read.parquet(data_url)
df = df\
    .withColumn('d1d3_filter', (df.d1_submit_date.isNotNull() &
                                (df.d1_review_date.isNull() |
                                 df.d1_status != 'Approved')))\
    .withColumn('d41_filter', (df.d4_submit_date.isNotNull() &
                               (df.d4_review_date.isNull() |
                                df.d4_status != 'Approved')))\
    .withColumn('d42_filter', (df.d42_submit_date.isNotNull() &
                               (df.d42_review_date.isNull() |
                                df.d42_status != 'Approved')))\
    .withColumn('d45_filter', (df.d5_submit_date.isNotNull() &
                               (df.d5_review_date.isNull() |
                                df.d5_status != 'Approved')))\
    .withColumn('d6_filter', (df.d8_submit_date.isNotNull() &
                              (df.d8_review_date.isNull() |
                               df.d8_status != 'Approved')))
```
All the columns might have null values.
Why is the new withColumn expression not working? What am I missing here?
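For what it's worth, the grouping shown in the error, `(d1_review_date IS NULL) OR d1_status`, points at Python operator precedence: `|` binds more tightly than `!=`, so `df.d1_review_date.isNull() | df.d1_status != 'Approved'` is parsed as `(df.d1_review_date.isNull() | df.d1_status) != 'Approved'`, which ORs a boolean with a string column. A minimal pure-Python sketch of the parse (no Spark needed; the names `review_date` and `status` are stand-ins):

```python
import ast

# `|` binds more tightly than `!=`, so this expression is parsed as
#   (review_date.isNull() | status) != 'Approved'
# which matches the grouping Spark reports in the error message.
tree = ast.parse("review_date.isNull() | status != 'Approved'", mode="eval")

print(type(tree.body).__name__)       # the top-level operation is the comparison `!=`
print(type(tree.body.left).__name__)  # its left operand is the whole `... | ...` expression
```

If this is the cause, wrapping each comparison in its own parentheses, e.g. `df.d1_review_date.isNull() | (df.d1_status != 'Approved')`, should keep both operands of `|` boolean.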
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow