AWS Glue PySpark DataFrame to pandas null values problem

I am having the following problem with an AWS Glue job: I am trying to clean up a DataFrame by filling null values. Out of 5 Spark DataFrames, the fill worked on 4 of them, but on 1 of them it did not.

df_opp = df_opp.fillna({'opp_redraw_amount': '0', 'opp_loan_date': '1970-01-01', .....})
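One thing I already tried to rule out: as far as I know, fillna with a dict silently skips a column when the fill value's type does not match the column's type, so I checked what Spark thinks the types are first (this check is just my own sanity test, not part of the original job):

# fillna skips columns whose type does not match the fill value,
# so confirm the schema before assuming the fill should work
df_opp.printSchema()  # every column in question comes back as string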

So I decided to print the DataFrames as I converted them to pandas, and I noticed something that looks really bizarre to me and might be the cause of why fillna was not working, but I don't know how to fix it.

The spark dataframe looks like this:

+-------------+--------------------+---------------------------+-----------------+
|opp_closedate|opp_contact_attempts|opp_days_since_last_payment|opp_edm_follow_up|
+-------------+--------------------+---------------------------+-----------------+
|   2019-03-12|                null|                       null|             null|
|   2020-08-22|                null|                       null|             null|
|   2019-08-02|                null|                       null|             null|
|   2018-08-02|                null|                       null|             null|
|   2019-04-09|                null|                       null|             null|
|   2019-05-01|                null|                       null|             null|
|   2019-03-13|                null|                       null|             null|
|   2019-07-29|                null|                       null|             null|
|   2020-12-04|                null|                       null|             null|
|   2017-09-12|                null|                       null|             null|
+-------------+--------------------+---------------------------+-----------------+
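To double-check which of these are real SQL nulls (rather than, say, literal strings), I also counted nulls per column; this is just a diagnostic sketch I added on top of the job:

from pyspark.sql import functions as F

# count the real (SQL) nulls in every column as a sanity check;
# when the condition is false, when() returns null and count() skips it
df_opp.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c in df_opp.columns
]).show()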

When I convert the DataFrame to pandas and print it:

# pick the third DynamicFrame out of the collection dfc
# and convert it to a Spark DataFrame
df_opp = dfc.select(list(dfc.keys())[2]).toDF()
df_opp.show(10)

# convert to pandas to inspect the raw values
pd_df_opp = df_opp.toPandas()
print(pd_df_opp.head(10))

I get a mix of None, null, and NaN; I expected all of these values to be None instead of those 2 other options:

    opp_contact_attempts  opp_days_since_last_payment opp_edm_follow_up  \
40418                   NaN                          NaN              null   
17225                   NaN                          NaN              null   
6151                    NaN                          NaN              null   
24383                   NaN                          NaN              null   
43401                   NaN                          NaN              null   
24462                   NaN                          NaN              null   
45101                   NaN                          NaN              null   
15675                   NaN                          NaN              null   
43002                   NaN                          NaN              null   
7838                    NaN                          NaN              null   
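To figure out what those values actually are, I counted the distinct raw values in the column that prints null (again just a diagnostic I added, using the column name from the output above):

# dropna=False keeps NaN/None in the counts, so a literal 'null'
# string would show up as its own separate bucket
print(pd_df_opp['opp_edm_follow_up'].value_counts(dropna=False))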

Why does Spark show null, but in pandas I sometimes get None, null, or NaN? If I print the dtypes of the pandas DataFrame I get object, and when I print the schema in Spark I get string.
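My guess (and it is only a guess) is that the problematic column contains the literal string 'null' in some rows, which fillna never touches because it only replaces real nulls. If so, maybe something like this sketch would normalize the column first (the fill value here is just for illustration):

from pyspark.sql import functions as F

# turn the literal string 'null' back into a real SQL null,
# so that fillna can replace it together with the genuine nulls
df_opp = df_opp.withColumn(
    'opp_edm_follow_up',
    F.when(F.col('opp_edm_follow_up') == 'null', F.lit(None))
     .otherwise(F.col('opp_edm_follow_up'))
)
df_opp = df_opp.fillna({'opp_edm_follow_up': '0'})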

Is that guess on the right track? What am I missing, or how can I properly fill the null values?


