How to join two dataframes in pyspark and replace the null values in one dataframe's columns with values from the other?

Suppose I have a df with 5 columns and a second df with 6 columns. I want to join df1 with df2 such that the null rows of a column in df1 get replaced by a non-null value from df2. How do I do this in python?

I don't want to specify the column names or hard-code them. I want robust logic that works even if, in the future, we need to replace rows for 7 columns instead of 6.

Sample Data: df1=

col1    col2      col3   col5
1       null      null     5
2       null       5       9
4        4         8       6
null     0         9       1

df2=

col1     col2     col3    col4
1         2        -3      5
null     null       7      5
4         4         8      1
1         null      9      3

Final df=

  col1       col2    col3    col5     col4
    1         2        -3     5         5
    2       null       5      9         5
    4        4         8      6         1
    1        0         9      1         3 

Conditions:

  1. The null rows of a column in df1 get replaced by a non-null value from df2.
  2. If both dataframes have different non-null values at the same index, take either one; it doesn't matter.
  3. If both of them are null, the final df will have a null value at that very same index.
  4. I don't want to specify the column names; I just want a robust script that also works for other data with different column names.


Solution 1:[1]

I want to join df1 with df2 such that the null rows of a column in df1 get replaced by a non-null value from df2. How do I do this in python?

Just join the two dataframes, then use coalesce to get the first non-null value per column.

I don't want to specify the name of the columns, hard code them.

You can access column names via df.columns and column datatypes via df.dtypes.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pltc