How to join two dataframes in PySpark and replace the null values in one dataframe's columns with values from the other?
Suppose I have a df with 5 columns and a second df with 6 columns. I want to join df1 with df2 such that the null rows of a column in df1 get replaced by a not-null value in df2. How do I do this in PySpark?
I don't want to specify the column names or hard-code them. I want robust logic that still works if, in the future, we need to replace values in 7 columns instead of 6.
Sample data:

df1 =

| col1 | col2 | col3 | col5 |
|---|---|---|---|
| 1 | null | null | 5 |
| 2 | null | 5 | 9 |
| 4 | 4 | 8 | 6 |
| null | 0 | 9 | 1 |
df2 =

| col1 | col2 | col3 | col4 |
|---|---|---|---|
| 1 | 2 | -3 | 5 |
| null | null | 7 | 5 |
| 4 | 4 | 8 | 1 |
| 1 | null | 9 | 3 |
Final df =

| col1 | col2 | col3 | col5 | col4 |
|---|---|---|---|---|
| 1 | 2 | -3 | 5 | 5 |
| 2 | null | 5 | 9 | 5 |
| 4 | 4 | 8 | 6 | 1 |
| 1 | 0 | 9 | 1 | 3 |
Conditions:
- The null rows of a column in df1 are replaced by the not-null values in df2.
- If the two dataframes have different not-null values at the same index, take either one; it doesn't matter which.
- If both are null, the final df has null at that same index.
- Column names must not be hard-coded; the script should work for other data with different column names.
Solution 1:[1]
> I want to join df1 with df2 such that the null rows of a column in df1 get replaced by a not-null value in df2. How do I do this in python?
Just join the two dataframes, then use coalesce to get the first non-null value per column.
> I don't want to specify the name of the columns, hard-code them.
You can access the column names via df.columns and the column datatypes via df.dtypes.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |
