Pandas ffill() equivalent in PySpark
I have a dataframe with missing values within a row, and in Pandas I use
df.ffill(axis=1, inplace=True) to forward-fill them.
I want to understand the PySpark equivalent way to achieve this. I have read about using Window functions, but those fill down a column (across rows) rather than across the columns of a row.
Example:
Input:
| id | value1 | value2 | value3 | value4 | value5 |
|---|---|---|---|---|---|
| A | 2 | 3 | NaN | NaN | 6 |
| B | 1 | NaN | NaN | NaN | NaN |
Output:
| id | value1 | value2 | value3 | value4 | value5 |
|---|---|---|---|---|---|
| A | 2 | 3 | 3 | 3 | 6 |
| B | 1 | 1 | 1 | 1 | 1 |
Solution 1:[1]
You can use coalesce, which returns the first non-null value among its arguments: here it keeps value3 when it is not null and otherwise falls back to value2.
```python
from pyspark.sql.functions import coalesce

df = df.withColumn('value3', coalesce('value3', 'value2'))
```
To apply this to the whole dataset, loop over the value columns and coalesce each one with the column to its left. Because each withColumn builds on the column that was just filled, a value propagates rightwards even across several consecutive nulls:
```python
from pyspark.sql.functions import coalesce

cols = df.columns  # ['id', 'value1', 'value2', 'value3', 'value4', 'value5']
for i in range(2, len(cols)):  # start at value2 so 'id' is never used as a fill value
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))
```
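For reference, here is a minimal, self-contained sketch of the whole flow. It assumes the missing entries arrive as real nulls (coalesce skips nulls but not float NaN) and uses the column names from the example above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()

# Recreate the input from the question; Python None becomes null in Spark
df = spark.createDataFrame(
    [("A", 2, 3, None, None, 6),
     ("B", 1, None, None, None, None)],
    schema="id string, value1 int, value2 int, value3 int, value4 int, value5 int",
)

# Row-wise forward fill: each value column falls back to the already-filled column on its left
cols = df.columns
for i in range(2, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))

df.show()
# +---+------+------+------+------+------+
# | id|value1|value2|value3|value4|value5|
# +---+------+------+------+------+------+
# |  A|     2|     3|     3|     3|     6|
# |  B|     1|     1|     1|     1|     1|
# +---+------+------+------+------+------+
```

Note that if the data actually contains NaN (for example in float columns loaded from Pandas), coalesce will keep the NaN; those values would first need to be turned into nulls, e.g. using pyspark.sql.functions.isnan with when, or pyspark.sql.functions.nanvl.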
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | seghair tarek |
