How to concatenate the values of a dataframe along the column axis and fill missing values?

I have been stuck on this problem for a long time.

I have a data frame and I want to group the data by id and then concatenate the values of each id into a single row. However, since the ids do not all have the same number of dates, I have to fill the missing values of an id with that id's values from the day before (or the day after). I previously asked about this and got an answer.

Here is the previous question: Group the dataframe based on their ids and stick the values of ids to each other with the same values of the day before or the day after

Here is the code that I am using for my data.

import pandas as pd

df = pd.read_csv("df_1.csv")
val_cols = df.filter(like="value_").columns  # columns holding the values to concatenate
DF_new = (
    df.pivot(index="id", columns="date", values=val_cols)  # one column per (value, date) pair
    .groupby(level=0, axis=1)                               # group the columns of each value
    .apply(lambda x: x.ffill(axis=1).bfill(axis=1))         # fill gaps from neighbouring dates
    .sort_index(axis=1, level=1)                            # order the columns by date
)
DF_new.columns = [f"val{i}" for i in range(DF_new.shape[1])]
DF_new = DF_new.reset_index()
DF_new

But when I test this on my real data, it does not produce the result that I want. It should start with 1, 2, 3, 4, ... but here it starts with 1, 10, ...
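
If the 1, 10, ... ordering comes from the id (or date) column being read in as text, the sort is lexicographic rather than numeric. A minimal check and fix, assuming that is the cause (same file name as above):

import pandas as pd

df = pd.read_csv("df_1.csv")
print(df.dtypes)  # if "id" shows up as object, it is text and sorts as 1, 10, 11, 2, ...

# convert before pivoting so ids sort numerically and dates sort chronologically
df["id"] = pd.to_numeric(df["id"])
df["date"] = pd.to_datetime(df["date"])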


Here is a sample of my data (I will also put the link in a comment).

I have provided a simple example below: for id=1 and id=2 there are two dates, but for id=3 there are three dates.

import pandas as pd

df = pd.DataFrame()
df['id'] = [1, 1, 2, 2, 3, 3, 3]
df['date'] = ['2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02',
              '2019-01-01', '2019-01-02', '2019-01-03']
df['val0'] = [10, 100, 20, 30, 40, 50, 60]
df['val1'] = [30, 30, -20, -30, -40, -50, -60]
df['val2'] = [50, 10, 120, 300, 140, 150, 160]
print(df)
   id        date  val0  val1  val2
0   1  2019-01-02    10    30    50
1   1  2019-01-03   100    30    10
2   2  2019-01-01    20   -20   120
3   2  2019-01-02    30   -30   300
4   3  2019-01-01    40   -40   140
5   3  2019-01-02    50   -50   150
6   3  2019-01-03    60   -60   160

The desired output is:

    val0  val1   val2  val3  val4   val5   val6  val7   val8
id                                                          
1   10.0  30.0   50.0  10.0  30.0   50.0  100.0  30.0   10.0
2   20.0 -20.0  120.0  30.0 -30.0  300.0   30.0 -30.0  300.0
3   40.0 -40.0  140.0  50.0 -50.0  150.0   60.0 -60.0  160.0

The provided code works on this easy example.

Can anybody help me with that? Thanks



Solution 1:[1]

Here is one way to do it:

# Build the full (id, date) grid: every id paired with every date
# (hard-coded for the example's 3 ids and 3 dates, hence the "* 3" repetitions).
temp_df = pd.DataFrame(
    {
        "id": [j for i in sorted(df["id"].unique()) for j in [i] * 3],
        "date": sorted(df["date"].unique()) * 3,
    }
)
# Attach the known values to the grid; missing (id, date) pairs become NaN.
temp_df = (
    pd.merge(left=temp_df, right=df, how="left", on=["id", "date"])
    .sort_values(by=["id", "date"])
    .drop(columns="date")
)

# Fill the gaps within each id from the day after, then from the day before.
dfs = []
for i in sorted(df["id"].unique()):
    dfs.append(temp_df.loc[temp_df["id"] == i, :].bfill().ffill())
temp_df = pd.concat(dfs)

# Lay the rows of each id out side by side (one wide block per id).
dfs = []
for i in sorted(df["id"].unique()):
    rows = {j: row for j, row in temp_df.loc[temp_df["id"] == i, :].iterrows()}
    dfs.append(pd.concat([pd.DataFrame(row).T for row in rows.values()], axis=1))

final_df = pd.concat(dfs)

# Rename the value columns positionally, keeping the repeated "id" columns as "id".
final_df.columns = [
    f"val{i}" if col != "id" else col for i, col in enumerate(final_df.columns)
]

# Keep a single "id" column and drop duplicated rows.
final_df = final_df.loc[:, ~final_df.columns.duplicated()].drop_duplicates(keep="last")

final_df["id"] = final_df["id"].ffill().astype(int)

# Collapse what is left into one fully filled row per id.
dfs = []
for i in sorted(final_df["id"].unique()):
    dfs.append(
        final_df.loc[final_df["id"] == i, :]
        .bfill()
        .ffill()
    )
final_df = pd.concat(dfs).drop_duplicates().set_index("id")
final_df.columns = [f"val{i}" for i in range(final_df.shape[1])]
print(final_df)
# Output
    val0  val1   val2  val3  val4   val5   val6  val7   val8
id
1   10.0  30.0   50.0  10.0  30.0   50.0  100.0  30.0   10.0
2   20.0 -20.0  120.0  30.0 -30.0  300.0   30.0 -30.0  300.0
3   40.0 -40.0  140.0  50.0 -50.0  150.0   60.0 -60.0  160.0
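
For comparison, the same idea (build the full id × date grid, fill within each id, then lay the dates out side by side) can be sketched more compactly with pd.MultiIndex.from_product and unstack. This is not the answer above, just a minimal alternative sketch assuming the sample df from the question and a reasonably recent pandas; the concat_by_id helper name is made up for illustration:

import pandas as pd

def concat_by_id(df, id_col="id", date_col="date"):
    # full id x date grid as a MultiIndex
    full_index = pd.MultiIndex.from_product(
        [sorted(df[id_col].unique()), sorted(df[date_col].unique())],
        names=[id_col, date_col],
    )
    filled = (
        df.set_index([id_col, date_col])
        .reindex(full_index)             # missing (id, date) rows appear as NaN
        .groupby(level=0).ffill()        # fill from the day before, per id
        .groupby(level=0).bfill()        # then from the day after, per id
    )
    # one row per id, columns grouped date by date
    wide = filled.unstack(date_col).sort_index(axis=1, level=1)
    wide.columns = [f"val{i}" for i in range(wide.shape[1])]
    return wide

print(concat_by_id(df))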

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Laurent