Merge dataframes with different numbers of rows in Python
I start with a dataframe, df, of temperature data per country that looks like this:
|date | Brasil | Colombia | Peru | Mexico|
------------------------------------------------
|2020-09-25|1 |4 |3 |2 |
|2020-09-26|3 |4 |3 |1 |
|2020-09-27|7 |5 |4 |3 |
|2020-09-28|0 |2 |4 |6 |
|2020-09-29|2 |4 |3 |4 |
|2020-09-30|1 |2 |3 |4 |
I need to add the dates that are missing up to today. The data for those dates comes from one dataframe per country, stored as files in a directory on my computer. For Colombia, for example, it looks like this:
|datetime |temp |temp_max |
------------------------------
|2020-10-01 |5 |23 |
|2020-10-02 |6 |22 |
|2020-10-01 |9 |23 |
|2020-10-03 |4 |25 |
|2020-10-04 |2 |4 |
|2020-10-05 |4 |12 |
|2020-10-06 |1 |11 |
The files in the directory are named like this: Colombia_20201006. I wrote the following code to read the files and complete the df dataframe.
import os
import numpy
import pandas as pd

# Country names are every column of df except the first one ("date").
columnas = numpy.array(df.columns)
columnas = numpy.delete(columnas, 0)
path = "D:/Users/user/Documentos/"
lista_archivos = os.listdir(path)
for j in columnas:  # iterate over countries
    # Collect the files whose name starts with the country name.
    paises_archivos = []
    for i in lista_archivos:
        if i.startswith(f"{j}"):
            paises_archivos.append(path + i)
    print(f"The files for country {j} are: {paises_archivos}")
    for g in paises_archivos:
        paises_concat = pd.read_csv(f"{g}")
        paises_concat = paises_concat[["datetime", "temp"]]
        # The outer merge keeps dates that appear in either frame.
        df_out = df.merge(paises_concat, how="outer", left_on="date", right_on="datetime")
        df_out = df_out.fillna("")
        # Copy dates and temperatures that exist only in the country file.
        df_out.loc[(df_out.date == "") & (df_out.datetime != ""), "date"] = df_out["datetime"]
        df_out.loc[(df_out[f"{j}"] == "") & (df_out.temp != ""), f"{j}"] = df_out["temp"]
        df_out = df_out.drop(["datetime", "temp"], axis=1)
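The file lookup in the loop above matches on a name prefix only, so a file such as a hypothetical Colombian_notes.txt would also be picked up for Colombia. A minimal sketch of a stricter lookup follows, assuming the files are named Country_YYYYMMDD with or without an extension (the question does not say which). The function name group_files_by_country is made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_country(directory, countries):
    """Map each country to the files named <country>_<8 digits> in directory.

    Assumption: the part before the first underscore is the country name
    and the part after it is a YYYYMMDD stamp; anything else is ignored.
    """
    wanted = set(countries)
    files = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # path.stem drops a trailing extension such as ".csv" if there is one.
        country, sep, stamp = path.stem.partition("_")
        if sep and country in wanted and len(stamp) == 8 and stamp.isdigit():
            files[country].append(path)
    return dict(files)
```

Countries without any matching file are simply absent from the returned dict, so the caller can decide how to handle them.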
The problem is that the files do not necessarily cover the same dates. For example, it is possible (not certain) that some file has no data between 2020-10-01 and 2022-01-31, so its dataframe looks like this (let's assume this is the case for Peru):
|datetime |temp |temp_max |
------------------------------
|2022-02-01 |5 |13 |
|2022-02-02 |8 |12 |
|2022-02-03 |9 |13 |
|2022-02-04 |1 |15 |
|2022-02-05 |2 |14 |
|2022-02-06 |5 |14 |
|2022-02-07 |1 |15 |
What I need is for the gaps before 2022-02-01 to be filled with NA, so that the result looks more or less like this:
|date | Brasil | Colombia | Peru | Mexico|
------------------------------------------------
|2020-09-25|1 |4 |3 |2 |
|2020-09-26|3 |4 |3 |1 |
|2020-09-27|7 |5 |4 |3 |
|2020-09-28|0 |2 |4 |6 |
|2020-09-29|2 |4 |3 |4 |
|2020-09-30|1 |2 |3 |4 |
|2020-10-01|1 |2 |NA |4 |
|2020-10-02|1 |2 |NA |4 |
|2020-10-03|1 |2 |NA |4 |
|2020-10-04|1 |2 |NA |4 |
|... |... |... |... |... |
|2022-02-01|1 |5 |5 |6 |
|2022-02-02|1 |6 |8 |4 |
|2022-02-03|1 |9 |9 |8 |
|2022-02-04|1 |4 |1 |2 |
|2022-02-05|1 |2 |2 |1 |
|2022-02-06|1 |4 |5 |4 |
|2022-02-07|1 |1 |1 |3 |
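If the result should have one row for every calendar day, including days that appear in no file at all, the date index can be expanded explicitly. This is only a sketch under that assumption: it expects the date column to hold unique dates, and the helper name fill_missing_dates is hypothetical.

```python
import pandas as pd


def fill_missing_dates(df, date_col="date"):
    """Return df with one row per day between its first and last date.

    Days that were absent get NaN in every country column.
    Assumption: the values in date_col are unique.
    """
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out = out.set_index(date_col).sort_index()
    # Build the complete daily calendar and align the data to it.
    full_range = pd.date_range(out.index.min(), out.index.max(), freq="D")
    return out.reindex(full_range).rename_axis(date_col).reset_index()
```

Integer columns become floats in the result, because NaN cannot be stored in a plain integer column.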
But right now the code leaves the dataframe as follows:
|date | Brasil | Colombia | Peru | Mexico|
------------------------------------------------
|2020-09-25|1 |4 |3 |2 |
|2020-09-26|3 |4 |3 |1 |
|2020-09-27|7 |5 |4 |3 |
|2020-09-28|0 |2 |4 |6 |
|2020-09-29|2 |4 |3 |4 |
|2020-09-30|1 |2 |3 |4 |
|2022-02-01|1 |5 |5 |6 |
|2022-02-02|1 |6 |8 |4 |
|2022-02-03|1 |9 |9 |8 |
|2022-02-04|1 |4 |1 |2 |
|2022-02-05|1 |2 |2 |1 |
|2022-02-06|1 |4 |5 |4 |
|2022-02-07|1 |1 |1 |3 |
|2020-10-01|1 |2 | |4 |
|2020-10-02|1 |2 | |4 |
|2020-10-03|1 |2 | |4 |
|2020-10-04|1 |2 | |4 |
|... |... |... |... |... |
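Two things in this output can be traced to the code: the blank cells come from fillna(""), which replaces NaN with empty strings, and nothing sorts the rows by date after the merge. A hedged sketch of the same merge for a single country that keeps NaN and sorts at the end is shown below. The helper name merge_country is made up, and it assumes each date appears only once in the country file.

```python
import pandas as pd


def merge_country(df, country_df, country):
    """Outer-merge one country's file into df, keeping NaN for missing values."""
    left = df.assign(date=pd.to_datetime(df["date"]))
    right = country_df[["datetime", "temp"]].assign(
        datetime=pd.to_datetime(country_df["datetime"])
    )
    merged = left.merge(right, how="outer", left_on="date", right_on="datetime")
    # Rows that exist only in the country file have no date yet: take datetime.
    merged["date"] = merged["date"].fillna(merged["datetime"])
    # Existing values win; temp only fills cells that are still missing.
    merged[country] = merged[country].fillna(merged["temp"])
    return (
        merged.drop(columns=["datetime", "temp"])
        .sort_values("date")
        .reset_index(drop=True)
    )
```

Calling it once per country and feeding each result into the next call accumulates the columns, instead of restarting from the original df every time.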
I really need the code to be robust to the possibility that there may not be data for all dates for all countries. Can you guys help me?
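One possible way to make this robust, sketched here rather than offered as a definitive solution, is to index everything by date and let pandas align the frames, so that a country with no data on a given date simply gets NaN. The names extend_with_country_data and country_frames are hypothetical. The sketch assumes the dict keys match column names in df, and it keeps the last row when a date is repeated in a file (as 2020-10-01 is in the Colombia sample), which is an arbitrary choice.

```python
import pandas as pd


def extend_with_country_data(df, country_frames):
    """Extend the wide frame df with the dates found in country_frames.

    df             : frame with a 'date' column and one column per country
    country_frames : dict of country name -> frame with 'datetime' and 'temp'
    """
    wide = df.copy()
    wide["date"] = pd.to_datetime(wide["date"])
    wide = wide.set_index("date")
    if not country_frames:
        return wide.sort_index().reset_index()

    new_columns = {}
    for country, frame in country_frames.items():
        new_columns[country] = (
            frame.assign(datetime=pd.to_datetime(frame["datetime"]))
            # Assumption: on a repeated date, the last row is the one to keep.
            .drop_duplicates(subset="datetime", keep="last")
            .set_index("datetime")["temp"]
        )

    # concat aligns the series on the union of their dates; gaps become NaN.
    updates = pd.concat(new_columns, axis=1)
    updates.index.name = "date"

    # Values already in df are kept; updates only fill what df does not have.
    combined = wide.combine_first(updates).sort_index()
    # combine_first may reorder columns, so restore the order of df.
    return combined[wide.columns].rename_axis("date").reset_index()
```

With the files read through pd.read_csv into a dict such as {"Colombia": colombia_df, "Peru": peru_df}, a single call returns the table sorted by date, with NaN wherever a country has no reading for that date.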
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
