'Reorganize data using Dataframe


Hi! I received a very disorganized and messy excel file and I will need to re-organize them into more presentable format. However I am stuck on how to proceed. :(

Data received:

| 2020 | 2021 | 2022 | 2022 % Total | 2023E | 2024E | 2025E | ... |
| ---- | ---- | ---- | ------------ | ----- | ----- | ----- | --- |
|  0   |  3   |  6   |      9       |  12   |  15   |  18   | ... |
|  1   |  4   |  7   |      10      |  13   |  16   |  19   | ... |
|  2   |  5   |  8   |      11      |  14   |  17   |  20   | ... |

Expected output:
| Year | Value |
| ---- | ----- |
| 2020 |   0   |
| 2020 |   1   |
| 2020 |   2   |
| 2021 |   3   |
| 2021 |   4   |
| 2021 |   5   |
| 2022 |   6   |
| 2022 |   7   |
| 2022 |   8   |

The headers of the received file contains various year, starting from 2020. How it works is I will only need the data starting from oldest year (2020) to latest valid year (2022), any data that comes after header with latest valid year is not required (e.g. starting with header containing " % Total"). The latest valid year will keep adding on every year, so next year I will expect a new "2023" column on the 4th column.

After that I will need to append data from "2020", "2021" and "2022" to a new "Value" column. A new "Year" column will also be created for the corresponding year header.

I am not sure whether it is something that can be achieved using Dataframe.

Any suggestions will be greatly appreciated!

Regards,
Shan



Solution 1:[1]

If you know the keys you want (i.e. 2020, 2021, 2022), you can do this:

import pandas as pd

# Create a dummy dataframe
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8]
})

keys = ["2020", "2021", "2022"]
df[keys[0]]


arr = []  # 2D array to hold restructured data
for key in keys:
    arr.extend([[key, v] for v in df[key]])

new_df = pd.DataFrame(arr, columns=["Year", "Value"])
new_df.head()

You could generate the years list in code too, instead of hard coding that:

start = 2020
end = 2022
keys = [str(i) for i in range(start, end + 1)]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 kwsp