'How to make individual columns for grouped excel columns using pandas
I have an excel database which has columns grouped as shown below:
I have tried to create a pandas dataframe using the following code:
import pandas as pd
df = pd.read_html("filename.xls")
print(df[3])
which gives me the following output:
0 1 2 3 4 ... 7 8 9 10 11
0 1 28AP2000000002 Andhra Pradesh 1999-00 Guntur ... 0.00 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE09350.00MT
1 13 28AP20010000013 Andhra Pradesh 2000-01 Anantapur ... 0.00 32.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE03200.00MT
2 14 28AP20010000228 Andhra Pradesh 2001-02 Vizianagaram ... 0.00 23.65 NHB Completed COLD / CA STORAGECOLD STORAGE02365.00MT
3 15 28AP20010000014 Andhra Pradesh 2001-02 East Godavari ... 0.00 25.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE02500.00MT
4 16 28AP20010000015 Andhra Pradesh 2001-02 Vizianagaram ... 0.00 31.50 Through NABARD Completed COLD / CA STORAGECOLD STORAGE03150.00MT
.. .. ... ... ... ... ... ... ... ... ... ...
80 92 28AAP0CS000036 Andhra Pradesh 2017-18 Guntur ... 965.00 214.20 NHB Completed COLD / CA STORAGECOLD STORAGE49760.00MT
81 93 28AAP0CS000029 Andhra Pradesh 2018-19 Guntur ... 40.00 10.82 NHB Completed COLD / CA STORAGECA STORAGE23000.00MT
82 94 28AAP0CS000034 Andhra Pradesh 2018-19 Kadapa ... 994.53 181.54 NHB Completed COLD / CA STORAGECOLD STORAGE414357.00MT
83 95 28AAP0CS000039 Andhra Pradesh 2019-20 Kurnool ... 612.00 190.13 NHB Completed COLD / CA STORAGECOLD STORAGE49930.00MT
84 96 28AAP0CS000038 Andhra Pradesh 2019-20 Kurnool ... 600.00 188.25 NHB Completed COLD / CA STORAGECOLD STORAGE410346.00MT
In the generated dataframe above, you can see that the last column is merged because of grouped tables in the excel file. Can anyone suggest a solution to separate it out based on the final column names?
Solution 1:[1]
Instead of trying to split the columns afterwards, I propose to get the proper columns from the reader. It works with pd.read_excel() and should also work with pd.read_html() according to the documentation. I can't test with pd.read_html() as I don't have a matching source.
The issue is that you are just passing the file and letting pandas decide how to read it. It is fine for a basic file where the first row is the column title and then the other ones the data. But when you have a different form as you have, you need to guide it.
Therefore, you can choose to skip the n first rows and let it know what the column names are. header is used to tell which row should be used for column names, since you don't have any row with all the titles, I chose to use None and manually add the column titles.
# Use skiprows to indicate how many rows to ignore
df = pd.read_html("filename.xls", skiprows=12, header=None)
# Rename columns to match your file.
df.columns = ["SI. No", "Project Cc", "State", "Year of Sub", "District", "Name of E", "Project A", "Total Project"]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ssayan |
