'How to make individual columns for grouped excel columns using pandas

I have an excel database which has columns grouped as shown below: Head of grouped data I have tried to create a pandas dataframe using the following code:

import pandas as pd
df = pd.read_html("filename.xls")
print(df[3])

which gives me the following output:

 0                1               2        3              4   ...      7       8               9          10                                        11
0    1   28AP2000000002  Andhra Pradesh  1999-00         Guntur  ...    0.00   50.00  Through NABARD  Completed   COLD / CA STORAGECOLD STORAGE09350.00MT
1   13  28AP20010000013  Andhra Pradesh  2000-01      Anantapur  ...    0.00   32.00  Through NABARD  Completed   COLD / CA STORAGECOLD STORAGE03200.00MT
2   14  28AP20010000228  Andhra Pradesh  2001-02   Vizianagaram  ...    0.00   23.65             NHB  Completed   COLD / CA STORAGECOLD STORAGE02365.00MT
3   15  28AP20010000014  Andhra Pradesh  2001-02  East Godavari  ...    0.00   25.00  Through NABARD  Completed   COLD / CA STORAGECOLD STORAGE02500.00MT
4   16  28AP20010000015  Andhra Pradesh  2001-02   Vizianagaram  ...    0.00   31.50  Through NABARD  Completed   COLD / CA STORAGECOLD STORAGE03150.00MT
..  ..              ...             ...      ...            ...  ...     ...     ...             ...        ...                                       ...
80  92   28AAP0CS000036  Andhra Pradesh  2017-18         Guntur  ...  965.00  214.20             NHB  Completed   COLD / CA STORAGECOLD STORAGE49760.00MT
81  93   28AAP0CS000029  Andhra Pradesh  2018-19         Guntur  ...   40.00   10.82             NHB  Completed     COLD / CA STORAGECA STORAGE23000.00MT
82  94   28AAP0CS000034  Andhra Pradesh  2018-19         Kadapa  ...  994.53  181.54             NHB  Completed  COLD / CA STORAGECOLD STORAGE414357.00MT
83  95   28AAP0CS000039  Andhra Pradesh  2019-20        Kurnool  ...  612.00  190.13             NHB  Completed   COLD / CA STORAGECOLD STORAGE49930.00MT
84  96   28AAP0CS000038  Andhra Pradesh  2019-20        Kurnool  ...  600.00  188.25             NHB  Completed  COLD / CA STORAGECOLD STORAGE410346.00MT

In the generated dataframe above, you can see that the last column is merged because of grouped tables in the excel file. Can anyone suggest a solution to separate it out based on the final column names?



Solution 1:[1]

Instead of trying to split the columns afterwards, I propose to get the proper columns from the reader. It works with pd.read_excel() and should also work with pd.read_html() according to the documentation. I can't test with pd.read_html() as I don't have a matching source.

The issue is that you are just passing the file and letting pandas decide how to read it. It is fine for a basic file where the first row is the column title and then the other ones the data. But when you have a different form as you have, you need to guide it.

Therefore, you can choose to skip the n first rows and let it know what the column names are. header is used to tell which row should be used for column names, since you don't have any row with all the titles, I chose to use None and manually add the column titles.

# Use skiprows to indicate how many rows to ignore
df = pd.read_html("filename.xls", skiprows=12, header=None)
# Rename columns to match your file.
df.columns = ["SI. No", "Project Cc", "State", "Year of Sub", "District", "Name of E", "Project A", "Total Project"]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ssayan