'Read multi-dimensional table from Excel into DataFrame

I have a UK 2011 Census Statistics file. I have written somewhat painful code to load it but am wondering if there is a simpler solution.

The file has many sheets, with similar but slightly varying structure. Here is an image of a complex one:

enter image description here

So, the general description of a sheet is:

  1. Some text rows
  2. One or two rows of hierarchical column headings
  3. Either one level, or two-level hierarchical data rows

I wish to create a DataFrame with columns for each of the column and row categories, and one column for the table data. The example sheet should give:

  Age 16 and over All households    Sex             Household type Dataset
0             All            All    All                        All       1
1             All            All    All      In a couple household       2
2             All            All    All  Not in a couple household       3
3             All            All  Males                        All       4
4             All            All  Males      In a couple household       5
...

(I have converted entries of the type "All categories: X" to "All".)

I use pd.parse to load the sheet. Here is a fragment of the resulting DataFrame:

                                                    0                               1                      2                          3
0                                       Back to Index                             NaN                    NaN                        NaN
1                                                 NaN                             NaN                    NaN                        NaN
2   DC1112EWla: Age of youngest dependent child by...                             NaN                    NaN                        NaN
3                                                 NaN                             NaN                    NaN                        NaN
4   Table population: All usual residents aged 16 ...                             NaN                    NaN                        NaN
5                                                 NaN                             NaN                    NaN                        NaN
6                                                 NaN             All categories: Sex                    NaN                        NaN
7                                                 NaN  All categories: Household type  In a couple household  Not in a couple household
8                     All categories: Age 16 and over                             NaN                    NaN                        NaN
9                      All categories: All households                               1                      2                          3
10              Households with no dependent children                              10                     11                         12
11          Households with dependent children: Total                              19                     20                         21
12               Youngest dependent child: Age 0 to 4                              28                     29                         30
13               Youngest dependent child: Age 5 to 9                              37                     38                         39
14             Youngest dependent child: Age 10 to 15                              46                     47                         48
15             Youngest dependent child: Age 16 to 18                              55                     56                         57
16                                       Age 16 to 24                             NaN                    NaN                        NaN
17                     All categories: All households                              64                     65                         66
18              Households with no dependent children                              73                     74                         75
19          Households with dependent children: Total                              82                     83                         84
20               Youngest dependent child: Age 0 to 4                              91                     92                         93
21               Youngest dependent child: Age 5 to 9                             100                    101                        102
22             Youngest dependent child: Age 10 to 15                             109                    110                        111

I have written somewhat painful code that extracts this data from the sheet:

  • col_level_names - names of the column categories, e.g. ['Sex', 'Household type']
  • col_level_values - values of the column categories for every column, e.g. [['All', 'All', 'All', 'Males', 'Males', 'Males', 'Females', 'Females', 'Females'], ['All', 'In a couple household', ...]
  • row_level_names - names of the row categories, e.g. ['Age 16 and over', 'All households']
  • row_level_values - values of the row categories for every row, e.g. [['All', ..., 'Age 65 and over'], ['All', 'Households with no dependent children', ...]]

I then construct the DataFrame like this:

num_cols = len(col_level_values[0])
num_rows = len(row_level_values[0])
for l in range(row_levels):
    # Repeat row values in order
    row_level_values[l] = [x for x in row_level_values[l]
                           for n in range(num_cols)]
for l in range(col_levels):
    # Repeat col values in turn
    col_level_values[l] = col_level_values[l] * num_rows
values = []
for r in range(0, num_rows):
    row = data_row_indexes[r]
    values.extend(table.iloc[row, 1:])
data = [row_level_values[l] for l in range(row_levels)] + \
       [col_level_values[l] for l in range(col_levels)] + \
    [values]
index = row_level_names+col_level_names+['Dataset']
df = pd.DataFrame(
    data=data,
    index=index
)
df = df.transpose()

Is there a simpler way to parse the sheet and construct my DataFrame?

FYI Here is the sheet in dictionary form:

{0: {0: 'Back to Index', 1: nan, 2: 'DC1112EWla: Age of youngest dependent child by household type by sex by age', 3: nan, 4: 'Table population: All usual residents aged 16 and over in households', 5: nan, 6: nan, 7: nan, 8: 'All categories: Age 16 and over', 9: 'All categories: All households', 10: 'Households with no dependent children', 11: 'Households with dependent children: Total', 12: 'Youngest dependent child: Age 0 to 4', 13: 'Youngest dependent child: Age 5 to 9', 14: 'Youngest dependent child: Age 10 to 15', 15: 'Youngest dependent child: Age 16 to 18', 16: 'Age 16 to 24', 17: 'All categories: All households', 18: 'Households with no dependent children', 19: 'Households with dependent children: Total', 20: 'Youngest dependent child: Age 0 to 4', 21: 'Youngest dependent child: Age 5 to 9', 22: 'Youngest dependent child: Age 10 to 15', 23: 'Youngest dependent child: Age 16 to 18', 24: 'Age 25 to 34', 25: 'All categories: All households', 26: 'Households with no dependent children', 27: 'Households with dependent children: Total', 28: 'Youngest dependent child: Age 0 to 4', 29: 'Youngest dependent child: Age 5 to 9', 30: 'Youngest dependent child: Age 10 to 15', 31: 'Youngest dependent child: Age 16 to 18', 32: 'Age 35 to 49', 33: 'All categories: All households', 34: 'Households with no dependent children', 35: 'Households with dependent children: Total', 36: 'Youngest dependent child: Age 0 to 4', 37: 'Youngest dependent child: Age 5 to 9', 38: 'Youngest dependent child: Age 10 to 15', 39: 'Youngest dependent child: Age 16 to 18', 40: 'Age 50 to 64', 41: 'All categories: All households', 42: 'Households with no dependent children', 43: 'Households with dependent children: Total', 44: 'Youngest dependent child: Age 0 to 4', 45: 'Youngest dependent child: Age 5 to 9', 46: 'Youngest dependent child: Age 10 to 15', 47: 'Youngest dependent child: Age 16 to 18', 48: 'Age 65 and over', 49: 'All categories: All households', 50: 'Households with no dependent children', 51: 'Households with dependent children: Total', 52: 'Youngest dependent child: Age 0 to 4', 53: 'Youngest dependent child: Age 5 to 9', 54: 'Youngest dependent child: Age 10 to 15', 55: 'Youngest dependent child: Age 16 to 18', 56: nan, 57: 'Crown Copyright applies unless otherwise stated, [email protected]'}, 1: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: 'All categories: Sex', 7: 'All categories: Household type', 8: nan, 9: 1, 10: 10, 11: 19, 12: 28, 13: 37, 14: 46, 15: 55, 16: nan, 17: 64, 18: 73, 19: 82, 20: 91, 21: 100, 22: 109, 23: 118, 24: nan, 25: 127, 26: 136, 27: 145, 28: 154, 29: 163, 30: 172, 31: 181, 32: nan, 33: 190, 34: 199, 35: 208, 36: 217, 37: 226, 38: 235, 39: 244, 40: nan, 41: 253, 42: 262, 43: 271, 44: 280, 45: 289, 46: 298, 47: 307, 48: nan, 49: 316, 50: 325, 51: 334, 52: 343, 53: 352, 54: 361, 55: 370, 56: nan, 57: nan}, 2: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'In a couple household', 8: nan, 9: 2, 10: 11, 11: 20, 12: 29, 13: 38, 14: 47, 15: 56, 16: nan, 17: 65, 18: 74, 19: 83, 20: 92, 21: 101, 22: 110, 23: 119, 24: nan, 25: 128, 26: 137, 27: 146, 28: 155, 29: 164, 30: 173, 31: 182, 32: nan, 33: 191, 34: 200, 35: 209, 36: 218, 37: 227, 38: 236, 39: 245, 40: nan, 41: 254, 42: 263, 43: 272, 44: 281, 45: 290, 46: 299, 47: 308, 48: nan, 49: 317, 50: 326, 51: 335, 52: 344, 53: 353, 54: 362, 55: 371, 56: nan, 57: nan}, 3: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'Not in a couple household', 8: nan, 9: 3, 10: 12, 11: 21, 12: 30, 13: 39, 14: 48, 15: 57, 16: nan, 17: 66, 18: 75, 19: 84, 20: 93, 21: 102, 22: 111, 23: 120, 24: nan, 25: 129, 26: 138, 27: 147, 28: 156, 29: 165, 30: 174, 31: 183, 32: nan, 33: 192, 34: 201, 35: 210, 36: 219, 37: 228, 38: 237, 39: 246, 40: nan, 41: 255, 42: 264, 43: 273, 44: 282, 45: 291, 46: 300, 47: 309, 48: nan, 49: 318, 50: 327, 51: 336, 52: 345, 53: 354, 54: 363, 55: 372, 56: nan, 57: nan}, 4: {0: nan, 1: nan, 2: ' ', 3: nan, 4: nan, 5: nan, 6: 'Males', 7: 'All categories: Household type', 8: nan, 9: 4, 10: 13, 11: 22, 12: 31, 13: 40, 14: 49, 15: 58, 16: nan, 17: 67, 18: 76, 19: 85, 20: 94, 21: 103, 22: 112, 23: 121, 24: nan, 25: 130, 26: 139, 27: 148, 28: 157, 29: 166, 30: 175, 31: 184, 32: nan, 33: 193, 34: 202, 35: 211, 36: 220, 37: 229, 38: 238, 39: 247, 40: nan, 41: 256, 42: 265, 43: 274, 44: 283, 45: 292, 46: 301, 47: 310, 48: nan, 49: 319, 50: 328, 51: 337, 52: 346, 53: 355, 54: 364, 55: 373, 56: nan, 57: nan}, 5: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'In a couple household', 8: nan, 9: 5, 10: 14, 11: 23, 12: 32, 13: 41, 14: 50, 15: 59, 16: nan, 17: 68, 18: 77, 19: 86, 20: 95, 21: 104, 22: 113, 23: 122, 24: nan, 25: 131, 26: 140, 27: 149, 28: 158, 29: 167, 30: 176, 31: 185, 32: nan, 33: 194, 34: 203, 35: 212, 36: 221, 37: 230, 38: 239, 39: 248, 40: nan, 41: 257, 42: 266, 43: 275, 44: 284, 45: 293, 46: 302, 47: 311, 48: nan, 49: 320, 50: 329, 51: 338, 52: 347, 53: 356, 54: 365, 55: 374, 56: nan, 57: nan}, 6: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'Not in a couple household', 8: nan, 9: 6, 10: 15, 11: 24, 12: 33, 13: 42, 14: 51, 15: 60, 16: nan, 17: 69, 18: 78, 19: 87, 20: 96, 21: 105, 22: 114, 23: 123, 24: nan, 25: 132, 26: 141, 27: 150, 28: 159, 29: 168, 30: 177, 31: 186, 32: nan, 33: 195, 34: 204, 35: 213, 36: 222, 37: 231, 38: 240, 39: 249, 40: nan, 41: 258, 42: 267, 43: 276, 44: 285, 45: 294, 46: 303, 47: 312, 48: nan, 49: 321, 50: 330, 51: 339, 52: 348, 53: 357, 54: 366, 55: 375, 56: nan, 57: nan}, 7: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: 'Females', 7: 'All categories: Household type', 8: nan, 9: 7, 10: 16, 11: 25, 12: 34, 13: 43, 14: 52, 15: 61, 16: nan, 17: 70, 18: 79, 19: 88, 20: 97, 21: 106, 22: 115, 23: 124, 24: nan, 25: 133, 26: 142, 27: 151, 28: 160, 29: 169, 30: 178, 31: 187, 32: nan, 33: 196, 34: 205, 35: 214, 36: 223, 37: 232, 38: 241, 39: 250, 40: nan, 41: 259, 42: 268, 43: 277, 44: 286, 45: 295, 46: 304, 47: 313, 48: nan, 49: 322, 50: 331, 51: 340, 52: 349, 53: 358, 54: 367, 55: 376, 56: nan, 57: nan}, 8: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'In a couple household', 8: nan, 9: 8, 10: 17, 11: 26, 12: 35, 13: 44, 14: 53, 15: 62, 16: nan, 17: 71, 18: 80, 19: 89, 20: 98, 21: 107, 22: 116, 23: 125, 24: nan, 25: 134, 26: 143, 27: 152, 28: 161, 29: 170, 30: 179, 31: 188, 32: nan, 33: 197, 34: 206, 35: 215, 36: 224, 37: 233, 38: 242, 39: 251, 40: nan, 41: 260, 42: 269, 43: 278, 44: 287, 45: 296, 46: 305, 47: 314, 48: nan, 49: 323, 50: 332, 51: 341, 52: 350, 53: 359, 54: 368, 55: 377, 56: nan, 57: nan}, 9: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'Not in a couple household', 8: nan, 9: 9, 10: 18, 11: 27, 12: 36, 13: 45, 14: 54, 15: 63, 16: nan, 17: 72, 18: 81, 19: 90, 20: 99, 21: 108, 22: 117, 23: 126, 24: nan, 25: 135, 26: 144, 27: 153, 28: 162, 29: 171, 30: 180, 31: 189, 32: nan, 33: 198, 34: 207, 35: 216, 36: 225, 37: 234, 38: 243, 39: 252, 40: nan, 41: 261, 42: 270, 43: 279, 44: 288, 45: 297, 46: 306, 47: 315, 48: nan, 49: 324, 50: 333, 51: 342, 52: 351, 53: 360, 54: 369, 55: 378, 56: nan, 57: nan}}

Here is my complete code:

import pandas as pd

def read_table(table_name):
    table = f.parse(sheet_name=table_name, header=None)
    # Column headings start at row 6, upto row that has data in column 0
    ALL_CATEGORIES = "All categories: "
    COL_HEAD_ROW = 6
    row = COL_HEAD_ROW
    while pd.isna(table.iloc[row, 0]):
        row += 1
    # Previous row has column names, earlier rows (if any) have hierarchy
    col_levels = row - COL_HEAD_ROW
    col_level_names = []
    col_level_values = [[] for _ in range(col_levels)]
    col = 1
    row = COL_HEAD_ROW
    while col < len(table.columns):
        for r in range(col_levels):
            if pd.isna(table.iloc[row+r, col]):
                col_level_values[r].append(col_level_values[r][-1])
            else:
                value = table.iloc[row+r, col]
                if value.startswith(ALL_CATEGORIES):
                    category = value[len(ALL_CATEGORIES):]
                    value = 'All'
                    if len(col_level_names) <= r:
                        col_level_names.append(category)
                col_level_values[r].append(value)
        col += 1

    # Row headings start after column headings, last level has data in column 1
    ROW_HEAD_ROW = COL_HEAD_ROW+col_levels
    row = ROW_HEAD_ROW
    while pd.isna(table.iloc[row, 1]):
        row += 1
    # This row has row name, earlier rows (if any) have hierarchy
    row_levels = row + 1 - ROW_HEAD_ROW
    row_level_names = []
    row_level_values = [[] for _ in range(row_levels)]
    row = ROW_HEAD_ROW
    level = -1
    wasLevel = True
    data_row_indexes = []
    while row < len(table.index) and not pd.isna(table.iloc[row, 0]):
        if wasLevel or pd.isna(table.iloc[row, 1]):
            level += 1
            if level >= row_levels:
                level = 0
            if pd.isna(table.iloc[row, 1]):
                wasLevel = True
            else:
                wasLevel = False
        value = table.iloc[row, 0]
        if value.startswith(ALL_CATEGORIES):
            category = value[len(ALL_CATEGORIES):]
            value = 'All'
            if len(row_level_names) < row_levels:
                row_level_names.append(category)
        row_level_values[level].append(value)
        if level+1 == row_levels:
            for l in range(level):
                if len(row_level_values[l]) < len(row_level_values[level]):
                    row_level_values[l].append(row_level_values[l][-1])
            data_row_indexes.append(row)
        row += 1
    # print('tableName='+table_name)
    # print(col_level_names)
    # print(col_level_values)
    # print(row_level_names)
    # print(row_level_values)

    # Construct DataFrame
    num_cols = len(col_level_values[0])
    num_rows = len(row_level_values[0])
    for l in range(row_levels):
        # Repeat row values in order
        row_level_values[l] = [x for x in row_level_values[l]
                               for n in range(num_cols)]
    for l in range(col_levels):
        # Repeat col values in turn
        col_level_values[l] = col_level_values[l] * num_rows
    values = []
    for r in range(0, num_rows):
        row = data_row_indexes[r]
        values.extend(table.iloc[row, 1:])
    data = [row_level_values[l] for l in range(row_levels)] + \
           [col_level_values[l] for l in range(col_levels)] + \
        [values]
    index = row_level_names+col_level_names+['Dataset']
    df = pd.DataFrame(
        data=data,
        index=index
    )
    df = df.transpose()
    return df


if __name__ == '__main__':

    excel_file = 'data/BulkdatadetailedcharacteristicsmergedwardspluslaandregE&Wandinfo3.3/Cell Numbered DC Tables 3.3.xlsx'
    f = pd.ExcelFile(excel_file)
    index = f.parse(sheet_name='Index')

    df = read_table(index['Table Number'][6])
    print(df.head())


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source