Read multi-dimensional table from Excel into DataFrame
I have a UK 2011 Census Statistics file. I have written somewhat painful code to load it but am wondering if there is a simpler solution.
The file has many sheets, with similar but slightly varying structure; one of the more complex sheets is reproduced in dictionary form at the end of this question.
So, the general description of a sheet is:
- Some text rows
- One or two rows of hierarchical column headings
- Either one level, or two-level hierarchical data rows
I wish to create a DataFrame with columns for each of the column and row categories, and one column for the table data. The example sheet should give:
Age 16 and over All households Sex Household type Dataset
0 All All All All 1
1 All All All In a couple household 2
2 All All All Not in a couple household 3
3 All All Males All 4
4 All All Males In a couple household 5
...
(I have converted entries of the type "All categories: X" to "All".)
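That conversion is a simple prefix check; a minimal sketch (the helper name split_label is mine, not from the code below, which also needs the category name that follows the prefix):

```python
ALL_CATEGORIES = "All categories: "

def split_label(label):
    """Map 'All categories: X' to ('All', 'X'); pass other labels through."""
    if label.startswith(ALL_CATEGORIES):
        return 'All', label[len(ALL_CATEGORIES):]
    return label, None

print(split_label('All categories: Sex'))   # ('All', 'Sex')
print(split_label('Males'))                 # ('Males', None)
```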
I use pd.ExcelFile.parse (with header=None) to load the sheet. Here is a fragment of the resulting DataFrame:
0 1 2 3
0 Back to Index NaN NaN NaN
1 NaN NaN NaN NaN
2 DC1112EWla: Age of youngest dependent child by... NaN NaN NaN
3 NaN NaN NaN NaN
4 Table population: All usual residents aged 16 ... NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN All categories: Sex NaN NaN
7 NaN All categories: Household type In a couple household Not in a couple household
8 All categories: Age 16 and over NaN NaN NaN
9 All categories: All households 1 2 3
10 Households with no dependent children 10 11 12
11 Households with dependent children: Total 19 20 21
12 Youngest dependent child: Age 0 to 4 28 29 30
13 Youngest dependent child: Age 5 to 9 37 38 39
14 Youngest dependent child: Age 10 to 15 46 47 48
15 Youngest dependent child: Age 16 to 18 55 56 57
16 Age 16 to 24 NaN NaN NaN
17 All categories: All households 64 65 66
18 Households with no dependent children 73 74 75
19 Households with dependent children: Total 82 83 84
20 Youngest dependent child: Age 0 to 4 91 92 93
21 Youngest dependent child: Age 5 to 9 100 101 102
22 Youngest dependent child: Age 10 to 15 109 110 111
I have written somewhat painful code that extracts this data from the sheet:
- col_level_names - names of the column categories, e.g. ['Sex', 'Household type']
- col_level_values - values of the column categories for every column, e.g. [['All', 'All', 'All', 'Males', 'Males', 'Males', 'Females', 'Females', 'Females'], ['All', 'In a couple household', ...]]
- row_level_names - names of the row categories, e.g. ['Age 16 and over', 'All households']
- row_level_values - values of the row categories for every row, e.g. [['All', ..., 'Age 65 and over'], ['All', 'Households with no dependent children', ...]]
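Once those four lists exist, the manual repeat-and-transpose construction can arguably be replaced by building two MultiIndexes and stacking. A sketch on toy values (not the real sheet's data):

```python
import pandas as pd

# Toy stand-ins for the four extracted structures
col_level_names = ['Sex', 'Household type']
col_level_values = [['All', 'All', 'Males', 'Males'],
                    ['All', 'Couple', 'All', 'Couple']]
row_level_names = ['Age 16 and over', 'All households']
row_level_values = [['All', 'All'], ['All', 'No children']]
values = [[1, 2, 3, 4], [5, 6, 7, 8]]   # the numeric body of the table

body = pd.DataFrame(
    values,
    index=pd.MultiIndex.from_arrays(row_level_values, names=row_level_names),
    columns=pd.MultiIndex.from_arrays(col_level_values, names=col_level_names),
)
# Stack every column level into the rows, yielding one 'Dataset' column
tidy = (body.stack(list(range(len(col_level_names))))
            .rename('Dataset')
            .reset_index())
print(tidy.head())
```

The stacked result has one row per (row labels, column labels) combination, which is exactly the tidy shape shown at the top of the question.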
I then construct the DataFrame like this:
num_cols = len(col_level_values[0])
num_rows = len(row_level_values[0])
for l in range(row_levels):
    # Repeat each row value num_cols times, in order
    row_level_values[l] = [x for x in row_level_values[l]
                           for n in range(num_cols)]
for l in range(col_levels):
    # Repeat the whole column sequence num_rows times
    col_level_values[l] = col_level_values[l] * num_rows
values = []
for r in range(num_rows):
    row = data_row_indexes[r]
    values.extend(table.iloc[row, 1:])
data = [row_level_values[l] for l in range(row_levels)] + \
       [col_level_values[l] for l in range(col_levels)] + \
       [values]
index = row_level_names + col_level_names + ['Dataset']
df = pd.DataFrame(data=data, index=index)
df = df.transpose()
Is there a simpler way to parse the sheet and construct my DataFrame?
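For the interleaved row hierarchy (group rows such as "Age 16 to 24" carry no data in the value columns), one simpler-looking alternative to a level/wasLevel state machine is a vectorised mask plus ffill. A sketch on a toy grid shaped like the sheet:

```python
import numpy as np
import pandas as pd

# Toy grid: column 0 interleaves outer group rows and leaf rows;
# group rows have NaN in the data columns.
table = pd.DataFrame({
    0: ['All categories: Age 16 and over', 'All categories: All households',
        'No dependent children', 'Age 16 to 24',
        'All categories: All households', 'No dependent children'],
    1: [np.nan, 1, 10, np.nan, 64, 73],
    2: [np.nan, 2, 11, np.nan, 65, 74],
})

is_group = table[1].isna()                 # rows with no data = outer level
outer = table[0].where(is_group).ffill()   # broadcast group label downward
leaves = table.loc[~is_group].copy()       # keep only the data rows
leaves.insert(0, 'outer', outer[~is_group])
print(leaves)
```

This replaces the row-walking loop with three vectorised operations, at the cost of assuming that "no data in the value columns" reliably identifies a group row.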
FYI, here is the sheet in dictionary form:
{0: {0: 'Back to Index', 1: nan, 2: 'DC1112EWla: Age of youngest dependent child by household type by sex by age', 3: nan, 4: 'Table population: All usual residents aged 16 and over in households', 5: nan, 6: nan, 7: nan, 8: 'All categories: Age 16 and over', 9: 'All categories: All households', 10: 'Households with no dependent children', 11: 'Households with dependent children: Total', 12: 'Youngest dependent child: Age 0 to 4', 13: 'Youngest dependent child: Age 5 to 9', 14: 'Youngest dependent child: Age 10 to 15', 15: 'Youngest dependent child: Age 16 to 18', 16: 'Age 16 to 24', 17: 'All categories: All households', 18: 'Households with no dependent children', 19: 'Households with dependent children: Total', 20: 'Youngest dependent child: Age 0 to 4', 21: 'Youngest dependent child: Age 5 to 9', 22: 'Youngest dependent child: Age 10 to 15', 23: 'Youngest dependent child: Age 16 to 18', 24: 'Age 25 to 34', 25: 'All categories: All households', 26: 'Households with no dependent children', 27: 'Households with dependent children: Total', 28: 'Youngest dependent child: Age 0 to 4', 29: 'Youngest dependent child: Age 5 to 9', 30: 'Youngest dependent child: Age 10 to 15', 31: 'Youngest dependent child: Age 16 to 18', 32: 'Age 35 to 49', 33: 'All categories: All households', 34: 'Households with no dependent children', 35: 'Households with dependent children: Total', 36: 'Youngest dependent child: Age 0 to 4', 37: 'Youngest dependent child: Age 5 to 9', 38: 'Youngest dependent child: Age 10 to 15', 39: 'Youngest dependent child: Age 16 to 18', 40: 'Age 50 to 64', 41: 'All categories: All households', 42: 'Households with no dependent children', 43: 'Households with dependent children: Total', 44: 'Youngest dependent child: Age 0 to 4', 45: 'Youngest dependent child: Age 5 to 9', 46: 'Youngest dependent child: Age 10 to 15', 47: 'Youngest dependent child: Age 16 to 18', 48: 'Age 65 and over', 49: 'All categories: All households', 50: 'Households with no dependent 
children', 51: 'Households with dependent children: Total', 52: 'Youngest dependent child: Age 0 to 4', 53: 'Youngest dependent child: Age 5 to 9', 54: 'Youngest dependent child: Age 10 to 15', 55: 'Youngest dependent child: Age 16 to 18', 56: nan, 57: 'Crown Copyright applies unless otherwise stated, [email protected]'}, 1: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: 'All categories: Sex', 7: 'All categories: Household type', 8: nan, 9: 1, 10: 10, 11: 19, 12: 28, 13: 37, 14: 46, 15: 55, 16: nan, 17: 64, 18: 73, 19: 82, 20: 91, 21: 100, 22: 109, 23: 118, 24: nan, 25: 127, 26: 136, 27: 145, 28: 154, 29: 163, 30: 172, 31: 181, 32: nan, 33: 190, 34: 199, 35: 208, 36: 217, 37: 226, 38: 235, 39: 244, 40: nan, 41: 253, 42: 262, 43: 271, 44: 280, 45: 289, 46: 298, 47: 307, 48: nan, 49: 316, 50: 325, 51: 334, 52: 343, 53: 352, 54: 361, 55: 370, 56: nan, 57: nan}, 2: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'In a couple household', 8: nan, 9: 2, 10: 11, 11: 20, 12: 29, 13: 38, 14: 47, 15: 56, 16: nan, 17: 65, 18: 74, 19: 83, 20: 92, 21: 101, 22: 110, 23: 119, 24: nan, 25: 128, 26: 137, 27: 146, 28: 155, 29: 164, 30: 173, 31: 182, 32: nan, 33: 191, 34: 200, 35: 209, 36: 218, 37: 227, 38: 236, 39: 245, 40: nan, 41: 254, 42: 263, 43: 272, 44: 281, 45: 290, 46: 299, 47: 308, 48: nan, 49: 317, 50: 326, 51: 335, 52: 344, 53: 353, 54: 362, 55: 371, 56: nan, 57: nan}, 3: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'Not in a couple household', 8: nan, 9: 3, 10: 12, 11: 21, 12: 30, 13: 39, 14: 48, 15: 57, 16: nan, 17: 66, 18: 75, 19: 84, 20: 93, 21: 102, 22: 111, 23: 120, 24: nan, 25: 129, 26: 138, 27: 147, 28: 156, 29: 165, 30: 174, 31: 183, 32: nan, 33: 192, 34: 201, 35: 210, 36: 219, 37: 228, 38: 237, 39: 246, 40: nan, 41: 255, 42: 264, 43: 273, 44: 282, 45: 291, 46: 300, 47: 309, 48: nan, 49: 318, 50: 327, 51: 336, 52: 345, 53: 354, 54: 363, 55: 372, 56: nan, 57: nan}, 4: {0: nan, 1: nan, 2: ' ', 3: nan, 4: nan, 5: nan, 6: 'Males', 7: 
'All categories: Household type', 8: nan, 9: 4, 10: 13, 11: 22, 12: 31, 13: 40, 14: 49, 15: 58, 16: nan, 17: 67, 18: 76, 19: 85, 20: 94, 21: 103, 22: 112, 23: 121, 24: nan, 25: 130, 26: 139, 27: 148, 28: 157, 29: 166, 30: 175, 31: 184, 32: nan, 33: 193, 34: 202, 35: 211, 36: 220, 37: 229, 38: 238, 39: 247, 40: nan, 41: 256, 42: 265, 43: 274, 44: 283, 45: 292, 46: 301, 47: 310, 48: nan, 49: 319, 50: 328, 51: 337, 52: 346, 53: 355, 54: 364, 55: 373, 56: nan, 57: nan}, 5: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'In a couple household', 8: nan, 9: 5, 10: 14, 11: 23, 12: 32, 13: 41, 14: 50, 15: 59, 16: nan, 17: 68, 18: 77, 19: 86, 20: 95, 21: 104, 22: 113, 23: 122, 24: nan, 25: 131, 26: 140, 27: 149, 28: 158, 29: 167, 30: 176, 31: 185, 32: nan, 33: 194, 34: 203, 35: 212, 36: 221, 37: 230, 38: 239, 39: 248, 40: nan, 41: 257, 42: 266, 43: 275, 44: 284, 45: 293, 46: 302, 47: 311, 48: nan, 49: 320, 50: 329, 51: 338, 52: 347, 53: 356, 54: 365, 55: 374, 56: nan, 57: nan}, 6: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'Not in a couple household', 8: nan, 9: 6, 10: 15, 11: 24, 12: 33, 13: 42, 14: 51, 15: 60, 16: nan, 17: 69, 18: 78, 19: 87, 20: 96, 21: 105, 22: 114, 23: 123, 24: nan, 25: 132, 26: 141, 27: 150, 28: 159, 29: 168, 30: 177, 31: 186, 32: nan, 33: 195, 34: 204, 35: 213, 36: 222, 37: 231, 38: 240, 39: 249, 40: nan, 41: 258, 42: 267, 43: 276, 44: 285, 45: 294, 46: 303, 47: 312, 48: nan, 49: 321, 50: 330, 51: 339, 52: 348, 53: 357, 54: 366, 55: 375, 56: nan, 57: nan}, 7: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: 'Females', 7: 'All categories: Household type', 8: nan, 9: 7, 10: 16, 11: 25, 12: 34, 13: 43, 14: 52, 15: 61, 16: nan, 17: 70, 18: 79, 19: 88, 20: 97, 21: 106, 22: 115, 23: 124, 24: nan, 25: 133, 26: 142, 27: 151, 28: 160, 29: 169, 30: 178, 31: 187, 32: nan, 33: 196, 34: 205, 35: 214, 36: 223, 37: 232, 38: 241, 39: 250, 40: nan, 41: 259, 42: 268, 43: 277, 44: 286, 45: 295, 46: 304, 47: 313, 48: nan, 49: 322, 50: 
331, 51: 340, 52: 349, 53: 358, 54: 367, 55: 376, 56: nan, 57: nan}, 8: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'In a couple household', 8: nan, 9: 8, 10: 17, 11: 26, 12: 35, 13: 44, 14: 53, 15: 62, 16: nan, 17: 71, 18: 80, 19: 89, 20: 98, 21: 107, 22: 116, 23: 125, 24: nan, 25: 134, 26: 143, 27: 152, 28: 161, 29: 170, 30: 179, 31: 188, 32: nan, 33: 197, 34: 206, 35: 215, 36: 224, 37: 233, 38: 242, 39: 251, 40: nan, 41: 260, 42: 269, 43: 278, 44: 287, 45: 296, 46: 305, 47: 314, 48: nan, 49: 323, 50: 332, 51: 341, 52: 350, 53: 359, 54: 368, 55: 377, 56: nan, 57: nan}, 9: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: 'Not in a couple household', 8: nan, 9: 9, 10: 18, 11: 27, 12: 36, 13: 45, 14: 54, 15: 63, 16: nan, 17: 72, 18: 81, 19: 90, 20: 99, 21: 108, 22: 117, 23: 126, 24: nan, 25: 135, 26: 144, 27: 153, 28: 162, 29: 171, 30: 180, 31: 189, 32: nan, 33: 198, 34: 207, 35: 216, 36: 225, 37: 234, 38: 243, 39: 252, 40: nan, 41: 261, 42: 270, 43: 279, 44: 288, 45: 297, 46: 306, 47: 315, 48: nan, 49: 324, 50: 333, 51: 342, 52: 351, 53: 360, 54: 369, 55: 378, 56: nan, 57: nan}}
Here is my complete code:
import pandas as pd


def read_table(table_name):
    # f is the pd.ExcelFile opened in __main__ below
    table = f.parse(sheet_name=table_name, header=None)
    # Column headings start at row 6, up to the first row that has data in column 0
    ALL_CATEGORIES = "All categories: "
    COL_HEAD_ROW = 6
    row = COL_HEAD_ROW
    while pd.isna(table.iloc[row, 0]):
        row += 1
    # The previous row has column names, earlier rows (if any) have hierarchy
    col_levels = row - COL_HEAD_ROW
    col_level_names = []
    col_level_values = [[] for _ in range(col_levels)]
    col = 1
    row = COL_HEAD_ROW
    while col < len(table.columns):
        for r in range(col_levels):
            if pd.isna(table.iloc[row + r, col]):
                # Blank cell: the heading to the left still applies
                col_level_values[r].append(col_level_values[r][-1])
            else:
                value = table.iloc[row + r, col]
                if value.startswith(ALL_CATEGORIES):
                    category = value[len(ALL_CATEGORIES):]
                    value = 'All'
                    if len(col_level_names) <= r:
                        col_level_names.append(category)
                col_level_values[r].append(value)
        col += 1
    # Row headings start after the column headings; the last level has data in column 1
    ROW_HEAD_ROW = COL_HEAD_ROW + col_levels
    row = ROW_HEAD_ROW
    while pd.isna(table.iloc[row, 1]):
        row += 1
    # This row has the row name, earlier rows (if any) have hierarchy
    row_levels = row + 1 - ROW_HEAD_ROW
    row_level_names = []
    row_level_values = [[] for _ in range(row_levels)]
    row = ROW_HEAD_ROW
    level = -1
    wasLevel = True
    data_row_indexes = []
    while row < len(table.index) and not pd.isna(table.iloc[row, 0]):
        if wasLevel or pd.isna(table.iloc[row, 1]):
            level += 1
            if level >= row_levels:
                level = 0
        if pd.isna(table.iloc[row, 1]):
            wasLevel = True
        else:
            wasLevel = False
        value = table.iloc[row, 0]
        if value.startswith(ALL_CATEGORIES):
            category = value[len(ALL_CATEGORIES):]
            value = 'All'
            if len(row_level_names) < row_levels:
                row_level_names.append(category)
        row_level_values[level].append(value)
        if level + 1 == row_levels:
            # A complete leaf row: pad the outer levels and record the data row
            for l in range(level):
                if len(row_level_values[l]) < len(row_level_values[level]):
                    row_level_values[l].append(row_level_values[l][-1])
            data_row_indexes.append(row)
        row += 1
    # print('tableName=' + table_name)
    # print(col_level_names)
    # print(col_level_values)
    # print(row_level_names)
    # print(row_level_values)
    # Construct the DataFrame
    num_cols = len(col_level_values[0])
    num_rows = len(row_level_values[0])
    for l in range(row_levels):
        # Repeat each row value num_cols times, in order
        row_level_values[l] = [x for x in row_level_values[l]
                               for n in range(num_cols)]
    for l in range(col_levels):
        # Repeat the whole column sequence num_rows times
        col_level_values[l] = col_level_values[l] * num_rows
    values = []
    for r in range(num_rows):
        row = data_row_indexes[r]
        values.extend(table.iloc[row, 1:])
    data = [row_level_values[l] for l in range(row_levels)] + \
           [col_level_values[l] for l in range(col_levels)] + \
           [values]
    index = row_level_names + col_level_names + ['Dataset']
    df = pd.DataFrame(data=data, index=index)
    df = df.transpose()
    return df


if __name__ == '__main__':
    excel_file = 'data/BulkdatadetailedcharacteristicsmergedwardspluslaandregE&Wandinfo3.3/Cell Numbered DC Tables 3.3.xlsx'
    f = pd.ExcelFile(excel_file)
    index = f.parse(sheet_name='Index')
    df = read_table(index['Table Number'][6])
    print(df.head())
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow