Read xls file in pandas / Python: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
I am trying to read an xls file (with only one sheet) into a pandas DataFrame.
The file opens normally in Excel and in Excel for the web. Here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu
I notice that the top two rows contain merged cells, and so do some of the columns.
I have tried several methods (from Stack Overflow), all of which fail.
# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)
error: Excel file format cannot be determined, you must specify an engine manually.
# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)
error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)
error: File is not a zip file
(Possibly converting the file, as opposed to renaming it, is an option.)
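The three errors above are consistent with one another: the bytes reported by xlrd, b'\xef\xbb\xbf<?xml', are a UTF-8 byte order mark followed by an XML declaration, so the file is XML text regardless of its extension, and renaming it does not change that. A minimal sketch of checking the leading bytes before choosing a reader follows. The helper names `sniff_spreadsheet_format` and `sniff_file` are hypothetical (not part of pandas), and the returned labels are only illustrative.

```python
# Leading bytes ("magic numbers") of the containers the Excel engines expect.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a zip archive
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Drop an optional UTF-8 byte order mark, then any leading whitespace.
    text = head[len(UTF8_BOM):] if head.startswith(UTF8_BOM) else head
    text = text.lstrip()
    if text.startswith((b"<?xml", b"<Workbook")):
        return "xml"
    if text[:14].lower().startswith((b"<html", b"<!doctype html")):
        return "html"
    return "unknown"


def sniff_file(path: str, n: int = 64) -> str:
    """Read the first n bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_spreadsheet_format(fh.read(n))


# The bytes quoted in the xlrd error message are classified as XML.
print(sniff_spreadsheet_format(b"\xef\xbb\xbf<?xml version=\"1.0\"?>"))
```

Calling `sniff_file(file)` on the downloaded file would presumably return "xml", which suggests parsing it as XML rather than trying another Excel engine.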
# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)
This method actually yields a result, but the DataFrame bears no relation to the sheet. Presumably one needs to interpret the XML (which seems complex)?
Style Name Table
0 NaN None NaN
1 NaN All funds NaN
# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)
This method reads the file into a single-column DataFrame, one XML line per row. So how could one use this to build a standard 2D DataFrame with the same shape as the xls file?
0 <Workbook xmlns="urn:schemas-microsoft-com:off...
1 <Styles>
2 <Style ss:ID="Default">
3 <Alignment Horizontal="Left"/>
4 </Style>
... ...
226532 </Cell>
226533 </Row>
226534 </Table>
226535 </Worksheet>
226536 </Workbook>
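The tags in this output (Workbook, Worksheet, Table, Row, Cell) suggest a nested layout in which each Row holds Cell elements and each Cell wraps its value in a child element. A minimal sketch of walking that layout with the standard library follows, using a small made-up sample rather than the real file. The function name `rows_from_spreadsheetml` is hypothetical, and the `Data` child and the `ss:Index` attribute (a 1-based column position written when empty cells are skipped) are assumptions about how this file is laid out.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Made-up sample in the same XML layout; the second row skips column 2.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Fund A</Data></Cell>
        <Cell><Data ss:Type="Number">1.5</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">Fund B</Data></Cell>
        <Cell ss:Index="3"><Data ss:Type="Number">2</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
"""


def rows_from_spreadsheetml(xml_text):
    """Return the first worksheet as a rectangular list of rows of strings."""
    root = ET.fromstring(xml_text)
    table = root.find("ss:Worksheet/ss:Table", NS)
    rows = []
    for row in table.findall("ss:Row", NS):
        values = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index (1-based) marks a jump over empty cells: pad the gap.
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                values.extend([None] * (int(index) - 1 - len(values)))
            data = cell.find("ss:Data", NS)
            values.append(data.text if data is not None else None)
        rows.append(values)
    # Pad short rows so that every row has the same number of columns.
    width = max((len(r) for r in rows), default=0)
    return [r + [None] * (width - len(r)) for r in rows]


rows = rows_from_spreadsheetml(SAMPLE)
print(rows)
```

The resulting list of rows can be passed to `pd.DataFrame(rows)`. For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.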
# method 6 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)
This returns an empty list [], whereas one might have expected at least a list of DataFrames.
So the question is: what is the easiest way to read this file into a DataFrame (or a similarly usable format)?
Solution 1:[1]
I am posting the full solution here. It contains the accepted solution above (by @Stef) plus a final step that adds the headers to the DataFrame.
'''
get xls file
convert to xml
parse into dataframe
add headers
'''
import pandas as pd
import xml.etree.ElementTree as ET
import shutil
file_xls = "C:\\Users\\admin\\Downloads\\product-screener.xls"
file_xml = 'C:\\Users\\admin\\Downloads\\product-screener.xml'
# the copy only changes the extension; the content is already XML
shutil.copyfile(file_xls, file_xml)
tree = ET.parse(file_xml)
root = tree.getroot()
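# the first two rows of the table are headers; the data rows start at index 2
data = [[c[0].text for c in r] for r in root[1][0][2:]]
# cell types (e.g. 'Number', 'DateTime') are taken from the first data row
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]
df = pd.DataFrame(data)
df = df.replace('-', None)
# convert each column according to the type of its first data cell
for c in df.columns:
    if types[c] == 'Number':
        df[c] = pd.to_numeric(df[c])
    elif types[c] == 'DateTime':
        df[c] = pd.to_datetime(df[c])
print(df)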
headers = [[c[0].text for c in r] for r in root[1][0][:2]]
# print(headers[0])
# print(len(headers[0]))
# print()
# print(headers[1])
# print(len(headers[1]))
# print()
# up to column AF (the first 32 columns) the labels come from headers[0]
df_headers = headers[0][0:32]
# the next 9 are discrete
x_list = ['discrete: ' + s for s in headers[1][0:9] ]
df_headers = df_headers + x_list
# the next 10 are annualised
x_list = ['annualised: ' + s for s in headers[1][9:19] ]
df_headers = df_headers + x_list
# the next 10 are cumulative
x_list = ['cumulative: ' + s for s in headers[1][19:29] ]
df_headers = df_headers + x_list
# the next 9 are calendar
x_list = ['calendar: ' + s for s in headers[1][29:38] ]
df_headers = df_headers + x_list
# the next 5 are portfolio characteristics (metrics)
x_list = ['metrics: ' + s for s in headers[1][38:43] ]
df_headers = df_headers + x_list
# the next 6 are portfolio characteristics
x_list = ['characteristics: ' + s for s in headers[1][43:49] ]
df_headers = df_headers + x_list
# the final 5 are sustainability characteristics
x_list = ['sustain: ' + s for s in headers[1][49:54] ]
df_headers = df_headers + x_list
print(df_headers)
# add headers to dataframe
df.columns = df_headers
print(df)
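The header-building steps above repeat one pattern: take a slice of the second header row and prefix it with the name of its group. A minimal sketch of the same idea driven by a list of (prefix, width) pairs follows. The function name `build_headers` is hypothetical, the group widths are copied from the slices used above, and the demo labels are made up purely to show the shape of the result.

```python
def build_headers(top_row, sub_row, n_plain, groups):
    """Combine two header rows into one flat list of column labels.

    top_row : labels of the first header row
    sub_row : labels of the second header row
    n_plain : number of leading columns labelled by top_row alone
    groups  : (prefix, width) pairs, in sheet order, covering sub_row
    """
    labels = list(top_row[:n_plain])
    start = 0
    for prefix, width in groups:
        labels += [f"{prefix}: {s}" for s in sub_row[start:start + width]]
        start += width
    return labels


# Group widths copied from the slices above (9 + 10 + 10 + 9 + 5 + 6 + 5 = 54).
GROUPS = [
    ("discrete", 9), ("annualised", 10), ("cumulative", 10), ("calendar", 9),
    ("metrics", 5), ("characteristics", 6), ("sustain", 5),
]

# Tiny made-up rows, only to show the shape of the result.
demo = build_headers(["Name", "ISIN"], ["1y", "3y", "YTD"], 2,
                     [("discrete", 2), ("calendar", 1)])
print(demo)  # ['Name', 'ISIN', 'discrete: 1y', 'discrete: 3y', 'calendar: YTD']
```

With the variables from the solution, `df.columns = build_headers(headers[0], headers[1], 32, GROUPS)` should produce the same 86 labels as the step-by-step version, assuming every header cell holds text.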
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | D.L |
