'How to convert xml file into pandas DataFrame
I am working in sameval 2014 task 4 dataset. it is an xml file, that looks like:
<?xml version="1.0" encoding="UTF-8"?>
-<sentences>
-<sentence id="2339">
<text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
-<aspectTerms>
<aspectTerm to="45" from="41" polarity="neutral" term="cord"/>
<aspectTerm to="86" from="74" polarity="positive" term="battery life"/>
</aspectTerms>
</sentence>
-<sentence id="812">
<text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
-<sentence id="1316">
<text>The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.</text>
-<aspectTerms>
<aspectTerm to="41" from="27" polarity="negative" term="service center"/>
<aspectTerm to="121" from="109" polarity="negative" term="" sales" team"/>
<aspectTerm to="12" from="4" polarity="neutral" term="tech guy"/>
</aspectTerms>
</sentence>
-<sentence id="2328">
<text>I investigated netbooks and saw the Toshiba NB305-N410BL.</text>
</sentence>
-<sentence id="1712">
<text>This was an update from an early MacBook Pro.</text>
</sentence>
-<sentence id="2982">
<text>I think I might rather suffer for something that is simple to fix in my opinion.</text>
</sentence>
</sentences>
I have to convert it into pandas DataFrame with the id, text, term, polarity columns. So resulting DataFrame should look like
id text term polarity
2339 I charge it at night and skip... cord neutral
2339 I charge it at night and skip... batterylife positive
I tried to make an ordered dict
with open('/content/Laptops_Train.xml', 'r', encoding='utf-8') as file:
my_xml = file.read()
I used xmltodict library to parse and convert the XML document
laptop_dict = xmltodict.parse(my_xml)
then I converted into DataFrame:
laptop_dict = json.loads(json.dumps(laptop_dict))
user_ids = []
frames = []
for user_id, d in laptop_dict.items():
user_ids.append(user_id)
frames.append(pd.DataFrame.from_dict(d))
df =pd.concat(list(frames), keys=user_ids)
the DataFrame looks like:
I tried stripping it using str.split but it gives error
ValueError: Columns must be same length as key
How can I convert it into desired format?
Solution 1:[1]
You can use xml.etree.ElementTree like this:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("Laptops_Train.xml")
root = tree.getroot()
get_range = lambda col: range(len(col))
l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]
df = pd.DataFrame.from_dict(l)
print(df)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gaston Alex |

