'How to convert xml file into pandas DataFrame

I am working in sameval 2014 task 4 dataset. it is an xml file, that looks like:

<?xml version="1.0" encoding="UTF-8"?>
-<sentences>


-<sentence id="2339">
        <text>I charge it at night and skip taking the cord with me because of the good battery life.</text>


-<aspectTerms>
            <aspectTerm to="45" from="41" polarity="neutral" term="cord"/>
            <aspectTerm to="86" from="74" polarity="positive" term="battery life"/>
        </aspectTerms>
    </sentence>


-<sentence id="812">
        <text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
    </sentence>


-<sentence id="1316">
        <text>The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.</text>


-<aspectTerms>
            <aspectTerm to="41" from="27" polarity="negative" term="service center"/>
            <aspectTerm to="121" from="109" polarity="negative" term="" sales" team"/>
            <aspectTerm to="12" from="4" polarity="neutral" term="tech guy"/>
        </aspectTerms>
    </sentence>


-<sentence id="2328">
        <text>I investigated netbooks and saw the Toshiba NB305-N410BL.</text>
    </sentence>
-<sentence id="1712">
        <text>This was an update from an early MacBook Pro.</text>
    </sentence>


-<sentence id="2982">
        <text>I think I might rather suffer for something that is simple to fix in my opinion.</text>
    </sentence>
</sentences>

I have to convert it into pandas DataFrame with the id, text, term, polarity columns. So resulting DataFrame should look like

id       text                                    term          polarity
2339     I charge it at night and skip...        cord          neutral
2339     I charge it at night and skip...        batterylife   positive

I tried to make an ordered dict

with open('/content/Laptops_Train.xml', 'r', encoding='utf-8') as file:
my_xml = file.read()

I used xmltodict library to parse and convert the XML document

laptop_dict = xmltodict.parse(my_xml)

then I converted into DataFrame:

laptop_dict = json.loads(json.dumps(laptop_dict))

user_ids = []
frames = []

for user_id, d in laptop_dict.items():
user_ids.append(user_id)
frames.append(pd.DataFrame.from_dict(d))

df =pd.concat(list(frames), keys=user_ids)

the DataFrame looks like:

I tried stripping it using str.split but it gives error

ValueError: Columns must be same length as key

How can I convert it into desired format?



Solution 1:[1]

You can use xml.etree.ElementTree like this:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("Laptops_Train.xml")
root = tree.getroot()

get_range = lambda col: range(len(col))
l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]

df = pd.DataFrame.from_dict(l)
print(df)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Gaston Alex