Large XML File Parsing in Python
I have a 4 GB XML file that I want to parse and convert to a pandas DataFrame so I can work on it. With a file this large, the code below never finishes: it keeps running without producing any output. The same code gives the correct output on a smaller file with the same structure.
Can anyone suggest a solution? For example, code that speeds up the XML-to-DataFrame conversion, or a way to split the XML file into smaller subsets.
Also, should I work with such large XML files on my personal system (2 GB RAM), or should I use Google Colab? If Google Colab, is there a faster way to upload such large files to Drive and from there into Colab?
This is the code I used:
import pandas as pd
import xml.etree.ElementTree as ET

# Parse the whole file into an in-memory tree
tree = ET.parse("Badges.xml")
root = tree.getroot()

# Column names for the DataFrame and the matching XML attribute names
columns = ["row Id", "UserId", "Name", "Date", "Class", "TagBased"]
attrs = ["Id", "UserId", "Name", "Date", "Class", "TagBased"]

# Collect one list of attribute values per <row> element
rows = []
for node in root:
    rows.append([node.attrib.get(name) for name in attrs])

# Build the DataFrame once (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=columns)
This is an excerpt of my XML file:
<badges>
<row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
<row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
Solution 1:[1]
Consider using cElementTree instead of ElementTree
https://effbot.org/zone/celementtree.htm
The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory.
The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:
from elementtree import ElementTree
to
import cElementTree as ElementTree
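Note that on Python 3.3 and later, xml.etree.ElementTree already uses the C accelerator when it is available, and xml.etree.cElementTree was removed in Python 3.9, so switching imports may not change anything on a current interpreter. A faster parser also still builds the whole tree in memory. As a minimal sketch of a different technique (not part of the answer above), ET.iterparse can stream the file and discard each row once it has been read. The helper name iter_rows and the FIELDS tuple are hypothetical, chosen to match the sample file in the question:

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the sample <row> elements in the question
FIELDS = ("Id", "UserId", "Name", "Date", "Class", "TagBased")


def iter_rows(source, tag="row", fields=FIELDS):
    """Yield one dict per <row> element without keeping the whole tree.

    `source` is a filename or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {name: elem.attrib.get(name) for name in fields}
            # Drop the children parsed so far so memory use stays roughly flat
            root.clear()
```

Each yielded dict can be added to a small batch and turned into a DataFrame with pd.DataFrame(batch), or written straight to a CSV file, so that only one batch is held in memory at a time. Whether the complete result fits in 2 GB of RAM is a separate question that this sketch does not settle.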
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mads Hansen |
