'Why is lxml.etree.iterparse() eating up all my memory?
This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference.
What am I doing wrong / how can I process this large file with iterparse()?
import lxml.etree
for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
print "why does this consume all my memory?"
I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.
Solution 1:[1]
Directly copied from http://effbot.org/zone/element-iterparse.htm
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:
for event, elem in iterparse(source):
if elem.tag == "record":
... process record elements ...
elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
# get an iterable
context = iterparse(source, events=("start", "end"))
# turn it into an iterator
context = iter(context)
# get the root element
event, root = context.next()
for event, elem in context:
if event == "end" and elem.tag == "record":
... process record elements ...
root.clear()
Solution 2:[2]
This worked really well for me:
def destroy_tree(tree):
root = tree.getroot()
node_tracker = {root: [0, None]}
for node in root.iterdescendants():
parent = node.getparent()
node_tracker[node] = [node_tracker[parent][0] + 1, parent]
node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
in node_tracker.items()], key=lambda x: x[0], reverse=True)
for _, parent, child in node_tracker:
if parent is None:
break
parent.remove(child)
del tree
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | sente |
| Solution 2 | PascalVKooten |
