Python code to flatten nested XML runs out of memory with larger files (250MB and up)

We are trying to flatten files that have nested tables. We want all the tables in separate .csv files where each line is a complete record. We wrote a function that uses pandas' pd.read_xml together with an .xsl (XSLT 1.0) stylesheet to perform the transformation that creates the records.
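For context, the approach described above can be sketched roughly like this. The XML content, the XSLT stylesheet, and the tag names are made-up stand-ins for the real files; the key point is that `pd.read_xml` parses the entire (transformed) document into memory at once, which is why it breaks down on multi-hundred-MB inputs:

```python
# Sketch of the question's approach (hypothetical data and stylesheet):
# pandas applies an XSLT 1.0 stylesheet while reading XML (lxml backend),
# then the flattened rows are written to CSV.
from io import StringIO
import pandas as pd

xml = StringIO("""\
<orders>
  <order><id>1</id><customer><name>Ann</name></customer></order>
  <order><id>2</id><customer><name>Bob</name></customer></order>
</orders>""")

# XSLT that pulls the nested <customer><name> up to the <order> level
xsl = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/orders">
    <orders>
      <xsl:for-each select="order">
        <order>
          <id><xsl:value-of select="id"/></id>
          <name><xsl:value-of select="customer/name"/></name>
        </order>
      </xsl:for-each>
    </orders>
  </xsl:template>
</xsl:stylesheet>"""

# read_xml builds the whole document tree in memory before returning
df = pd.read_xml(xml, stylesheet=xsl)
df.to_csv("orders_flat.csv", index=False)
```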

This works like a charm with smaller files (under 250MB), but as soon as a file gets bigger it no longer works. We are running this in an Azure Function and have upgraded to the highest premium plan to see whether that solves it. We have also tried it on a local PC. Both options keep failing: the Azure Function exits with code 137 (typically an out-of-memory kill), and on a local PC it keeps running, is still not finished after 24 hours, and eventually times out.

Any suggestions or help are appreciated. It seems to be memory related, but we can't find a solution.



Solution 1:[1]

To save memory when reading XML, use event-driven parsing. It feeds elements to your code one by one as they are parsed, rather than building the whole document tree in memory. While this minimizes memory use in the parser, you still need to make sure your own code does not accumulate everything it receives.
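The standard library's `xml.etree.ElementTree.iterparse` is one way to get this behavior: it hands each element to your loop as its closing tag is parsed, and you free the subtree with `elem.clear()` once the record is written out. A minimal sketch with hypothetical tag names:

```python
# Event-driven flattening with ElementTree.iterparse: only the current
# record's subtree lives in memory at any time. Tag names (<order>, <id>,
# <customer>/<name>) are hypothetical stand-ins for the real schema.
import io
import xml.etree.ElementTree as ET

src = io.BytesIO(b"""\
<orders>
  <order><id>1</id><customer><name>Ann</name></customer></order>
  <order><id>2</id><customer><name>Bob</name></customer></order>
</orders>""")

rows = []
for event, elem in ET.iterparse(src, events=("end",)):
    if elem.tag == "order":
        # record complete: extract the flattened fields, then free it
        rows.append((elem.findtext("id"), elem.findtext("customer/name")))
        elem.clear()  # drop the subtree so memory stays flat

print(rows)  # [('1', 'Ann'), ('2', 'Bob')]
```

For a real file you would pass a filename (or open file) instead of the `BytesIO`, and append each tuple to a CSV writer instead of a list.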

SAX is the standard API for this; in Python it is available as the built-in `xml.sax` module. (Docs)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Joshua Fox