Python code to flatten nested XML runs out of memory with larger files (above 250 MB)
We are trying to flatten files that contain nested tables. We want all the tables in separate .csv files where each line is a complete record. We wrote a function that uses pd.read_xml together with an .xsl (XSLT 1.0) stylesheet to perform the transformation and create the records.
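For reference, a minimal sketch of the approach described above. The file names, the XML shape (`<orders>/<order>/<items>/<item>`), and the field names are hypothetical; only the `pd.read_xml(stylesheet=...)` call mirrors the question. It requires lxml, which pandas uses for XSLT:

```python
# Hypothetical example: flatten nested <item> rows with pd.read_xml + XSLT 1.0.
import os
import tempfile

import pandas as pd

xml_doc = """<orders>
  <order id="1">
    <items>
      <item><sku>A</sku><qty>2</qty></item>
      <item><sku>B</sku><qty>3</qty></item>
    </items>
  </order>
</orders>"""

xsl_doc = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:template match="/orders">
    <rows>
      <xsl:for-each select="order/items/item">
        <row>
          <order_id><xsl:value-of select="../../@id"/></order_id>
          <sku><xsl:value-of select="sku"/></sku>
          <qty><xsl:value-of select="qty"/></qty>
        </row>
      </xsl:for-each>
    </rows>
  </xsl:template>
</xsl:stylesheet>"""

tmp = tempfile.mkdtemp()
xml_path = os.path.join(tmp, "orders.xml")
xsl_path = os.path.join(tmp, "flatten.xsl")
with open(xml_path, "w") as f:
    f.write(xml_doc)
with open(xsl_path, "w") as f:
    f.write(xsl_doc)

# pandas applies the stylesheet first, then parses the flattened result.
# Both the source tree and the transformed tree live in memory at once,
# which is why this approach struggles as the input grows.
df = pd.read_xml(xml_path, stylesheet=xsl_path)
df.to_csv(os.path.join(tmp, "orders.csv"), index=False)
```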
This works like a charm with smaller files (under 250 MB), but it fails as soon as the file gets bigger. We run this in an Azure Function and upgraded to the highest premium plan to test whether that would help. We have also tried it on a local PC. Both options keep failing: the Azure Function exits with code 137 (SIGKILL, commonly an out-of-memory kill), while on the local PC the process is still running after 24 hours and eventually times out.
Any suggestions or help are appreciated. It seems to be memory related, but we can't find a solution.
Solution 1:[1]
To save memory when reading XML, use event-driven parsing. It feeds the elements to your code one by one, rather than building the whole document tree in memory. While this minimizes memory use in the parser, your own code must also avoid accumulating the parsed elements, or memory use will grow just the same.
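A minimal sketch of this idea with the standard library's `iterparse`, streaming each record straight to CSV. The nested structure (`<orders>/<order>/<items>/<item>`) and the field names are assumptions for illustration; adapt the tag names to your schema:

```python
import csv
import xml.etree.ElementTree as ET


def stream_orders_to_csv(xml_path, csv_path):
    """Flatten nested <item> records into a CSV without loading the tree."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "sku", "qty"])
        # iterparse yields each element when its closing tag is seen, so a
        # complete <order> subtree is available without reading the rest
        # of the file into memory.
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "order":
                order_id = elem.get("id")
                for item in elem.iterfind("./items/item"):
                    writer.writerow(
                        [order_id, item.findtext("sku"), item.findtext("qty")]
                    )
                elem.clear()  # discard the processed subtree to cap memory
```

The key line is `elem.clear()`: without it, the already-processed subtrees stay referenced and memory grows with file size, which is the failure mode described in the question.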
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Joshua Fox |
