Iterating through XMLs, making dataframes from nodes and merging them with a master dataframe. How should I optimize this code?
I'm iterating through a lot of XML files, each with ~1000 individual nodes, to extract specific attributes (each node has 15 or so attributes; I only want one of them). In the end, there should be about 4 million rows. My code is below, but I have a feeling it's not time efficient. What can I optimize?
import os, pandas as pd, xml.etree.ElementTree as xml
#init master df as accumulator of temp dfs
master_df = pd.DataFrame(
    columns = [
        'col1',
        'col2',
        'col3',
        'col4'
    ])
dir = 'C:\\somedir'
#iterate through files
for file in os.listdir(dir):
    #init xml handle and parse
    file = open(str(dir+"{}").format('\\'+file))
    parse = xml.parse(file)
    root = parse.getroot()
    #var assignments with desired data
    parent_node1 = str(root[0][0].get('pn1'))
    parent_node2 = str(root[0][1].get('pn2'))
    #resetting iteration dependent variables
    count = 0
    a_dict = {}
    #iterating through list of child nodes
    for i in list(root[1].iter())[1:]:
        child_node1 = str(i.get('cn1'))
        child_node2 = str(i.get('cn2'))
        a_dict.update({
            count: {
                "col1" : parent_node1,
                'col2': child_node1,
                "col3": parent_node2,
                "col4" : child_node2
            }})
        count = count + 1
    temp_df = pd.DataFrame(a_dict).T
    master_df = pd.merge(
        left = master_df,
        right = temp_df,
        how = 'outer'
    )
Solution 1:[1]
Instead of initializing intermediate dataframes that are constantly being merged, I used nested lists, which are much faster under the hood, and since I'm not expecting to handle any irregular data sets this should be fine. Otherwise, all of the code for parsing the XML is the same.
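The answer doesn't show code, but a minimal sketch of that nested-list approach, reusing the parsing logic and the placeholder attribute/column names from the question (pn1, cn1, col1, ...), might look like this: accumulate each row as a plain Python list and build the DataFrame once at the end, instead of merging a temp DataFrame for every file.

import os
import xml.etree.ElementTree as xml

import pandas as pd

xml_dir = 'C:\\somedir'
rows = []  # one inner list per output row, instead of per-file dataframes

for file_name in os.listdir(xml_dir):
    # parse each file directly by path
    root = xml.parse(os.path.join(xml_dir, file_name)).getroot()

    # parent-level attributes, as in the question
    parent_node1 = str(root[0][0].get('pn1'))
    parent_node2 = str(root[0][1].get('pn2'))

    # child nodes: append plain lists rather than building a dict/temp dataframe
    for i in list(root[1].iter())[1:]:
        rows.append([
            parent_node1,
            str(i.get('cn1')),
            parent_node2,
            str(i.get('cn2')),
        ])

# build the dataframe once, after all files are parsed
master_df = pd.DataFrame(rows, columns=['col1', 'col2', 'col3', 'col4'])

Building the DataFrame a single time avoids the repeated merge of a growing master dataframe, which gets slower with every file processed.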
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Carter Canedy |
