Parsing XML with Python roadblock

I have reached a roadblock in my attempts to parse an XML file.

The actual file is larger and has more attributes, but I think this sample is enough to convey my issue.

# this is a sample of the file I am working with:

items.xml = <rest version="12" generated="2022-03-13">
    <reps>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>14061</asn>
            <category>bot</category>
            <reputation_score>21</reputation_score>
            <port>52465</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>NA</asn>
            <category>bot</category>
            <reputation_score>20</reputation_score>
            <port>59823</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>4134</asn>
            <category>bot</category>
            <reputation_score>22</reputation_score>
            <port>17322</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>4812</asn>
            <category>bot</category>
            <reputation_score>100</reputation_score>
            <port>48892</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>3462</asn>
            <category>bot</category>
            <reputation_score>2</reputation_score>
            <port>2516</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>14061</asn>
            <category>bot</category>
            <reputation_score>63</reputation_score>
            <port>58244</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>4134</asn>
            <category>bot</category>
            <reputation_score>57</reputation_score>
            <port>4647</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>7684</asn>
            <category>bruteforce</category>
            <reputation_score>100</reputation_score>
            <port>34700</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>14061</asn>
            <category>bot</category>
            <reputation_score>75</reputation_score>
            <port>36988</port>
        </rep>
    </reps>
</rest>

# from here on I start parsing the file like so:
import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('items.xml')
xroot = tree.getroot()


# I define the cols for my future df
df_cols = ["stamp", "asn", "category", "reputation_score", "port"] 

# and the rows to iterate through the root
rows = []

# after this I try to retrieve the data
for node in xroot: 
    s_stamp = node.find("stamp").text if node is not None else None
    s_category = node.find("category").text if node is not None else None
    s_asn = node.find("asn").text if node is not None else None
    s_reputation_score = node.find("reputation_score").text if node is not None else None
    s_port = node.find("port").text if node is not None else None

    rows.append({
                 "stamp": s_stamp, 
                 "category": s_category, 
                 "asn": s_asn,
                 "reputation_score": s_reputation_score, 
                 "port": s_port, 
                 })
out_df = pd.DataFrame(rows, columns = df_cols)

Unfortunately, this code only raises:

AttributeError: 'NoneType' object has no attribute 'text'

Attempting the following:

out_df = pd.read_xml(file_content)

only returns an empty DataFrame with a single empty column called "rep".

If you could help me find where I am going wrong, I would really appreciate it.

Sources I used to get to this point of the code were:

source1 source2



Solution 1:[1]

It should be easier to do this with the read_xml() method:

df = pd.read_xml([your xml], xpath='//reps//rep')
df

Output (from your sample XML):

                 stamp      asn category  reputation_score   port
0  2022-03-12 10:00:00  14061.0      bot                21  52465
1  2022-03-12 10:00:00      NaN      bot                20  59823
2  2022-03-12 10:00:00   4134.0      bot                22  17322

etc.
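
As for why the original ElementTree loop failed: iterating over the root yields its direct children, and the only direct child of <rest> is <reps>. That element has no <stamp> child of its own, so node.find("stamp") returns None and .text raises the AttributeError. The `node is not None` check does not help, because it tests the node and not the result of find(). Below is a minimal sketch of one possible fix, using a two-record excerpt of the sample; the name SAMPLE is made up for the illustration.

```python
import xml.etree.ElementTree as et

# Two records trimmed from the sample in the question (SAMPLE is a hypothetical name).
SAMPLE = """<rest version="12" generated="2022-03-13">
    <reps>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>14061</asn>
            <category>bot</category>
            <reputation_score>21</reputation_score>
            <port>52465</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>NA</asn>
            <category>bot</category>
            <reputation_score>20</reputation_score>
            <port>59823</port>
        </rep>
    </reps>
</rest>"""

df_cols = ["stamp", "asn", "category", "reputation_score", "port"]

# For a file on disk this would be: xroot = et.parse("items.xml").getroot()
xroot = et.fromstring(SAMPLE)

# The root's only direct child is <reps>, which is what the original loop visited.
print([child.tag for child in xroot])  # ['reps']

# iter("rep") visits every <rep> at any depth; findtext() returns None for a
# missing child instead of raising AttributeError.
rows = [{col: node.findtext(col) for col in df_cols} for node in xroot.iter("rep")]
print(rows[0])
```

Passing rows to pd.DataFrame(rows, columns=df_cols), as in the question, should then give one row per <rep>. Every value is a string at this point, so "NA" stays as text unless you convert it yourself.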

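The empty "rep" column from the plain pd.read_xml(file_content) call is most likely the same nesting problem: the default xpath is './*', which selects the children of the root (here just <reps>) rather than the individual <rep> records. Here is a self-contained sketch of the read_xml approach on a two-record excerpt, assuming pandas 1.3 or later. SAMPLE is a made-up name, and parser="etree" is my choice so the example does not need lxml; that parser only supports a subset of XPath, hence './/rep' instead of '//reps//rep'.

```python
import io

import pandas as pd

# Two records trimmed from the sample in the question (SAMPLE is a hypothetical name).
SAMPLE = """<rest version="12" generated="2022-03-13">
    <reps>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>14061</asn>
            <category>bot</category>
            <reputation_score>21</reputation_score>
            <port>52465</port>
        </rep>
        <rep>
            <stamp>2022-03-12 10:00:00</stamp>
            <asn>NA</asn>
            <category>bot</category>
            <reputation_score>20</reputation_score>
            <port>59823</port>
        </rep>
    </reps>
</rest>"""

# Wrapping the text in StringIO hands pandas a file-like object; a path such as
# "items.xml" can be passed directly instead.
df = pd.read_xml(io.StringIO(SAMPLE), xpath=".//rep", parser="etree")
print(df)
```

Unlike the manual loop, read_xml infers column types, so "NA" in the asn column is read as a missing value and the column becomes float.
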
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jack Fleeting