Parsing XML with Python roadblock
I have reached a roadblock in my attempts to parse an xml file.
The real file is larger and has more attributes, but I think this sample is enough to convey my issue.
# This is a sample of the file I am working with, items.xml:
<rest version="12" generated="2022-03-13">
<reps>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>14061</asn>
<category>bot</category>
<reputation_score>21</reputation_score>
<port>52465</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>NA</asn>
<category>bot</category>
<reputation_score>20</reputation_score>
<port>59823</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>4134</asn>
<category>bot</category>
<reputation_score>22</reputation_score>
<port>17322</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>4812</asn>
<category>bot</category>
<reputation_score>100</reputation_score>
<port>48892</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>3462</asn>
<category>bot</category>
<reputation_score>2</reputation_score>
<port>2516</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>14061</asn>
<category>bot</category>
<reputation_score>63</reputation_score>
<port>58244</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>4134</asn>
<category>bot</category>
<reputation_score>57</reputation_score>
<port>4647</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>7684</asn>
<category>bruteforce</category>
<reputation_score>100</reputation_score>
<port>34700</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>14061</asn>
<category>bot</category>
<reputation_score>75</reputation_score>
<port>36988</port>
</rep>
</reps>
</rest>
# From here on I start parsing the file like so:
import pandas as pd
import xml.etree.ElementTree as et  # assumed; the original snippet omits its imports

tree = et.parse('items.xml')
xroot = tree.getroot()
# I define the columns for my future DataFrame
df_cols = ["stamp", "asn", "category", "reputation_score", "port"]
# and the list of rows to fill while iterating through the root
rows = []
# after this I try to retrieve the data
for node in xroot:
    s_stamp = node.find("stamp").text if node is not None else None
    s_category = node.find("category").text if node is not None else None
    s_asn = node.find("asn").text if node is not None else None
    s_reputation_score = node.find("reputation_score").text if node is not None else None
    s_port = node.find("port").text if node is not None else None
    rows.append({
        "stamp": s_stamp,
        "category": s_category,
        "asn": s_asn,
        "reputation_score": s_reputation_score,
        "port": s_port,
    })
out_df = pd.DataFrame(rows, columns=df_cols)
Unfortunately, this code only raises:
AttributeError: 'NoneType' object has no attribute 'text'
Attempting the following instead:
out_df = pd.read_xml(file_content)
only returns an empty DataFrame with a single empty column called "rep".
If you could help me find where I am going wrong I would really appreciate it.
Sources I used to get to this point of the code were:
Solution 1:[1]
It should be easier to do this with the read_xml() method:
df = pd.read_xml([your xml], xpath='//reps//rep')
df
Output (from your sample xml):
                 stamp      asn category  reputation_score   port
0  2022-03-12 10:00:00  14061.0      bot                21  52465
1  2022-03-12 10:00:00      NaN      bot                20  59823
2  2022-03-12 10:00:00   4134.0      bot                22  17322
etc.
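As for why the original loop fails: iterating over the root yields its direct children, and in the sample the only child of <rest> is the <reps> wrapper. That wrapper has no <stamp> child of its own, so node.find("stamp") returns None and .text raises the AttributeError. The same nesting probably explains the empty "rep" column from read_xml() without an xpath. Below is a minimal sketch of an ElementTree fix, assuming the file has the same structure as the sample; XML_SAMPLE, DF_COLS and parse_reps are illustrative names, not part of the original code.

```python
import xml.etree.ElementTree as et

# Shortened copy of the sample from the question (two records only).
XML_SAMPLE = """<rest version="12" generated="2022-03-13">
<reps>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>14061</asn>
<category>bot</category>
<reputation_score>21</reputation_score>
<port>52465</port>
</rep>
<rep>
<stamp>2022-03-12 10:00:00</stamp>
<asn>NA</asn>
<category>bot</category>
<reputation_score>20</reputation_score>
<port>59823</port>
</rep>
</reps>
</rest>"""

DF_COLS = ["stamp", "asn", "category", "reputation_score", "port"]


def parse_reps(xml_text):
    """Return one dict per <rep> element (hypothetical helper name)."""
    xroot = et.fromstring(xml_text)
    rows = []
    # iter("rep") walks the whole tree, so the <reps> wrapper is skipped
    # and only the <rep> records are visited.
    for rep in xroot.iter("rep"):
        # findtext() returns None instead of raising when a child is missing.
        rows.append({col: rep.findtext(col) for col in DF_COLS})
    return rows


rows = parse_reps(XML_SAMPLE)
print(rows[0]["asn"], rows[1]["asn"])
```

The resulting rows list can be passed to pd.DataFrame(rows, columns=DF_COLS) exactly as in the question. Note that every value stays a string here (including "NA"), whereas read_xml() infers numeric types.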
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jack Fleeting |
