Parsing and extracting data from multiple xml files in a directory in order to export to csv
Could someone please help me with my code? I have only just begun to learn Python, in order to complete an assignment that probably isn't as hard as I find it to be.
I need to write a Python script to facilitate extracting metadata of publishing material, stored in TEI xml files. To practice I was given 50 typical files with dummy data. I need to extract, for now, data from two tags in one of the elements.
These files are all alike, but they do not contain repeating data. The repetition is in that all files have the same structure, and the same tags. I need the data of these -deeply- nested tags.
I find it very difficult to parse multiple files instead of just one.
The xml files all look like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE encyclopedia SYSTEM "Encyclopedia.dtd">
<encyclopedia>
  <div2 id="1">
    <head />
    <art>
      <dummyarticle targets="http://referenceworks.brillonline.com/entries/brill-s-new-pauly/*brill000410" idno.doi="http://dx.doi.org/10.1163/1574-9347_bnp_brill000410" id="brill000410" entry="Child emperors" volume="" page="3:223" first-online="20061001" last-update="" first-print="9789004122598, 20110510">
        <pseudoarticle>
          <articleentry>
            <mainentry>Child emperors</mainentry>
          </articleentry>
          <p>see  Emperors, child</p>
        </pseudoarticle>
      </dummyarticle>
    </art>
  </div2>
</encyclopedia>
I need to extract the id and entry attributes from the element 'dummyarticle'. I then need to create a csv file containing the data, for now two columns with 50 rows,
Like so:
id;entry
brill000410;Child emperors
brill000450;Clientela, military
brill000460;Clyster
...etc
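For a single file, the extraction step could look roughly like the sketch below. It assumes the `id` and `entry` values are attributes of the first `dummyarticle` element, as in the sample above; the helper name `extract_row` and the trimmed `SAMPLE` string are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file from the question (most attributes left out).
SAMPLE = """<encyclopedia>
  <div2 id="1">
    <art>
      <dummyarticle id="brill000410" entry="Child emperors" page="3:223">
        <pseudoarticle><p>see Emperors, child</p></pseudoarticle>
      </dummyarticle>
    </art>
  </div2>
</encyclopedia>"""


def extract_row(root):
    """Return {'id': ..., 'entry': ...} for the first <dummyarticle>, or None."""
    article = root.find(".//dummyarticle")
    if article is None:
        return None
    # Attribute names are case-sensitive: 'id', not 'Id'.
    return {"id": article.get("id"), "entry": article.get("entry")}


row = extract_row(ET.fromstring(SAMPLE))

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "entry"], delimiter=";", lineterminator="\n"
)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Passing `delimiter=";"` gives the semicolon-separated layout shown above; the `csv` module defaults to commas.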
For now I am asking for help with parsing all the xml files that a directory holds. This is the code I have so far:
import csv
import xml.etree.ElementTree as ET
import os

os.chdir('c:\\Users\\HP\\for_Anne')
print(os.getcwd())

for f in os.listdir():
    if f.endswith(".xml"):
        continue

dir(ET)
tree = ET.parse('C:/Users/HP/for_Anne/1574-9347_bnp_fulltextxml_brill000410.xml')
root = tree.getroot()
data = []

# testing:
# print(ET.tostring(root, encoding='utf8').decode('utf8'))

# create a csv file containing headers of attributes
attr_list = ['Id', 'entry']
for f in tree.findall('.//{"http://referenceworks.brillonline.com/entries/brill-s-new-pauly/*brill000410"}File'):
    data.append({a:f.attrib[a] for a in attr_list})

with open('Data.csv', 'w') as f:
    w = csv.DictWriter(f, fieldnames=attr_list)
    w.writeheader()
    w.writerows(data)

# tree.write('C:/Users/HP/for_Anne/1574-9347_bnp_fulltextxml_brill000410.xml')
Thanks a lot for thinking along!
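One possible way to handle the whole directory is sketched below. A few things in the attempt above stand out: the `for` loop only executes `continue` and never parses anything, the `findall` path looks for a namespaced `File` tag that the sample file does not contain, and `'Id'` does not match the lowercase `id` attribute. The sketch assumes every file has the structure shown in the sample; the function names `collect_rows` and `write_csv` are hypothetical, not part of any library.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

FIELDS = ["id", "entry"]  # attribute names exactly as they appear in the XML


def collect_rows(xml_dir):
    """Parse every *.xml file in xml_dir and return one dict per <dummyarticle>."""
    rows = []
    for path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as exc:
            # Report and skip malformed files instead of aborting the whole run.
            print(f"Skipping {path.name}: {exc}")
            continue
        for article in root.iter("dummyarticle"):
            # .get() returns the default "" when an attribute is missing.
            rows.append({field: article.get(field, "") for field in FIELDS})
    return rows


def write_csv(rows, csv_path):
    """Write the rows as a semicolon-separated file with a header line."""
    # newline="" is what the csv module expects; it avoids blank lines on Windows.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
```

With the directory from the question, the call would be something like `write_csv(collect_rows(r"c:\Users\HP\for_Anne"), "Data.csv")`. Collecting all rows first and writing once keeps the CSV file from being overwritten on each iteration.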
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
