Python: parsing XML attributes containing "xml" in the name
I don't understand the following behaviour. I am parsing an XML string by passing it to lxml's etree as follows:
from lxml import etree
mini_example = '''
<body>
<div facs="pre-publication" type="description" xml:base="/api/emi" xml:id="f877d62ae6e2c8ab4011c81c474217e0" xml:lang="en">
<title desc="invention-title">Low tech high output PV</title>
<head xml:id="_45a0fe0003">FIELD </head>
<p n="0001" xml:id="_45a0fe0004">The present doc is about fooball</p>
<head xml:id="_45a0fe0005">BACKGROUND</head>
<p n="0002" xml:id="_45a0fe0006">Once upon a time</p>
</div>
</body>'''
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(mini_example.encode(), parser=parser)
paragraphs = './/p[@xml:id]'
heads = './/head[@xml:id]'
titles = './/title'
xml_query = '|'.join([paragraphs, heads, titles])
all_elements = XML_tree.xpath(xml_query)
When I print the attributes of each element:
for para in all_elements:
    print(para.attrib)
which results in:
{'desc': 'invention-title'}
{'{http://www.w3.org/XML/1998/namespace}id': '_45a0fe0003'}
{'n': '0001', '{http://www.w3.org/XML/1998/namespace}id': '_45a0fe0004'}
{'{http://www.w3.org/XML/1998/namespace}id': '_45a0fe0005'}
{'n': '0002', '{http://www.w3.org/XML/1998/namespace}id': '_45a0fe0006'}
So the attribute prefix "xml:" gets expanded into "{http://www.w3.org/XML/1998/namespace}".
Why is this happening?
And, of course, what should I do, since I have multiple attributes such as xml:id, xml:base, etc.?
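For context on the behaviour described above: the "xml" prefix is reserved by the XML Namespaces specification and is always bound to http://www.w3.org/XML/1998/namespace, and lxml reports namespaced names in Clark notation, "{uri}localname", rather than with their prefix. The sketch below shows one way to read such attributes and, if the prefixed form is preferred, to map the keys back. The shortened input document and the helper name prefixed_attributes are assumptions made for illustration, not part of the original code.

```python
from lxml import etree

# The "xml" prefix is permanently bound to this URI by the XML Namespaces spec.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Shortened stand-in for the document in the question (assumption).
doc = etree.fromstring(
    b'<body><p n="0001" xml:id="_45a0fe0004" xml:lang="en">text</p></body>'
)
p = doc.find("p")

# Read a namespaced attribute with Clark notation: "{uri}localname".
xml_id = p.get(f"{{{XML_NS}}}id")

# etree.QName builds the same Clark-notation key without manual braces.
xml_lang = p.get(etree.QName(XML_NS, "lang").text)


def prefixed_attributes(element):
    """Hypothetical helper: copy the attributes, restoring 'xml:' keys."""
    result = {}
    for key, value in element.attrib.items():
        qname = etree.QName(key)
        if qname.namespace == XML_NS:
            result[f"xml:{qname.localname}"] = value
        else:
            result[key] = value
    return result


attrs = prefixed_attributes(p)
print(xml_id, xml_lang, attrs)
```

The same lookup applies to xml:base and any other attribute in that namespace, since only the local name after the closing brace changes.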
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
