Python: how to extract the XML part from an xml.p7m file
I have to extract information from an xml.p7m file (an Italian electronic invoice with a digital signature, as far as I can tell).
The extraction itself is already done and works fine with the usual XML invoices from Italy, but we also receive xml.p7m files (which I only recently discovered), and I can't figure out how to deal with those.
I just want the XML part, so I start with these splits to remove the signature part:
    with open(path, encoding='unicode_escape') as f:
        txt = '<?xml version="1.0"' + re.split('<?xml version="1.0"',f.read())[1]
        txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"
What I'm stuck with now is that parts like this remain in the XML:
""" <Anagrafica>
<Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
</Anagraf♦♥èica>"""
This makes the XML not well-formed, so the data extraction fails.
I have to open the file with unicode_escape and remove those parts afterwards, because otherwise I get an error: the signature parts can't be decoded as UTF-8.
If I encode this part, I get:
b' <Anagrafica>\n <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'
Does anyone have an idea how to extract only the XML part from the xml.p7m file? By the way, the XML should be: but if I open the file, there are already characters that don't seem to belong to the UTF-8 charset.
Solution 1:[1]
Just so I can close this question: I "solved" it by removing all the parts I don't want with replace.
    import re

    def getXmlTextRemovedSignature(path):
        txt = ""
        try:
            with open(path, encoding='unicode_escape') as f:
                # re.escape keeps '?' and '.' from being treated as regex operators
                txt = '<?xml version="1.0"' + re.split(re.escape('<?xml version="1.0"'), f.read())[1]
                txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"
        except Exception as e:
            # Covers a missing file as well as a file without the expected XML header
            raise RuntimeError('Could not extract XML from file: ' + str(e)) from e
        # Special characters which translate to --> \x04\xc2\x82\x03\xc3\xa8
        # <Nazione>IT</Nazione>??è
        # <Descrizione>nr ordine 9??è303067091</Descrizione>
        # <NumeroLinea>6<\Numero??èLinea>
        # <Quant??èita>0.00</Quantita>
        # </Anagraf??èica>
        return txt.encode().replace(b"\x04\xc2\x82\x03\xc3\xa8", b'').decode("UTF8")
It's not pretty, that's for sure, but it works.
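A note on where those bytes may come from: undoing the UTF-8 re-encoding, the removed sequence corresponds to the raw bytes `04 82 03 E8`. In BER/DER encoding, `0x04` is the OCTET STRING tag and `82 03 E8` declares a length of 1000 bytes, which suggests the signed content is stored in 1000-byte chunks whose headers land in the middle of the XML. The following is a minimal sketch of the same replace idea on raw bytes, without `unicode_escape`. The function name `extract_xml_from_p7m_bytes`, the default chunk header, and the closing tag are assumptions based on the files in this question; a final, shorter chunk would carry a different header that this sketch does not remove, and the signature is not verified.

```python
def extract_xml_from_p7m_bytes(
    data: bytes,
    chunk_header: bytes = b"\x04\x82\x03\xe8",      # assumption: 1000-byte chunks
    end_tag: bytes = b"</FatturaElettronica>",      # assumption: unprefixed root tag
) -> bytes:
    """Return the XML bytes embedded in a p7m blob (heuristic, no signature check)."""
    # Remove the chunk headers first, because one may fall inside a tag name.
    cleaned = data.replace(chunk_header, b"")
    start = cleaned.find(b"<?xml")
    if start == -1:
        raise ValueError("no XML declaration found in the given data")
    end = cleaned.find(end_tag, start)
    if end == -1:
        raise ValueError("closing tag not found in the given data")
    return cleaned[start:end + len(end_tag)]
```

The input would be read with `open(path, 'rb')`, and the returned bytes can be passed to `xml.etree.ElementTree.fromstring`.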
Solution 2:[2]
I had a similar problem: some characters in the file were not decoded correctly. It was caused by a byte order mark (BOM) at the start of the file.
You can try reading the file with the utf-8-sig encoding, like this:
    with open(path, encoding='utf-8-sig') as f:
        ...
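To illustrate what `utf-8-sig` changes, here is a small sketch that writes a throwaway file starting with a UTF-8 BOM and reads it back with both encodings. The helper name `read_text_without_bom` and the demo file are hypothetical. Note that this only strips a leading BOM; it does not unwrap the signature envelope of a binary p7m file, so decoding such a file as UTF-8 may still fail.

```python
import codecs
import os
import tempfile


def read_text_without_bom(path: str) -> str:
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


# Demo: create a temporary file that starts with a UTF-8 BOM.
with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as tmp:
    tmp.write(codecs.BOM_UTF8 + b'<?xml version="1.0"?><a/>')
    demo_path = tmp.name

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the text.
with open(demo_path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig removes it.
without_bom = read_text_without_bom(demo_path)

os.remove(demo_path)
```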
Solution 3:[3]
The easiest approach is to use OpenSSL:
    C:\OpenSSL-Win64\bin\openssl.exe smime -verify -noverify -in your.xml.p7m -inform DER -out your.xml
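Since the question is about Python, the same command can be called through `subprocess`. This is a minimal sketch using exactly the flags shown above; the function names `build_openssl_command` and `extract_xml_with_openssl` are hypothetical, and the `openssl` default assumes the executable is on the PATH (otherwise pass the full path, such as the one in the command above).

```python
import subprocess


def build_openssl_command(p7m_path: str, xml_path: str, openssl: str = "openssl") -> list:
    """Build the argument list for the openssl call shown in this answer."""
    return [
        openssl, "smime", "-verify", "-noverify",
        "-in", p7m_path,
        "-inform", "DER",
        "-out", xml_path,
    ]


def extract_xml_with_openssl(p7m_path: str, xml_path: str, openssl: str = "openssl") -> str:
    """Run openssl to write the embedded XML to xml_path and return that path."""
    # check=True raises CalledProcessError if openssl exits with an error.
    subprocess.run(build_openssl_command(p7m_path, xml_path, openssl), check=True)
    return xml_path
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.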
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user3793935 |
| Solution 2 | Mike |
| Solution 3 | AreToo |

