Python: how to extract the XML part from an xml.p7m file

I have to extract information from an xml.p7m file (an Italian electronic invoice with a digital signature, as far as I can tell).

The extraction already works fine for the plain XML invoices from Italy, but we also receive xml.p7m files (which I only recently discovered), and I can't figure out how to handle them.

I only want the XML part, so I start with these splits to remove the signature:

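import re

with open(path, encoding='unicode_escape') as f:
    # re.escape keeps '?' and '.' in the XML declaration from acting as regex metacharacters
    txt = '<?xml version="1.0"' + re.split(re.escape('<?xml version="1.0"'), f.read())[1]
    txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"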

The problem now is that fragments like this are still left in the XML:

    """ <Anagrafica>
              <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
            </Anagraf♦♥èica>"""

which obviously makes the XML not well-formed, so the data extraction fails.

I have to use unicode_escape to open the file and remove those parts, because otherwise I get an error: the signature bytes can't be decoded as UTF-8.

If I encode this part, I get:

    b' <Anagrafica>\n          <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n        </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'

Does anyone have an idea how to extract only the XML part from the xml.p7m file? When I open the file, it already contains characters that don't seem to belong to the UTF-8 charset.




Solution 1:[1]

Just so I can close this question: I "solved" it by removing all the parts I don't want with replace.

import re

def getXmlTextRemovedSignature(path):
    try:
        with open(path, encoding='unicode_escape') as f:
            # Keep everything from the XML declaration up to the closing root tag
            txt = '<?xml version="1.0"' + re.split(re.escape('<?xml version="1.0"'), f.read())[1]
            txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"
    except Exception as e:
        raise RuntimeError('Could not extract XML from ' + str(path) + ': ' + str(e)) from e
    # Special characters which translate to --> \x04\xc2\x82\x03\xc3\xa8
    # <Nazione>IT</Nazione>??è
    # <Descrizione>nr ordine 9??è303067091</Descrizione>
    # <NumeroLinea>6</Numero??èLinea>
    # <Quant??èita>0.00</Quantita>
    # </Anagraf??èica>
    return txt.encode().replace(b"\x04\xc2\x82\x03\xc3\xa8", b'').decode("UTF8")

It's not pretty, that's for sure, but it works.
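
A note on why that replace works: the raw bytes behind those characters are \x04\x82\x03\xe8, which looks like a DER header for an OCTET STRING chunk (tag 0x04, long-form length 0x82, then 0x03e8 = 1000 bytes). So the signed content seems to be stored in 1000-byte pieces, and a header can land in the middle of a tag. A minimal sketch of the same idea on raw bytes, without the unicode_escape round trip, could look like this. The function name xml_from_p7m_bytes and the closing_tag parameter are made up for illustration, and it assumes a binary (DER, not base64) file whose XML contains no literal \x04 byte:

```python
import re

# DER header of an OCTET STRING chunk: tag 0x04, then either a short-form
# length (one byte below 0x80) or a long-form length (0x81 + 1 byte, 0x82 + 2 bytes).
_CHUNK_HEADER = re.compile(rb"\x04(?:\x82..|\x81.|[\x00-\x7f])", re.DOTALL)


def xml_from_p7m_bytes(data, closing_tag=b"</FatturaElettronica>"):
    """Hypothetical helper: return the embedded XML bytes from DER .p7m content."""
    # Drop everything that looks like a chunk header. This also mangles the
    # signature bytes, which is acceptable here because they are discarded.
    cleaned = _CHUNK_HEADER.sub(b"", data)
    # Slice from the XML declaration to the end of the closing root tag.
    start = cleaned.index(b"<?xml")
    end = cleaned.index(closing_tag, start) + len(closing_tag)
    return cleaned[start:end]
```

You would call it with the file read in binary mode, e.g. xml_from_p7m_bytes(open(path, 'rb').read()). If the root element in your files carries a namespace prefix, pass the matching closing tag. This does not verify the signature at all; it only cuts the XML out.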

Solution 2:[2]

I had a similar problem: some characters in the file were not decoded correctly. It was caused by a byte order mark (BOM) at the start of the file.

You can try to use utf-8-sig encoding to read the file, like this:

with open(path, encoding='utf-8-sig') as f:
   ...
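
To show what utf-8-sig changes, here is a small sketch on an in-memory byte string (the sample content is made up). It only drops a leading BOM; everything else is decoded exactly as with plain utf-8:

```python
import codecs

# Made-up sample: a UTF-8 BOM followed by a tiny XML document
raw = codecs.BOM_UTF8 + '<?xml version="1.0"?><a/>'.encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as U+FEFF at index 0
without_bom = raw.decode("utf-8-sig")  # the leading BOM is dropped
```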

Solution 3:[3]

The easiest way is to use openssl:

C:\OpenSSL-Win64\bin\openssl.exe smime -verify -noverify -in your.xml.p7m -inform DER -out your.xml
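
If you want to call this from Python, a rough sketch with subprocess could look like the following. The helper names build_openssl_command and extract_with_openssl are made up, and it assumes an openssl executable is on the PATH (otherwise pass the full path, as in the command above):

```python
import subprocess
from pathlib import Path


def build_openssl_command(p7m_path, xml_path, openssl="openssl"):
    """Hypothetical helper: build the same command line as shown above."""
    return [
        str(openssl), "smime", "-verify", "-noverify",
        "-in", str(p7m_path), "-inform", "DER",
        "-out", str(xml_path),
    ]


def extract_with_openssl(p7m_path, xml_path, openssl="openssl"):
    """Hypothetical helper: run openssl and return the extracted XML bytes."""
    cmd = build_openssl_command(p7m_path, xml_path, openssl)
    # check=True raises CalledProcessError if openssl exits with an error
    subprocess.run(cmd, check=True, capture_output=True)
    return Path(xml_path).read_bytes()
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.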

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user3793935
Solution 2 Mike
Solution 3 AreToo