Removing BOM characters after adding <?xml version="1.0" encoding="UTF-8"?> to SQL-generated XML file with Python

I'm using Microsoft SQL Server Management Studio to create an XML file. This file needs <?xml version="1.0" encoding="UTF-8"?> at the top to be uploaded properly. I understand that it is fairly normal for this line to be missing and that I need to figure out how to add it myself.

To add the line, I'm calling each of my files and modifying them with the following function:

from datetime import datetime

def append_prologue(file, orgID, schema):
    timestamp = datetime.today().strftime('%Y%m%d')
    new_name = f'{orgID}_000_2022TSDS_{timestamp}1500_' + schema
    new_file = file.parent.parent / 'results/with_prologue' / new_name
    if new_file.exists():
        print(f'{new_file.name} already exists')
    with open(file, 'r') as original:
        data = original.read()
        data = data[3:] #how the original writer dealt with the issue
    with open(new_file, 'w+') as modified:
        modified.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + data)
    return

However, this creates a problem. The file is written, but it still contains "\ufeff", which I understand to be a BOM (byte order mark), so the XML file can't be read properly. I took over this project from a coworker who left my company, and they wrote this code. They addressed the issue by slicing off the BOM, but that doesn't seem to work for me. I also suspect there's a more systematic way of doing it.

What am I doing wrong? Is there a way to remove these characters when I write the file? Should I be approaching this differently?



Solution 1:[1]

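The codecs module should do the trick. Reading the file with the 'utf-8-sig' encoding skips a leading BOM while decoding, so no manual slicing is needed.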

import codecs

StreamReader = codecs.getreader('utf-8-sig')
with StreamReader(open(file, 'rb')) as original:
    ...

Or a much shorter version:

with codecs.open(file, 'r', 'utf-8-sig') as original:
    ...
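
To show how this might fit into the function from the question, here is a minimal sketch. It assumes the SQL-generated file is UTF-8 with a BOM (if it is actually UTF-16, a different encoding would be needed). The out_dir parameter and the org_id spelling are hypothetical changes made only to keep the example self-contained; the question builds the output path from file.parent.parent instead. The built-in open() accepts the same 'utf-8-sig' encoding as codecs.open(), so it is used here.

```python
from datetime import datetime
from pathlib import Path

XML_PROLOGUE = '<?xml version="1.0" encoding="UTF-8"?>'


def append_prologue(file: Path, org_id: str, schema: str, out_dir: Path) -> Path:
    """Copy `file` into `out_dir` with an XML prologue and without a BOM.

    `out_dir` is a hypothetical parameter added for this sketch.
    """
    timestamp = datetime.today().strftime('%Y%m%d')
    new_file = out_dir / f'{org_id}_000_2022TSDS_{timestamp}1500_{schema}'

    # 'utf-8-sig' drops a leading BOM if one is present and otherwise
    # decodes like plain UTF-8, so data[3:] is no longer needed.
    with open(file, 'r', encoding='utf-8-sig') as original:
        data = original.read()

    # Writing with plain 'utf-8' does not emit a BOM.
    with open(new_file, 'w', encoding='utf-8') as modified:
        modified.write(XML_PROLOGUE + data)
    return new_file
```

One likely reason the data[3:] approach fails: the BOM is three bytes on disk, but once the file is decoded as UTF-8 it is a single character ("\ufeff"), so slicing three characters off a decoded string does not match what is actually there. Naming the encoding explicitly on both the read and the write side avoids depending on the platform default.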

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Jinksy