Removing BOM characters after adding <?xml version="1.0" encoding="UTF-8"?> to SQL-generated XML file with Python
I'm using Microsoft SQL Server Management Studio to create an XML file. This file needs <?xml version="1.0" encoding="UTF-8"?> at the top to be uploaded properly. I understand that it is fairly normal for the export to lack this line, and that I need to figure out how to add it myself.
To add the line, I'm calling each of my files and modifying them with the following function:
    def append_prologue(file, orgID, schema):
        timestamp = datetime.today().strftime('%Y%m%d')
        new_name = f'{orgID}_000_2022TSDS_{timestamp}1500_' + schema
        new_file = file.parent.parent / 'results/with_prologue' / new_name
        if new_file.exists():
            print(f'{new_file.name} already exists')
        with open(file, 'r') as original:
            data = original.read()
        data = data[3:]  # how the original writer dealt with the issue
        with open(new_file, 'w+') as modified:
            modified.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + data)
        return
However, this creates a problem. The file gets written, but the output contains "\ufeff", which I understand to be a BOM (byte order mark), and the XML file can't be read properly. I took over this project from a coworker who left my company, and they wrote this code. They addressed the issue by removing the BOM (the data[3:] line), but that doesn't seem to work for me. I also suspect there's a more systematic way of doing it.
What am I doing wrong? Is there a way to remove these characters when I write the file? Should I be approaching this differently?
Solution 1:[1]
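The codecs module from the standard library should do the trick. Reading the file with the utf-8-sig codec strips a leading BOM if one is present:

    import codecs

    StreamReader = codecs.getreader('utf-8-sig')
    with StreamReader(open(file, 'rb')) as original:
        ...

Or, as a much shorter version:

    with codecs.open(file, 'r', 'utf-8-sig') as original:
        ...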
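For context on why data[3:] is fragile: a UTF-8 BOM is three bytes (EF BB BF), but once the file is decoded as UTF-8 it becomes a single character, U+FEFF. Slicing three characters therefore only works when the file happens to be read with a single-byte default encoding. Otherwise it removes the BOM plus two characters of real content. In Python 3, the built-in open() accepts the same utf-8-sig encoding, so the whole read-and-rewrite step could look roughly like the sketch below. It assumes the SSMS export is UTF-8 with a BOM (a UTF-16 export would need a different encoding on the read side). The names add_prologue and PROLOGUE are made up for this illustration and are not part of the original code.

```python
import codecs
import tempfile
from pathlib import Path

# Hypothetical constant for this sketch: the declaration the upload expects.
PROLOGUE = '<?xml version="1.0" encoding="UTF-8"?>'


def add_prologue(src: Path, dst: Path) -> None:
    """Copy src to dst, dropping a leading UTF-8 BOM and prepending PROLOGUE."""
    # 'utf-8-sig' consumes a leading BOM if present and is a no-op otherwise.
    # newline='' leaves the file's line endings untouched.
    with open(src, 'r', encoding='utf-8-sig', newline='') as original:
        data = original.read()
    # Plain 'utf-8' never writes a BOM on output.
    with open(dst, 'w', encoding='utf-8', newline='') as modified:
        modified.write(PROLOGUE + data)


# Small demonstration on a throwaway file that starts with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'with_bom.xml'
    dst = Path(tmp) / 'with_prologue.xml'
    src.write_bytes(codecs.BOM_UTF8 + b'<root/>')
    add_prologue(src, dst)
    print(dst.read_bytes())  # b'<?xml version="1.0" encoding="UTF-8"?><root/>'
```

Passing the encoding explicitly on both sides also means the result no longer depends on the platform's default encoding, which may be why the slicing approach worked on one machine and not on another.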
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jinksy |