'XmlReader breaks on UTF-8 BOM

I have the following XML Parsing code in my application:

    public static XElement Parse(string xml, string xsdFilename)
    {
        var readerSettings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = new XmlSchemaSet()
        };
        readerSettings.Schemas.Add(null, xsdFilename);
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
        readerSettings.ValidationEventHandler +=
            (o, e) => { throw new Exception("The provided XML does not validate against the request's schema."); };

        var readerContext = new XmlParserContext(null, null, null, XmlSpace.Default, Encoding.UTF8);

        return XElement.Load(XmlReader.Create(new StringReader(xml), readerSettings, readerContext));
    }

I am using it to parse strings sent to my WCF service into XML documents, for custom deserialization.

It works fine when I read in files and send them over the wire (the request); I've verified that the BOM is not sent across. In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

System.Xml.XmlException : Data at the root level is invalid. Line 1, position 1.

In the research I've done over the last hour or so, it appears that XmlReader should honor the BOM. If I manually remove the BOM from the front of the string, the response xml parses fine.

Am I missing something obvious, or at least something insidious?

EDIT: Here is the serialization code I'm using to return the response:

private static string SerializeResponse(Response response)
{
    var output = new MemoryStream();
    var writer = XmlWriter.Create(output);
    new XmlSerializer(typeof(Response)).Serialize(writer, response);
    var bytes = output.ToArray();
    var responseXml = Encoding.UTF8.GetString(bytes);
    return responseXml;
}

If it's just a matter of the xml incorrectly containing the BOM, then I'll switch to

var responseXml = new UTF8Encoding(false).GetString(bytes);

but it was not clear at all from my research that the BOM was illegal in the actual XML string; see e.g. c# Detect xml encoding from Byte Array?



Solution 1:[1]

In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

So you want to prevent the BOM from being added as part of your serialization process. Unfortunately, you don't provide what your serialization logic is.

What you should do is provide a UTF8Encoding instance created via the UTF8Encoding(bool) constructor to disable generation of the BOM, and pass this Encoding instance to whichever methods you're using which are generating your intermediate string.

Solution 2:[2]

The BOM shouldn't be in the string in the first place.
BOMs are used to detect the encoding of a raw byte array; they have no business being in an actual string.

What does the string come from?
You're probably reading it with the wrong encoding.

Solution 3:[3]

Strings in C# are encoded as UTF-16, so the BOM would be wrong. As a general rule, always encode XML to byte arrays and decode it from byte arrays.

Solution 4:[4]

Here is how to convert the MemoryStream to a string that can be used with XmlDocument (the Skip function is Linq):

    public static string Decode(MemoryStream ms)
    {
      var fileBytes = ms.ToArray();
      var isUnicode = fileBytes[0] == 0xff && fileBytes[1] == 0xfe;  //UTF-16 little endian
      var isUtf8Bom = fileBytes[0] == 0xef && fileBytes[1] == 0xbb && fileBytes[2] == 0xbf;

      string xml = isUnicode ? Encoding.Unicode.GetString(fileBytes) : 
                   (isUtf8Bom ? new UTF8Encoding(false, true).GetString(fileBytes.Skip(3).ToArray()) : new UTF8Encoding(false, true).GetString(fileBytes));

       return xml;
    } 

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jonp
Solution 2 SLaks
Solution 3 Stephen Cleary
Solution 4 Gerhard Schmeusser