'SQL - UTF-8 to varchar/nvarchar Encoding issue

Problem background - I am receiving response data from a website, formatted in json and UTF-8 encoded. A body attribute of json has values in a base64binary type, and I store it as nvarchar on ms sql server.

That base64binary data when converted to varchar or nvarchar contains funny characters(in places of double quotes etc.), indicating that there is an encoding issue - This is reason I post this question.

See dissected code as follows and a runnable example at the bottom, with comments.

Notice funny characters during conversion.

eg. On behalf of IRB Holding Corp (the “Company")

The following query fixes above issue - I see quotes as they should appear, but then it fails on rows containing '&' , which is a special character in xml.

select    convert(xml,  '<?xml version="1.0" encoding="UTF-8"?>' + convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')))

The following query handles above issue by using replace statements and I am able to completely see all the rows as expected. But this solution will only handle the '&' s.

Example code to run:

    declare @t table ( [body] nvarchar(max) ) 
    
    insert into @t(body) 
    select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=='
    
    select convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))
        , convert(xml, '<?xml version="1.0" encoding="UTF-8"?>'+ replace(convert(varchar(max),convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))),'&','&amp;')) 
from @t

The question - Will I have to add more replace statements for other xml special characters - < , > ?



Solution 1:[1]

The XML trick works fine, just let the XML engine handle the character entities:

declare @t table ([body] nvarchar(max));

insert into @t(body) 
values ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==');

select
    cast(
        cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
        +
        CAST('' as xml).value('xs:base64Binary(sql:column("body"))', 'VARBINARY(MAX)')
        +
        cast(']]></root>' as varbinary(max))
    as xml).value('.', 'nvarchar(max)')
from
@t;

The important parts here are:

  • The absence of N in front of the string literals
  • The encoding="UTF-8"
  • The fact that we know that the characters from the XML declaration element have the same UTF-8 representation as they do in latin1, so casting them to varbinary gives valid UTF-8
  • The <![CDATA]]> block.

Note that it is still no more than a hack. As soon as you involve XML, you are subject to the XML limitations, and if your string contains characters not representable in XML, that type of XML conversion is going to fail with

XML parsing: line 1, character 54, illegal xml character

Solution 2:[2]

UPDATE: I just learnd something new, which is - uhm - great :-)

Try this function

CREATE FUNCTION dbo.Convert_utf8(@utf8 VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @rslt NVARCHAR(MAX);

    SELECT @rslt=
    CAST(
          --'<?xml version="1.0" encoding="UTF-8"?><![CDATA['
          0x3C3F786D6C2076657273696F6E3D22312E302220656E636F64696E673D225554462D38223F3E3C215B43444154415B
          --the content goes within CDATA
        + @utf8
        --']]>'
        + 0x5D5D3E
    AS XML).value('.', 'nvarchar(max)');

    RETURN @rslt;
END
GO

And call it like this

SELECT *
      ,dbo.Convert_utf8(CAST(t.body AS XML).value('.','varbinary(max)'))
FROM @t t;

The result is

DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Park” 

GSerg, thank you very much ! for your answer below. I tried around and simplified this to work within an UDF.

It looks as if the cast of a varbinary(max) to XML is completely done in a CLR environment, where the XML's encoding declaration is taken into account. This seems to work with other encodings too, but I don't have the time right now, to test this generically.

Now the rest of the answer

As it contains some background about string encoding, which might be worth to read.

I simplified your code a bit:

declare @t table ( [body] nvarchar(max) ) 

insert into @t(body) 
select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==';

SELECT  CAST(t.body AS XML).value('.','varbinary(max)')
       ,CAST(CAST(t.body AS XML).value('.','varbinary(max)') AS VARCHAR(MAX))
FROM @t t;

You will see this result

0x44414C4C41532C20545820E28093204D617920372C203230313920E2809320436F7665792026205061726B20456E6572677920486F6C64696E6773204C4C432028E2809C436F766579205061726BE2809D20  
DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Park†

I'll place the first characters more reader friendly

0x44414C4C41532C20545820E28093  
   D A L L A S ,   T X   â € “ 

The 0x44 is the D, twice the 0x4C is the doubled LL, and after the space 0x20 we get to E28093. This is a 3-byte encoded code point for the en dash. SQL-Server will not help you with this... It will interpret this into 3 characters of 1 byte each...

I'm afraid, you'r out of luck...

SQL-Server does not support utf-8 strings. There is limited support with BCP / BULK to enable input from the file system, but a string within T-SQL must be one of the two supported options:

  • (var)char, which is extended ASCII. It is strictly one-byte-per-character and will need a collation to deal with a limited set of foreign characters.
  • n(var)char, which is UCS-2 (very similar to UTF-16). It is strictly two-bytes-per-character and will encode (almost) any known character at the price of doubled size in memory.

UTF-8 is compatible with (var)char, as long as we stick with plain latin and to one-byte-codes. But any ASCII code above 127 will lead into troubles (might work with the right collation). But - this is your case here - your string uses multi-byte-code-points. UTF-8 will encode a lot of characters with two or even more bytes (up to 4!) for one single character.

What you can do

You will have to use some engine capable to deal with UTF-8

  • a CLR-function
  • Export to a file and re-import using the limited support (needs v2014 SP2 or higher)
  • Use an external tool (PowerShell, C#, any programming language you know)

And - thx to @GSerg - two more options:

  • Wait for v2019. There will be special collations allowing for native support of utf-8 in T-SQL-strings
  • This answer provides an UDF, which can transform UTF8 to NVARCHAR. It won't be fast, but it works.

General remark

A database can hold storage-data just as is, or working-data, you want to use in the one or the other way. Storing a picture as VARBINARY(MAX) is just a chunk of bits. You would not try to use SQL-Server to perform image reckognition.

This is the same with text data. If you just store a chunk of text, it won't matter, how you do this. But if you want to use this text for filtering, searching or if you want to use SQL-Server to display this text, you must think about the format and the needs for performance.

An enocding with variable byte lengths will not allow a simple SUBSTRING('blahblah',2,3). With fixed length the engine can just take the string as an array, jump to the second index and pick the next three characters. But with variable bytes the engine will have to compute the index by checking all characters before, if there might be any multi-byte code point. This will slow down a lot of string methods extremly...

Best was, not to store data in a format, the SQL-Server cannot handle (well)...

Solution 3:[3]

If you had SQL server 2019, you could create another database with UTF8 as default collation and create simple function there:

USE UTF8_DATABASE
GO

CREATE OR ALTER FUNCTION dbo.VarBinaryToUTF8
  (@UTF8 VARBINARY(MAX))
  RETURNS VARCHAR(MAX)
AS
BEGIN
  RETURN CAST(@UTF8 AS VARCHAR(MAX));
END;

You would than call

SELECT
  UTF8_DATABASE.dbo.VarBinaryToUTF8
  (
    CAST('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')
  )
FROM
  @t

This works because SQL server uses the default collation of specific database for its variables and function return values. You have to store the result into NVARCHAR or UTF8 collated 'VARCHAR in your non-UTF8` database.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Community
Solution 3 andowero