Detecting 'text' file type (ANSI vs UTF-8)

I wrote an application (a psychological testing exam) in Delphi 7 which creates a standard text file, i.e. the file is of type ANSI.

Someone has ported the program to run on the Internet, probably using Java, and the resulting text file is of type UTF-8.

The program which reads these results files will have to read both the files created by Delphi and the files created via the Internet.

Whilst I can convert the UTF-8 text to ANSI (using the cunningly named function UTF8ToANSI), how can I tell in advance which kind of file I have?

Seeing as I 'own' the file format, I suppose the easiest way to deal with this would be to place a marker within the file at a known position which tells me the source of the file (Delphi/Internet), but this seems to be cheating.

Thanks in advance.



Solution 1:[1]

There is no 100% reliable way to distinguish ANSI (e.g. Windows-1250) encoding from UTF-8 encoding. There are ANSI files which cannot be valid UTF-8, but every valid UTF-8 file could equally well be a different ANSI file. (Not to mention ASCII-only data, which is both ANSI and UTF-8 by definition, but there the ambiguity is harmless since both interpretations agree.)

For instance, the sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in windows-1250. Both are possible and correct. However, e.g. 8D 9A can be “Ťš” in windows-1250, but it is not a valid UTF-8 string.

You have to resort to some kind of heuristic, e.g.

  1. If the file contains a sequence which cannot be valid UTF-8, assume it is ANSI (see the validity-check sketch after this list).
  2. Otherwise, if the file begins with the UTF-8 BOM (EF BB BF), assume it is UTF-8 (it might not be; however, a plain-text ANSI file beginning with those bytes is very improbable).
  3. Otherwise, assume it is UTF-8. (Or try more heuristics, maybe using knowledge of the language of the text, etc.)
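Here is a minimal Delphi sketch of the validity check from step 1, assuming the file's raw bytes have already been loaded into an AnsiString. IsValidUTF8 is a hypothetical helper (not an RTL function), and it deliberately skips the stricter overlong-encoding and surrogate checks:

function IsValidUTF8(const S: AnsiString): Boolean;
var
  i, Len, Follow: Integer;
  B: Byte;
begin
  Result := False;
  i := 1;
  Len := Length(S);
  while i <= Len do
  begin
    B := Byte(S[i]);
    if B < $80 then
      Follow := 0                  // 0xxxxxxx: single-byte (ASCII)
    else if (B and $E0) = $C0 then
      Follow := 1                  // 110xxxxx: two-byte sequence
    else if (B and $F0) = $E0 then
      Follow := 2                  // 1110xxxx: three-byte sequence
    else if (B and $F8) = $F0 then
      Follow := 3                  // 11110xxx: four-byte sequence
    else
      Exit;                        // invalid lead byte: not UTF-8
    Inc(i);
    while Follow > 0 do
    begin
      // every following byte must be a continuation byte (10xxxxxx)
      if (i > Len) or ((Byte(S[i]) and $C0) <> $80) then
        Exit;
      Inc(i);
      Dec(Follow);
    end;
  end;
  Result := True;  // every sequence checked out
end;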

See also the method used by Notepad.

Solution 2:[2]

To summarize:

  • The best solution for basic usage is the (dated) Windows API function IsTextUnicode() — a calling sketch follows this list;
  • The best solution for advanced usage is to use the function above, then check for a BOM (in roughly the first 1 KB), then check the locale info of the particular OS — and even then you get only about 98% accuracy.
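A minimal sketch of calling IsTextUnicode(). The API lives in advapi32.dll; it is imported here under a local name so the sketch does not depend on how a particular Delphi version declares it in Windows.pas. Note that it tests for UTF-16 text (what older Windows documentation calls “Unicode”), not UTF-8, and it is known to guess wrong on short buffers:

uses Windows;  // for the BOOL type

function WinIsTextUnicode(lpv: Pointer; iSize: Integer; lpiResult: PInteger): BOOL;
  stdcall; external 'advapi32.dll' name 'IsTextUnicode';

{ True when the API's statistical tests think the buffer holds UTF-16 text }
function LooksLikeUTF16(Buf: Pointer; Size: Integer): Boolean;
begin
  // nil test flags: let the API run all of its tests
  Result := WinIsTextUnicode(Buf, Size, nil);
end;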

Other info people may find interesting:

https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA

{ Requires the Classes unit (TMemoryStream); WideFileExists comes from the
  TNT Unicode controls (TntSysUtils). }
function FileMayBeUTF8(FileName: WideString): Boolean;
var
  Stream: TMemoryStream;
  BytesRead: Integer;
  ArrayBuff: array[0..127] of Byte;
  PreviousByte: Byte;
  i: Integer;
  YesSequences, NoSequences: Integer;
begin
  Result := False;
  if not WideFileExists(FileName) then
    Exit;
  YesSequences := 0;
  NoSequences := 0;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    repeat
      { Read the next chunk from the TMemoryStream }
      BytesRead := Stream.Read(ArrayBuff, High(ArrayBuff) + 1);
      { Count byte pairs: a continuation byte (10xxxxxx) after a lead byte
        (11xxxxxx) argues for UTF-8; a continuation byte after a plain ASCII
        byte (0xxxxxxx) argues against it. Pairs spanning a 128-byte chunk
        boundary are not compared, which is acceptable for a heuristic. }
      if BytesRead > 1 then
        for i := 1 to BytesRead - 1 do
        begin
          PreviousByte := ArrayBuff[i - 1];
          if (ArrayBuff[i] and $C0) = $80 then
          begin
            if (PreviousByte and $C0) = $C0 then
              Inc(YesSequences)
            else if (PreviousByte and $80) = $00 then
              Inc(NoSequences);
          end;
        end;
    until BytesRead < (High(ArrayBuff) + 1);
    { Below, >= classifies pure ASCII files as UTF-8, which is no problem;
      a strict > would catch only files containing multi-byte sequences. }
    Result := YesSequences >= NoSequences;
  finally
    Stream.Free;
  end;
end;

Now testing this function...

In my humble opinion, the only way to start this check correctly is to examine the OS charset first, because in almost all cases the data ultimately makes some reference to the OS. There is no way to escape that anyway...


Solution 3:[3]

When reading, first try parsing the file as UTF-8. If it isn't valid UTF-8, interpret the file as the legacy encoding (ANSI). This works on most files, since it's very unlikely that a legacy-encoded file is valid UTF-8 (see the sketch below).
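A minimal sketch of that approach, reusing the hypothetical IsValidUTF8 helper sketched under Solution 1:

uses Classes, SysUtils;

{ Read the raw bytes, then decode as UTF-8 when valid, else treat as ANSI }
function ReadResultsFile(const FileName: string): WideString;
var
  FS: TFileStream;
  Raw: AnsiString;
begin
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, FS.Size);
    if FS.Size > 0 then
      FS.ReadBuffer(Raw[1], FS.Size);
  finally
    FS.Free;
  end;
  if IsValidUTF8(Raw) then
    Result := UTF8Decode(Raw)   // a leading BOM, if any, survives as U+FEFF
  else
    Result := WideString(Raw);  // fall back to the system ANSI codepage
end;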

What Windows calls “ANSI” is a system-locale-dependent charset, so the same text won't read correctly on a Russian, Asian, or other differently localized Windows.

While the VCL doesn't support Unicode in Delphi 7, you should still work with Unicode internally and only convert to ANSI for display. I localized one of my programs to Korean and Russian, and that was the only way I got it working without large problems. You could still only display the Korean localization on a system set to Korean, but at least the text files could be edited on any system.

Solution 4:[4]

// If the text can be decoded as UTF-8, then assume it is UTF-8.
// (In Delphi 7, UTF8Decode returns an empty string when the input is not
// valid UTF-8, so this is only a rough check, and it misreports empty files.)

function isFileUTF8(const Tex: AnsiString): Boolean;
begin
  Result := (Tex <> '') and (UTF8Decode(Tex) <> '');
end;

Solution 5:[5]

Forget BOM and other advice. Here's what I found and keep for reference:

“Valid UTF8 has a specific binary format. If it's a single-byte UTF8 character, then it is always of the form '0xxxxxxx', where 'x' is any binary digit. If it's a two-byte UTF8 character, then it's always of the form '110xxxxx 10xxxxxx'.”

Source.
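As a tiny worked example of those bit forms, reusing the C4 8D sequence from Solution 1: mask off the marker bits and concatenate the payloads to recover the code point:

program DecodeTwoByteUTF8;
{$APPTYPE CONSOLE}
uses SysUtils;
var
  B1, B2: Byte;
  CodePoint: Word;
begin
  B1 := $C4;  // 110xxxxx: lead byte, payload 00100
  B2 := $8D;  // 10xxxxxx: continuation byte, payload 001101
  CodePoint := ((B1 and $1F) shl 6) or (B2 and $3F);
  WriteLn(Format('U+%.4X', [CodePoint]));  // prints U+010D, i.e. 'č'
end.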

By the way, you're mostly on your own. The knowledge of codepages, UTF etc. isn't that good in the West, so the quality of advice is similarly... questionable.

Solution 6:[6]

As others said, there is no perfect way; you have to use heuristics. Here is a method I use which provides good results, assuming you already know which legacy (ANSI) charset to expect (e.g. ISO-8859-1 or Windows-1252):

  1. Check if there is a BOM header. If yes, it's UTF-8.
  2. Check if there is any byte equal to or higher than 0x80 (except 0xA0, which is NBSP). If there isn't any, it's plain ASCII.
  3. Open the file as UTF-8 and check whether every decoded character belongs to the expected charset (e.g. ISO-8859-1). If a decoded character falls outside it, the file is probably ANSI rather than UTF-8.

If you don't know the charset in advance: follow steps 1 and 2. For step 3, open the file as ANSI with different charsets (and as UTF-8); for each result, perform the tests and compute a score/confidence, then take the interpretation that fits best. This is how Notepad++ tries to detect text encoding. A sketch of steps 1 and 2 follows.
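A minimal sketch of steps 1 and 2, assuming the raw bytes are already in an AnsiString; step 3 (decoding and scoring against candidate charsets) is left out:

{ Returns 'UTF-8' on a BOM, 'ASCII' when no byte >= $80 occurs,
  and 'unknown' when step 3 would be needed to decide }
function QuickEncodingGuess(const Raw: AnsiString): string;
var
  i: Integer;
begin
  // Step 1: UTF-8 BOM (EF BB BF)
  if (Length(Raw) >= 3) and (Raw[1] = #$EF) and (Raw[2] = #$BB)
    and (Raw[3] = #$BF) then
  begin
    Result := 'UTF-8';
    Exit;
  end;
  // Step 2: any byte >= $80 means the file is not plain ASCII
  for i := 1 to Length(Raw) do
    if Byte(Raw[i]) >= $80 then
    begin
      Result := 'unknown';  // needs step 3 to disambiguate
      Exit;
    end;
  Result := 'ASCII';
end;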

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2
Solution 3 CodesInChaos
Solution 4 N3R4ZZuRR0
Solution 5 Michał Leon
Solution 6