Why doesn't Git natively support UTF-16?
Solution 1:[1]
The first mention of UTF-8 in the Git codebase dates back to d4a9ce7 (Aug. 2005, v0.99.6), which was about mailbox patches:
Optionally, with the '-u' flag, the output to .info and .msg is transliterated from its original chaset [sic] to utf-8. This is to encourage people to use utf8 in their commit messages for interoperability.
This was signed by Junio C Hamano.
The character encoding was clarified in commit 3a59e59 (July 2015, Git v2.6.0-rc0):
That "git is encoding agnostic" is only really true for blob objects. E.g. the 'non-NUL bytes' requirement of tree and commit objects excludes UTF-16/32, and the special meaning of '/' in the index file as well as space and linefeed in commit objects eliminates EBCDIC and other non-ASCII encoding.
Git expects bytes < 0x80 to be pure ASCII, thus CJK encoding that partly overlap with the ASCII range are problematic as well. E.g. fmt_ident() removes trailing 0x5C from usernames on the assumption that it is ASCII '\'. However, there are over 200 GBK double byte codes that end in 0x5C.
UTF-8 as default encoding on Linux and respective path translations in the Mac and Windows versions have established UTF-8 NFC as de facto standard for path names.
See "git, msysgit, accents, utf-8, the definitive answers" for more on that last patch.
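The byte-level constraints in that commit message can be checked directly. The following is a minimal Python sketch, not Git code: the sample string and variable names are illustrative assumptions, and the GBK scan reflects whatever Python's built-in gbk codec accepts, which may not match the exact count quoted above.

```python
# Illustrative check of the byte-level constraints quoted above (not Git code).
sample = "Git"  # arbitrary ASCII sample text

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")

# UTF-8 keeps ASCII text free of NUL bytes; UTF-16 pads every ASCII
# character with a zero byte, which tree and commit objects cannot hold.
utf8_has_nul = b"\x00" in utf8_bytes
utf16_has_nul = b"\x00" in utf16_bytes

# Collect GBK double-byte codes whose trailing byte is 0x5C (ASCII backslash).
gbk_trailing_backslash = []
for lead in range(0x81, 0xFF):
    raw = bytes([lead, 0x5C])
    try:
        gbk_trailing_backslash.append((raw, raw.decode("gbk")))
    except UnicodeDecodeError:
        pass  # not an assigned code in Python's gbk codec

print("UTF-8 contains NUL:", utf8_has_nul)
print("UTF-16 contains NUL:", utf16_has_nul)
print(len(gbk_trailing_backslash), "GBK codes end in 0x5C according to this codec")
```

Any code that strips a trailing 0x5C byte on the assumption that it is a backslash would cut one of these GBK characters in half.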
The most recent version of Documentation/i18n.txt includes:
Git is to some extent character encoding agnostic.
The contents of the blob objects are uninterpreted sequences of bytes.
There is no encoding translation at the core level.
Path names are encoded in UTF-8 normalization form C.
This applies to:
- tree objects,
- the index file,
- ref names, as well as path names in
- command line arguments,
- environment variables and
- configuration files (.git/config, gitignore, gitattributes and gitmodules)
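To see why a single normalization form matters for path names, here is a small Python sketch using the standard unicodedata module. The file name is an arbitrary example, not taken from Git.

```python
import unicodedata

name = "caf\u00e9.txt"  # 'é' as a single precomposed code point

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # 'e' followed by a combining acute accent

# Both strings render identically but differ byte for byte in UTF-8.
nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)
print(nfd_bytes)
print("same bytes:", nfc_bytes == nfd_bytes)

# Normalizing to NFC before comparing makes the two spellings agree.
print("equal after NFC:", unicodedata.normalize("NFC", nfd) == nfc)
```

Without an agreed form, the same visible file name could appear as two different paths depending on which platform created it.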
You can see an example of UTF-8 path conversion in commit 0217569 (Jan. 2012, Git v2.1.0-rc0), which added Win32 Unicode file name support.
Changes opendir/readdir to use Windows Unicode APIs and convert between UTF-8/UTF-16.
Regarding command-line arguments, cf. commit 3f04614 (Jan. 2011, Git v2.1.0-rc0), which converts command line arguments from UTF-16 to UTF-8 on startup.
Note: before Git 2.21 (Feb. 2019), the code and tests assumed that the system-supplied iconv() would always use a BOM in its output when asked to encode to UTF-16 (or UTF-32), but apparently some implementations output big-endian without a BOM.
A compile-time knob has been added to help such systems (e.g. NonStop) to add BOM to the output to increase portability.
See commit 79444c9 (12 Feb 2019) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 18f9fb6, 13 Feb 2019)
utf8: handle systems that don't write BOM for UTF-16
When serializing UTF-16 (and UTF-32), there are three possible ways to write the stream. One can write the data with a BOM in either big-endian or little-endian format, or one can write the data without a BOM in big-endian format.
Most systems' iconv implementations choose to write it with a BOM in some endianness, since this is the most foolproof, and it is resistant to misinterpretation on Windows, where UTF-16 and the little-endian serialization are very common. For compatibility with Windows and to avoid accidental misuse there, Git always wants to write UTF-16 with a BOM, and will refuse to read UTF-16 without it.
However, musl's iconv implementation writes UTF-16 without a BOM, relying on the user to interpret it as big-endian. This causes t0028 and the related functionality to fail, since Git won't read the file without a BOM.
So the "compile-time knob" added here is in the Makefile:
# Define ICONV_OMITS_BOM if your iconv implementation does not write a
# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
# big-endian format.
#
ifdef ICONV_OMITS_BOM
BASIC_CFLAGS += -DICONV_OMITS_BOM
endif
Since a NonStop OS and its associated NonStop SQL product always use UTF-16BE (16-bit) encoding for the Unicode (UCS2) character set, you can use ICONV_OMITS_BOM in that environment.
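The three serializations described in that commit message, and the idea of adding a BOM when the encoder omits one, can be sketched in Python. This is only an illustration of the concept, not Git's C implementation; ensure_utf16_bom is a hypothetical helper name.

```python
import codecs


def ensure_utf16_bom(data: bytes) -> bytes:
    """Prepend a big-endian BOM when UTF-16 data carries none (hypothetical helper).

    Assumes BOM-less input is big-endian, as the commit message describes.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return data
    return codecs.BOM_UTF16_BE + data


text = "hi"
with_bom = text.encode("utf-16")      # Python's codec writes a BOM in native byte order
no_bom_be = text.encode("utf-16-be")  # big-endian without a BOM

fixed = ensure_utf16_bom(no_bom_be)

print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
print(fixed)
print(fixed.decode("utf-16"))  # the generic codec uses the BOM to pick the byte order
```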
Solution 2:[2]
Git recently has begun to understand encodings such as UTF-16. See gitattributes documentation—search for working-tree-encoding.
If you want .txt files to be UTF-16 without a BOM on a Windows machine, then add this to your gitattributes file:
*.txt text working-tree-encoding=UTF-16LE eol=CRLF
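According to the gitattributes documentation, content with a working-tree-encoding is stored as UTF-8 in the repository and re-encoded on checkout. The round trip can be pictured with the following Python sketch. The function names are hypothetical, and end-of-line conversion is deliberately left out.

```python
def to_repository(worktree_bytes: bytes, encoding: str) -> bytes:
    """Model of adding a file: decode the working-tree encoding, store UTF-8."""
    return worktree_bytes.decode(encoding).encode("utf-8")


def to_worktree(repo_bytes: bytes, encoding: str) -> bytes:
    """Model of checkout: decode the stored UTF-8, write the working-tree encoding."""
    return repo_bytes.decode("utf-8").encode(encoding)


# A UTF-16LE file without a BOM, as in the attribute line above.
worktree = "h\u00e9llo\r\n".encode("utf-16-le")

stored = to_repository(worktree, "utf-16-le")
restored = to_worktree(stored, "utf-16-le")

print(stored)                # UTF-8 bytes, no embedded NULs
print(restored == worktree)  # the round trip is lossless for this input
```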
In response to jthill's comments:
There isn't any doubt that UTF-16 is a mess. However, consider
Solution 3:[3]
Git support for UTF-16 is coming... for environment variables, with Git 2.20 (Q4 2018)
(and a bug fix in Git 2.21: see the second part of the answer)
See commit fe21c6b, commit 665177e (30 Oct 2018) by Johannes Schindelin (dscho).
Helped-by: Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 0474cd1, 13 Nov 2018)
mingw: reencode environment variables on the fly (UTF-16 <-> UTF-8)
On Windows, the authoritative environment is encoded in UTF-16.
In Git for Windows, we convert that to UTF-8 (because UTF-16 is such a foreign idea to Git that its source code is unprepared for it).
Previously, out of performance concerns, we converted the entire environment to UTF-8 in one fell swoop at the beginning, and upon putenv() and run_command() converted it back.
Having a private copy of the environment comes with its own perils: when a library used by Git's source code tries to modify the environment, it does not really work (in Git for Windows' case, libcurl, see git-for-windows/git/compare/bcad1e6d58^...bcad1e6d58^2 for a glimpse of the issues).
Hence, it makes our environment handling substantially more robust if we switch to on-the-fly-conversion in getenv()/putenv() calls.
Based on an initial version in the MSVC context by Jeff Hostetler, this patch makes it so.
Surprisingly, this has a positive effect on speed: at the time when the current code was written, we tested the performance, and there were so many getenv() calls that it seemed better to convert everything in one go.
In the meantime, though, Git has obviously been cleaned up a bit with regards to getenv() calls so that the Git processes spawned by the test suite use an average of only 40 getenv()/putenv() calls over the process lifetime.
Speaking of the entire test suite: the total time spent in the re-encoding in the current code takes about 32.4 seconds (out of 113 minutes runtime), whereas the code introduced in this patch takes only about 8.2 seconds in total.
Not much, but it proves that we need not be concerned about the performance impact introduced by this patch.
With Git 2.21 (Q1 2019), a bug introduced by the previous patch was fixed. It affected the GIT_EXTERNAL_DIFF command, which assumed that the string returned from getenv() was non-volatile. That assumption does not hold.
See commit 6776a84 (11 Jan 2019) by Kim Gybels (Jeff-G).
(Merged by Junio C Hamano -- gitster -- in commit 6a015ce, 29 Jan 2019)
The bug was reported in git-for-windows/git issue 2007:
"Unable to Use difftool on More than 8 File"
$ yes n | git -c difftool.prompt=yes difftool fe21c6b285df fe21c6b285df~100
Viewing (1/404): '.gitignore'
Launch 'bc3' [Y/n]?
Viewing (2/404): 'Documentation/.gitignore'
[...]
Viewing (8/404): 'Documentation/RelNotes/2.18.1.txt'
Launch 'bc3' [Y/n]?
Viewing (9/404): 'Documentation/RelNotes/2.19.0.txt'
Launch 'bc3' [Y/n]? error: cannot spawn ¦?: No such file or directory
fatal: external diff died, stopping at Documentation/RelNotes/2.19.1.txt
Hence:
diff: ensure correct lifetime of external_diff_cmd
According to getenv(3)'s notes:
The implementation of getenv() is not required to be reentrant.
The string pointed to by the return value of getenv() may be statically allocated, and can be modified by a subsequent call to getenv(), putenv(3), setenv(3), or unsetenv(3).
Since strings returned by getenv() are allowed to change on subsequent calls to getenv(), make sure to duplicate when caching external_diff_cmd from environment.
This problem becomes apparent on Git for Windows since fe21c6b (mingw: reencode environment variables on the fly (UTF-16 <-> UTF-8)), when the getenv() implementation provided in compat/mingw.c was changed to keep a certain amount of alloc'ed strings and freeing them on subsequent calls.
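The lifetime problem can be modeled in Python with a toy getenv() that recycles a small pool of buffers. This is only an analogy: ToyGetenv is a hypothetical class, the pool size of 4 is arbitrary, and the real compat/mingw.c code frees strings rather than overwriting them.

```python
class ToyGetenv:
    """Toy model of a getenv() that hands out buffers from a small recycled pool."""

    def __init__(self, env, slots=4):
        self._env = env
        self._pool = [bytearray() for _ in range(slots)]
        self._next = 0

    def getenv(self, name):
        value = self._env.get(name)
        if value is None:
            return None
        buf = self._pool[self._next]
        self._next = (self._next + 1) % len(self._pool)
        buf[:] = value.encode("utf-8")  # overwrite the recycled buffer in place
        return buf                      # the caller receives the shared buffer itself


env = ToyGetenv({"GIT_EXTERNAL_DIFF": "bc3", "HOME": "/home/user"}, slots=4)

cached = env.getenv("GIT_EXTERNAL_DIFF")         # buggy: keeps a reference to pool memory
copied = bytes(env.getenv("GIT_EXTERNAL_DIFF"))  # fixed: duplicates the value right away

# Enough later lookups to cycle through the whole pool.
for _ in range(4):
    env.getenv("HOME")

print(bytes(cached))  # no longer the diff command
print(copied)         # still the original value
```

The copy corresponds to the duplication the commit adds when caching external_diff_cmd.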
Git 2.24 (Q4 2019) fixes a hack introduced previously.
See commit 2049b8d, commit 97fff61 (30 Sep 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 772cad0, 09 Oct 2019)
Move git_sort(), a stable sort, into libgit.a
The qsort() function is not guaranteed to be stable, i.e. it does not promise to maintain the order of items it is told to consider equal.
In contrast, the git_sort() function we carry in compat/qsort.c is stable, by virtue of implementing a merge sort algorithm.
In preparation for using a stable sort in Git's rename detection, move the stable sort into libgit.a so that it is compiled in unconditionally, and rename it to git_stable_qsort().
Note: this also makes the hack obsolete that was introduced in fe21c6b (mingw: reencode environment variables on the fly (UTF-16 <-> UTF-8), 2018-10-30, Git v2.20.0-rc0), where we included compat/qsort.c directly in compat/mingw.c to use the stable sort.
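What "stable" means here can be shown with a short merge sort in Python. This is a generic textbook sketch, not a port of compat/qsort.c, and the sample entries are made up.

```python
def stable_merge_sort(items, key):
    """Return a new list sorted by key; items with equal keys keep their input order."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = stable_merge_sort(items[:mid], key)
    right = stable_merge_sort(items[mid:], key)

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # '<=' takes from the left half on ties, which is what preserves input order.
        if key(left[i]) <= key(right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


entries = [("b.txt", 50), ("a.txt", 90), ("c.txt", 50), ("d.txt", 90)]
by_score = stable_merge_sort(entries, key=lambda entry: entry[1])
print(by_score)
```

Entries with the same score come out in the order they went in, which an unstable sort does not promise.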
Solution 4:[4]
The short form is: adding support for wide characters makes everything harder. Everything that deals with any of the 8-bit ISO code pages or UTF-8 or any of the other MBCS's can scan/span/copy strings without much effort. Try to add support for strings whose transfer encoding contains embedded nulls and the complications to even trivial operations start bloating all your code.
I don't know of any even claimed advantages to UTF-16 that aren't more than undone by the downsides that show up when you actually start using it. You can identify a string boundary in any of ASCII, UTF-8, all 16 ISO/IEC-8859 sets, all the EBCDICs, plus probably a dozen more, with the same simple code. With only slight restrictions (ascii-based, with a few lines added for multiple line terminator conventions) you get basic tokenization, and transliteration to a common internal code page is basically free.
Add UTF-16 support and you just bought yourself a huge amount of added effort and complexity, but all that work enables nothing -- after saying "oh, but now it can handle UTF-16!", what else is now possible with all that added bloat and effort? Nothing. Everything UTF-16 can do, UTF-8 can do as well and usually much better.
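A small Python sketch of the point about simple byte-oriented scanning: splitting on the newline byte works on UTF-8 but misfires on UTF-16. The helper name and the sample string are illustrative assumptions.

```python
def split_lines_bytewise(data: bytes):
    """Naive byte-oriented line splitting, as simple C-style code would do it."""
    return data.split(b"\n")


# 'a', U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE), 'b': no newline anywhere.
text = "a\u010ab"

utf8_parts = split_lines_bytewise(text.encode("utf-8"))
utf16_parts = split_lines_bytewise(text.encode("utf-16-le"))

print(utf8_parts)   # one piece: UTF-8 never uses byte 0x0A inside another character
print(utf16_parts)  # two pieces: the low byte of U+010A is 0x0A

# A NUL-terminated scan of the UTF-16 data would also stop after the first character.
print(text.encode("utf-16-le").index(b"\x00"))
```

Handling UTF-16 correctly means replacing every such byte-wise scan with code that works on 16-bit units.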
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Peter Mortensen |
| Solution 3 | |
| Solution 4 | |
