In the en_GB.UTF-8 locale:

$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' | sort -u
𝐀

That's not fine (bug18927), but expected. The order of those 𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙 mathematical letter characters is not defined, so they all sort the same.

Now, in a zh_CN.gb18030 locale:

$ LC_ALL=zh_CN.gb18030 locale charmap
GB18030
$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' | (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
𝐈
𝐉
𝐀
𝐁
𝐂
𝐃
𝐄
𝐅
𝐆
𝐇

(where sort is GNU sort, which uses strcoll()).

If we look at the strxfrm() output of the first few letters, we see:

$ export LC_ALL=zh_CN.gb18030
$ ./strxfrm $'\U1D400' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D401' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
[...]
$ ./strxfrm $'\U1D409' | od -An -vtx1
 05 03 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40a' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40b' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6

where strxfrm.c is:

#include <locale.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  char buf[4096];

  if (argc < 2)
    return 1;
  /* Use the collation rules of the locale set in the environment. */
  setlocale(LC_ALL, "");
  strxfrm(buf, argv[1], sizeof(buf));
  printf("%s", buf);
  return 0;
}

There are 10 different strxfrm() outcomes before it loops back to the beginning. If we look at how those characters are encoded:

$ printf '\U1D400' | od -An -vtx1 -vtc
 94 33 8a 32
 224 3 212 2
$ printf '\U1D401' | od -An -vtx1 -vtc
 94 33 8a 33
 224 3 212 3
$ printf '\U1D40a' | od -An -vtx1 -vtc
 94 33 8b 32
 224 3 213 2

See how the last byte of both U+1D400 and U+1D40A is 0x32, the encoding of "2". The strxfrm() of "2" is:

$ ./strxfrm 2 | od -An -vtx1
 04 01 09 01 09

Which we find in the strxfrm() of U+1D400/U+1D40A:

 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
    04       09       09

The strxfrm() of U+1D400 looks like the strxfrm() of a string of two characters, as if strxfrm() considered U+1D400 to be the concatenation of something and "2".
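For comparing many keys side by side, it may be handier to get the hex directly instead of piping through od. The following is only a sketch: xfrm_hex() is a hypothetical helper name, not part of the original test programs, and it sizes the key buffer with the strxfrm(NULL, s, 0) idiom rather than a fixed 4096-byte array.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the report): write the strxfrm() key of `s`
 * into `out` as space-separated hex bytes.
 * Returns 0 on success, -1 on allocation failure or if `out` is too small. */
static int xfrm_hex(const char *s, char *out, size_t outsz)
{
    assert(s != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    /* With a size of 0, strxfrm() only reports the length of the key. */
    size_t len = strxfrm(NULL, s, 0);
    char *key = malloc(len + 1);
    if (key == NULL)
        return -1;
    strxfrm(key, s, len + 1);

    size_t pos = 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        int w = snprintf(out + pos, outsz - pos, "%s%02x",
                         i ? " " : "", (unsigned char)key[i]);
        if (w < 0 || (size_t)w >= outsz - pos) {
            free(key);
            return -1;          /* output buffer too small */
        }
        pos += (size_t)w;
    }
    free(key);
    return 0;
}
```

The caller is expected to have called setlocale(LC_ALL, "") first, as strxfrm.c does; without it the program stays in the "C" locale, where the key is not the one shown in the transcripts above.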
Note that mbtowc() is ok with those characters:

$ ./mbtowc $'\U1D400'
4 0X1D400

where mbtowc.c is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char* argv[])
{
  wchar_t c;
  int i;

  setlocale(LC_ALL, "");
  for (i = 1; i < argc; i++) {
    int n;
    /* n is the number of bytes consumed for the first character. */
    n = mbtowc(&c, argv[i], strlen(argv[i]));
    printf("%d %#X\n", n, (unsigned int)c);
  }
  return 0;
}

It's not limited to those characters. Only characters whose encoding ends in the encoding of a digit are affected, and many of them are (over one million), but not all: U+00C3, for instance, is not affected (nor are a few thousand others). I could not reproduce it with any other character encoding.
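The byte-level condition above can be sketched as follows. Both function names are hypothetical. The second check relies on the GB18030 four-byte layout (first and third bytes in 0x81..0xFE, second and fourth in 0x30..0x39), which the report does not state explicitly. It only identifies candidates by shape; it does not predict which of them actually trigger the problem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: non-zero if the last byte of the encoding is an
 * ASCII digit (0x30..0x39). */
static int ends_in_ascii_digit(const unsigned char *enc, size_t n)
{
    assert(enc != NULL);
    return n > 0 && enc[n - 1] >= 0x30 && enc[n - 1] <= 0x39;
}

/* Hypothetical helper: non-zero if `enc` has the shape of a GB18030
 * four-byte sequence. Shape only: no check that a character is assigned. */
static int is_gb18030_four_byte(const unsigned char *enc, size_t n)
{
    assert(enc != NULL);
    return n == 4
        && enc[0] >= 0x81 && enc[0] <= 0xfe
        && enc[1] >= 0x30 && enc[1] <= 0x39
        && enc[2] >= 0x81 && enc[2] <= 0xfe
        && enc[3] >= 0x30 && enc[3] <= 0x39;
}
```

For the bytes 94 33 8a 32 of U+1D400 shown earlier, both checks are true. For the single byte 32 of "2", only the first one is.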
FWIW, the issue is still present in 2.37, though not on those mathematical letters, which now seem to have an order.

$ perl -C -le 'print chr$_ for 0x1f9d0..0x1f9df' | (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
🧘
🧙
🧐
🧑
🧒
🧓
🧔
🧕
🧖
🧗