Bug 22889 - strcoll/strxfrm broken for most characters in GB18030
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.37
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
 
Reported: 2018-02-24 20:52 UTC by Stephane Chazelas
Modified: 2023-11-19 09:58 UTC (History)
fweimer: security-


Description Stephane Chazelas 2018-02-24 20:52:18 UTC
In the en_GB.UTF-8 locale

$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' | sort -u
𝐀

That's not fine (bug18927), but expected. The order of those 𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙 mathematical letter characters is not defined, so they all sort the same.

Now, in a zh_CN.gb18030 locale:

$ LC_ALL=zh_CN.gb18030 locale charmap
GB18030
$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' |
   (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
𝐈
𝐉
𝐀
𝐁
𝐂
𝐃
𝐄
𝐅
𝐆
𝐇

(where sort is GNU sort, which uses strcoll()). If we look at the strxfrm() output of the first few letters, we see:

$ export LC_ALL=zh_CN.gb18030
$ ./strxfrm $'\U1D400' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D401' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
[...]
$ ./strxfrm $'\U1D409' | od -An -vtx1
 05 03 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40a' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40b' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6

where strxfrm.c is

#include <locale.h>
#include <string.h>
#include <stdio.h>

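int main(int argc, char* argv[])
{
  char buf[4096];

  if (argc < 2)
    return 1;
  /* Use the collation rules of the locale named in the environment. */
  setlocale(LC_ALL, "");
  /* buf is assumed large enough for the short inputs used here. */
  strxfrm(buf, argv[1], sizeof(buf));
  printf("%s", buf);
  return 0;
}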

There are 10 different strxfrm() outcomes until it loops back to the beginning.
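
For anyone who wants to compare many keys without piping each one through od, here is a minimal sketch of a variant that hex-encodes the key inside the process. The helper name xfrm_hex is hypothetical (it is not part of the test program above), and it sizes its buffer with a first strxfrm(NULL, s, 0) call instead of assuming 4096 bytes is enough.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original report): returns a
   malloc()ed, space-separated hex dump of strxfrm(s) in the current
   LC_COLLATE locale, or NULL on allocation failure.
   The caller frees the result. */
char *xfrm_hex(const char *s)
{
  /* With a zero size, strxfrm() only reports the length it needs. */
  size_t need = strxfrm(NULL, s, 0);
  char *key = malloc(need + 1);
  char *hex = malloc(need * 3 + 1);
  size_t i;

  if (key == NULL || hex == NULL) {
    free(key);
    free(hex);
    return NULL;
  }
  strxfrm(key, s, need + 1);

  hex[0] = '\0';
  for (i = 0; i < need; i++)
    sprintf(hex + 3 * i, "%02x ", (unsigned char) key[i]);
  if (need > 0)
    hex[3 * need - 1] = '\0';   /* drop the trailing space */

  free(key);
  return hex;
}
```

Called after setlocale(LC_ALL, "") with LC_ALL=zh_CN.gb18030, this should produce the same bytes as the od dumps above, so the keys of a whole range of characters can be compared with strcmp() in one run.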

If we look at those characters:

$ printf '\U1D400' | od -An -vtx1 -vtc
  94  33  8a  32
 224   3 212   2
$ printf '\U1D401' | od -An -vtx1 -vtc
  94  33  8a  33
 224   3 212   3
$ printf '\U1D40a' | od -An -vtx1 -vtc
  94  33  8b  32
 224   3 213   2

See how the last byte of both U+1D400 and U+1D40A is 0x32, the encoding of "2".
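
The period of 10 matches how GB18030 encodes supplementary-plane characters: the code point offset from U+10000 is laid out linearly over four bytes, and the last byte is a digit byte (0x30 to 0x39) that cycles every 10 code points. A minimal sketch of that arithmetic follows; the function name is hypothetical, and it covers only U+10000..U+10FFFF (BMP characters use lookup tables and are not handled).

```c
#include <stddef.h>

/* Encode a supplementary-plane code point (U+10000..U+10FFFF) as its
   four GB18030 bytes.  Returns 4 on success, 0 if cp is out of range.
   Hypothetical helper for illustration only. */
size_t gb18030_encode_supplementary(unsigned long cp, unsigned char out[4])
{
  unsigned long idx;

  if (cp < 0x10000UL || cp > 0x10FFFFUL)
    return 0;

  idx = cp - 0x10000UL;                 /* linear index from U+10000 */
  out[3] = (unsigned char) (0x30 + idx % 10);   /* digit byte, period 10 */
  idx /= 10;
  out[2] = (unsigned char) (0x81 + idx % 126);
  idx /= 126;
  out[1] = (unsigned char) (0x30 + idx % 10);   /* second digit byte */
  idx /= 10;
  out[0] = (unsigned char) (0x90 + idx);
  return 4;
}
```

For U+1D400 this gives 94 33 8a 32 and for U+1D40A it gives 94 33 8b 32, matching the od output above. U+1D408 and U+1D409 end in 0x30 and 0x31, which is consistent with 𝐈 and 𝐉 sorting first in the zh_CN.gb18030 output.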

The strxfrm of "2" is:

$ ./strxfrm 2 | od -An -vtx1
 04 01 09 01 09

These bytes also appear in the strxfrm of U+1D400/U+1D40A:

 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
    04       09       09

The strxfrm() of U+1D400 looks like the strxfrm() of a string of two characters, as if strxfrm() considered U+1D400 to be the concatenation of something and "2".
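
One rough way to quantify "looks like two characters" is to count the bytes in the first section of each key. The sketch below assumes that a 0x01 byte separates collation levels in the key, which is what the dumps in this report suggest but is not a documented interface; the function name is hypothetical.

```c
#include <stddef.h>

/* Number of bytes before the first 0x01 byte in a sort key.
   Assumption: 0x01 separates collation levels, as the dumps in this
   report suggest.  Hypothetical helper for illustration only. */
size_t first_level_len(const unsigned char *key, size_t n)
{
  size_t i;

  for (i = 0; i < n && key[i] != 0x01; i++)
    ;
  return i;
}
```

Applied to the bytes shown above, the key for "2" has one byte (04) before the first separator, while the key for U+1D400 has two (05 04), the second of which equals the one for "2".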

Note that mbtowc() is ok with those characters:

$ ./mbtowc $'\U1D400'
4 0X1D400

Where mbtowc.c is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char* argv[])
{
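  wchar_t c;
  int i;

  setlocale(LC_ALL, "");
  for (i = 1; i < argc; i++) {
    int n;
    c = 0;  /* defined value to print even if mbtowc() fails */
    /* Decode only the first multibyte character of each argument. */
    n = mbtowc(&c, argv[i], strlen(argv[i]));
    printf("%d %#X\n", n, (unsigned int) c);
  }
  return 0;
}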

It's not limited to those characters. It seems to affect many (over one million) characters, all of which have an encoding that ends in the encoding of a digit. Not every such character is affected, though: U+00C3 and a few thousand others are not. I could not reproduce it with any other character encoding.

Comment 1 Stephane Chazelas 2023-11-19 09:56:38 UTC
FWIW, the issue is still present in 2.37, though not on those mathematical letters, which now seem to have an order.

$ perl -C -le 'print chr$_ for 0x1f9d0..0x1f9df' | (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
🧘
🧙
🧐
🧑
🧒
🧓
🧔
🧕
🧖
🧗