22889 – strcoll/strxfrm broken for most characters in GB18030

Status:	UNCONFIRMED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	locale
Version:	2.37

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2018-02-24 20:52 UTC by Stephane Chazelas
Modified:	2023-11-19 09:58 UTC
CC List:	0 users

Flags:	fweimer: security-

Description Stephane Chazelas 2018-02-24 20:52:18 UTC

In the en_GB.UTF-8 locale:

$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' | sort -u
𝐀

That's not fine (bug 18927), but it is expected: the order of those 𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙 mathematical letter characters is not defined, so they all sort the same.

Now, in a zh_CN.gb18030 locale:

$ LC_ALL=zh_CN.gb18030 locale charmap
GB18030
$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' |
   (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
𝐈
𝐉
𝐀
𝐁
𝐂
𝐃
𝐄
𝐅
𝐆
𝐇

(where sort is GNU sort, which uses strcoll()). If we look at the strxfrm() output of the first few letters, we see:

$ export LC_ALL=zh_CN.gb18030
$ ./strxfrm $'\U1D400' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D401' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
[...]
$ ./strxfrm $'\U1D409' | od -An -vtx1
 05 03 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40a' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40b' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6

where strxfrm.c is:

#include <locale.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  char buf[4096];

  if (argc < 2)
    return 1;
  setlocale(LC_ALL, "");
  /* Only print the transform if it fit in buf. */
  if (strxfrm(buf, argv[1], sizeof(buf)) >= sizeof(buf))
    return 1;
  printf("%s", buf);
  return 0;
}

The output cycles through 10 different strxfrm() results before looping back to the first one.
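For reference, the strxfrm() output can also be dumped as hex from within C instead of piping it through od. This is only a minimal sketch: xfrm_hex is a hypothetical helper name, not part of the test program above, and it sizes its buffer with a first strxfrm() call that has a zero length rather than relying on a fixed 4096-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd string holding the strxfrm()
   transform of s as space-separated hex bytes (similar to what
   `od -An -vtx1` prints), or NULL on allocation failure.
   The caller frees the result. */
char *xfrm_hex(const char *s)
{
  assert(s != NULL);

  /* With a length of 0 the destination may be NULL; the return value
     is the transform length, excluding the terminating NUL. */
  size_t len = strxfrm(NULL, s, 0);
  char *raw = malloc(len + 1);
  char *hex = malloc(len * 3 + 1);

  if (raw == NULL || hex == NULL) {
    free(raw);
    free(hex);
    return NULL;
  }

  strxfrm(raw, s, len + 1);
  hex[0] = '\0';
  for (size_t i = 0; i < len; i++) {
    /* Two hex digits per byte, separated by single spaces. */
    sprintf(hex + i * 3, i + 1 < len ? "%02x " : "%02x",
            (unsigned char)raw[i]);
  }
  free(raw);
  return hex;
}
```

The result depends on the current locale, so the caller still needs setlocale(LC_ALL, "") and a zh_CN.gb18030 environment to see the transforms quoted in this report.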

If we look at those characters:

$ printf '\U1D400' | od -An -vtx1 -vtc
  94  33  8a  32
 224   3 212   2
$ printf '\U1D401' | od -An -vtx1 -vtc
  94  33  8a  33
 224   3 212   3
$ printf '\U1D40a' | od -An -vtx1 -vtc
  94  33  8b  32
 224   3 213   2

See how the last byte of both U+1D400 and U+1D40A is 0x32, the encoding of "2".
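The cycle of 10 follows from how GB18030 lays out supplementary code points: U+10000 and above map linearly onto four-byte sequences starting at 90 30 81 30, and the last byte counts through the ASCII digits 0x30..0x39. A minimal sketch of that mapping (the function name is hypothetical; this is not glibc's converter):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode a supplementary-plane code point
   (U+10000..U+10FFFF) as a four-byte GB18030 sequence.  The range
   maps linearly, with U+10000 encoded as 90 30 81 30. */
void gb18030_encode_supplementary(uint32_t cp, unsigned char out[4])
{
  assert(cp >= 0x10000 && cp <= 0x10FFFF);

  uint32_t idx = cp - 0x10000;

  /* Mixed-radix split: 10 values for the 2nd and 4th bytes
     (ASCII digits), 126 values for the 3rd byte (0x81..0xFE). */
  out[3] = (unsigned char)(0x30 + idx % 10);
  idx /= 10;
  out[2] = (unsigned char)(0x81 + idx % 126);
  idx /= 126;
  out[1] = (unsigned char)(0x30 + idx % 10);
  idx /= 10;
  out[0] = (unsigned char)(0x90 + idx);
}
```

Calling it with 0x1D400 and 0x1D40A fills out[] with 94 33 8a 32 and 94 33 8b 32, matching the od output above: consecutive code points step the last byte through "0".."9", so any collation that looks at that byte as if it were a digit repeats every 10 characters.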

The strxfrm of "2" is:

$ ./strxfrm 2 | od -An -vtx1
 04 01 09 01 09

Its weights also appear in the strxfrm of U+1D400/U+1D40A:

 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
    04       09       09

The strxfrm() of U+1D400 looks like the strxfrm() of a string of two characters, as if strxfrm() treated U+1D400 as the concatenation of some other character and "2".

Note that mbtowc() is ok with those characters:

$ ./mbtowc $'\U1D400'
4 0X1D400

Where mbtowc.c is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

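int main(int argc, char* argv[])
{
  wchar_t c;
  int i;

  setlocale(LC_ALL, "");
  for (i = 1; i < argc; i++) {
    int n;
    c = 0;  /* defined value in case the conversion fails */
    n = mbtowc(&c, argv[i], strlen(argv[i]));
    /* wchar_t is not necessarily unsigned int, so cast for %X */
    printf("%d %#X\n", n, (unsigned int)c);
  }
  return 0;
}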

It's not limited to those characters. It seems to affect many (over one million) of the characters whose encoding ends in the encoding of a digit, but not all of them: U+00C3 and a few thousand others are not affected. Characters whose encoding does not end in the encoding of a digit are not affected at all. I could not reproduce it with any other character encoding.
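In GB18030, the only multi-byte sequences that can end in a digit byte are the four-byte ones, whose second and fourth bytes are always in 0x30..0x39; two-byte sequences have a trail byte in 0x40..0x7E or 0x80..0xFE. A minimal sketch of a structural classifier that picks out those candidate sequences (the function name is hypothetical, and it checks byte ranges only, not whether a code point is actually assigned):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the length in bytes (1, 2 or 4) of the
   GB18030 sequence starting at s, looking at no more than n bytes,
   or 0 if the bytes do not form a structurally valid sequence. */
int gb18030_seq_len(const unsigned char *s, size_t n)
{
  assert(s != NULL);

  if (n == 0)
    return 0;
  if (s[0] <= 0x7F)
    return 1;                      /* ASCII, including the digits */
  if (s[0] < 0x81 || s[0] > 0xFE || n < 2)
    return 0;                      /* bad or truncated lead byte */
  if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
    return 2;                      /* two-byte: trail is never a digit */
  if (s[1] >= 0x30 && s[1] <= 0x39 && n >= 4
      && s[2] >= 0x81 && s[2] <= 0xFE
      && s[3] >= 0x30 && s[3] <= 0x39)
    return 4;                      /* four-byte: ends in a digit byte */
  return 0;
}
```

For the bytes 94 33 8a 32 of U+1D400 this returns 4, while for "2" on its own it returns 1. A collation routine that ever resynchronises on the final byte of a four-byte sequence would see an ordinary digit there.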

Comment 1 Stephane Chazelas 2023-11-19 09:56:38 UTC

FWIW, the issue is still present in 2.37, though not for those mathematical letters, which now seem to have an order.

$ perl -C -le 'print chr$_ for 0x1f9d0..0x1f9df' | (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
🧘
🧙
🧐
🧑
🧒
🧓
🧔
🧕
🧖
🧗