Bug 30243

Summary: Add full support for GB18030-2022
Product: glibc
Component: locale
Status: UNCONFIRMED ---
Severity: critical
Priority: P1
Version: 2.39
Target Milestone: 2.39
Reporter: starcold14 <starcold14>
Assignee: Mike FABIAN <maiku.fabian>
CC: bruno, carlos, jamborm, lijianglin2, liqingqing3, maiku.fabian, matz
Host:
Target:
Build:
Last reconfirmed:
Attachments: mapping tables
Part 1 of a draft proposed fix
Part 2 of a draft proposed fix

Description starcold14 2023-03-18 17:05:59 UTC
GB18030-2022 is coming, and glibc still uses the GB18030-2005 standard, which should be updated!
The changes in GB18030-2022 are described in Dr. Ken Lunde's article (https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132).
Comment 1 starcold14 2023-03-18 17:12:45 UTC
https://bugs.openjdk.org/browse/JDK-8301119?attachmentViewMode=gallery
The JDK has already been updated to the new standard. The complete charmap is available via the link above.
Comment 2 Bruno Haible 2023-05-20 22:52:10 UTC
Created attachment 14890 [details]
mapping tables

The official GB18030-2022 mapping table can be downloaded from http://www.nits.org.cn/index/article/4034 (two data files).

The difference between GB18030-2005 and GB18030-2022, regarding the mapping tables, is that GB18030-2022 gets rid of a PUA (private use area) mapping of some characters that were not part of Unicode in 2005 but are in Unicode nowadays. In other words, these PUA mappings are considered obsolete.
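
To make "PUA mapping" concrete, here is a minimal sketch that uses Python's unicodedata module to tell private-use code points from ordinary assigned ones. The sample code points are taken from the lists further down in this comment; the helper name is_private_use is made up for this illustration and is not part of glibc or libiconv.

```python
import unicodedata


def is_private_use(code_point: int) -> bool:
    """Return True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


# Sample code points mentioned later in this comment.
for cp in (0xE816, 0xE81E, 0x20087, 0x241FE):
    kind = "private use" if is_private_use(cp) else "not private use"
    print(f"U+{cp:04X}: {kind}")
```

The category lookup depends on the Unicode version bundled with the Python interpreter, so this is only a quick sanity check, not a conformance test.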

Find attached a tar file with
1) The current mapping tables from glibc (extracted from glibc 2.35, but it hasn't changed since then),
2) The mapping tables from GNU libiconv.
The <encoding>.TXT files describe the multibyte to Unicode conversion direction; the <encoding>.INVERSE.TXT files describe the Unicode to multibyte conversion direction.

What needs to be done in glibc?

* For the multibyte to Unicode conversion direction: Look at "diff -u glibc-2.35-iconv/GB18030.TXT libiconv/GB18030-2022.TXT"
  - Mappings for 0x82359037..0x82359134 and 0x84318236..0x84318335 need to be added.
  - The mappings of 0xFE51, 0xFE52, 0xFE53, 0xFE6C, 0xFE76, 0xFE91 need to be changed.

* For the Unicode to multibyte conversion direction: Look at "diff -u glibc-2.35-iconv/GB18030.INVERSE.TXT libiconv/GB18030-2022.INVERSE.TXT"
  - Mappings for U+E81E, U+E826, U+E82B, U+E82C, U+E832, U+E843, U+E854, U+E864, U+E78D..U+E796 need to be added.
  - The mappings of U+20087, U+20089, U+200CC, U+215D7, U+2298F, U+241FE need to be changed.
  - Mappings for U+E816, U+E817, U+E818, U+E831, U+E83B, U+E855 need to be added.
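
Note that the four-byte ranges above are not contiguous as plain integers: in a GB18030 four-byte code the first and third bytes take the values 0x81..0xFE, and the second and fourth bytes only 0x30..0x39. A small sketch that enumerates such a range follows; the helper four_byte_range is hypothetical and only meant to show which codes the ranges cover.

```python
def four_byte_range(first: int, last: int) -> list[int]:
    """List the valid GB18030 four-byte codes from first to last, inclusive.

    Bytes 1 and 3 run over 0x81..0xFE (126 values), bytes 2 and 4 over
    0x30..0x39 (10 values), so the codes are mapped to a linear index first.
    """

    def to_linear(code: int) -> int:
        b1, b2, b3, b4 = code.to_bytes(4, "big")
        return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

    def from_linear(index: int) -> int:
        index, d4 = divmod(index, 10)
        index, d3 = divmod(index, 126)
        d1, d2 = divmod(index, 10)
        return int.from_bytes(bytes([d1 + 0x81, d2 + 0x30, d3 + 0x81, d4 + 0x30]), "big")

    return [from_linear(i) for i in range(to_linear(first), to_linear(last) + 1)]


# The two ranges named in the multibyte to Unicode list above.
for first, last in ((0x82359037, 0x82359134), (0x84318236, 0x84318335)):
    codes = four_byte_range(first, last)
    print(len(codes), " ".join(f"0x{c:08X}" for c in codes))
```

By this count the two ranges hold 8 and 10 codes respectively.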
Comment 3 Andreas Schwab 2023-05-21 11:22:17 UTC
Feel free to propose a patch.
Comment 4 liqingqing 2023-06-16 01:28:09 UTC
we have finished a patch for this, please review, thanks!
https://patchwork.sourceware.org/project/glibc/patch/20230615113800.2174-1-lijianglin2@huawei.com/
Comment 5 Bruno Haible 2023-06-16 01:54:14 UTC
Created attachment 14930 [details]
Part 1 of a draft proposed fix
Comment 6 Bruno Haible 2023-06-16 01:54:49 UTC
Created attachment 14931 [details]
Part 2 of a draft proposed fix
Comment 7 Bruno Haible 2023-06-16 02:01:55 UTC
I think
1) It would be useful to change the unit tests to test the entire GB18030 charmap, not only the BMP part. Find attached a draft patch to that effect. I'm not happy with that; I intend to simplify it more.
2) My patch (part 2) also removes a few lines of "/* Handle a few special cases.  */" in iconvdata/gb18030.c. Yours doesn't.
3) The comment "The newest GB 18030-2005 standard still uses some private use area code points. ..." in localedata/charmaps/GB18030 should be removed, since it does not reflect reality any more. GB 18030-2005 is no longer the newest one. The newest one, from 2022, dropped the particular use of private use area code points.
Comment 8 lijianglin 2023-06-25 06:21:18 UTC
(In reply to Bruno Haible from comment #7)
> I think
> 1) It would be useful to change the unit tests to test the entire GB18030
> charmap, not only the BMP part. Find attached a draft patch to that effect.
> I'm not happy with that; I intend to simplify it more.
> 2) My patch (part 2) also removes a few lines of "/* Handle a few special
> cases.  */" in iconvdata/gb18030.c. Yours doesn't.
> 3) The comment "The newest GB 18030-2005 standard still uses some private
> use area code points. ..." in localedata/charmaps/GB18030 should be removed,
> since it does not reflect reality any more. GB 18030-2005 is no longer the
> newest one. The newest one, from 2022, dropped the particular use of private
> use area code points.

Thank you for your suggestion. I will include these in my patch
Comment 9 lijianglin 2023-06-27 06:15:11 UTC
we have updated the patch, please review, thanks!
https://patchwork.sourceware.org/project/glibc/patch/20230627034706.3053-1-lijianglin2@huawei.com/
Comment 10 lijianglin 2023-06-28 08:45:58 UTC
(In reply to lijianglin from comment #9)
> we have updated the patch, please review, thanks!
> https://patchwork.sourceware.org/project/glibc/patch/20230627034706.3053-1-
> lijianglin2@huawei.com/

We have adjusted localedata/charmaps/GB18030; the latest patch (v3) is as follows:
https://patchwork.sourceware.org/project/glibc/patch/20230627121549.3431-1-lijianglin2@huawei.com/
Comment 11 liqingqing 2023-06-28 14:33:06 UTC
(In reply to Bruno Haible from comment #7)
> I think
> 1) It would be useful to change the unit tests to test the entire GB18030
> charmap, not only the BMP part. Find attached a draft patch to that effect.
> I'm not happy with that; I intend to simplify it more.
> 2) My patch (part 2) also removes a few lines of "/* Handle a few special
> cases.  */" in iconvdata/gb18030.c. Yours doesn't.
> 3) The comment "The newest GB 18030-2005 standard still uses some private
> use area code points. ..." in localedata/charmaps/GB18030 should be removed,
> since it does not reflect reality any more. GB 18030-2005 is no longer the
> newest one. The newest one, from 2022, dropped the particular use of private
> use area code points.

ok, thanks!
Comment 12 lijianglin 2023-07-03 08:09:12 UTC
(In reply to lijianglin from comment #10)
> (In reply to lijianglin from comment #9)
> > we have updated the patch, please review, thanks!
> > https://patchwork.sourceware.org/project/glibc/patch/20230627034706.3053-1-
> > lijianglin2@huawei.com/
> 
> We have adjusted localedata/charmaps/GB18030; the latest patch (v3) is as follows:
> https://patchwork.sourceware.org/project/glibc/patch/20230627121549.3431-1-
> lijianglin2@huawei.com/

Is anyone looking at this patch?
Comment 13 Michael Matz 2023-08-16 13:26:03 UTC
Could this be given some attention please?  AFAICS the last version of the
patch didn't receive further suggestions, so maybe it's good to go?  (Background:
people here are starting to worry about gb18030-2022 conformance, and ideally we
want to include/backport only something that upstream also has)
Comment 14 Carlos O'Donell 2023-08-16 19:20:02 UTC
(In reply to Michael Matz from comment #13)
> Could this be given some attention please?  AFAICS the last version of the
> patch didn't receive further suggestions, so maybe it's good to go? 
> (Background:
> people here are starting to worry about gb18030-2022 conformance, and
> ideally we
> want to include/backport only something that upstream also has)

I agree completely that this needs review.

I reviewed v1, Andreas reviewed v2. The patch is currently at v3, and that version needs review:
https://patchwork.sourceware.org/project/glibc/patch/20230627121549.3431-1-lijianglin2@huawei.com/
Comment 15 Mike FABIAN 2023-08-16 20:22:34 UTC
I tested v3 a while ago (2023-07-26). I think it is good. 


I could compile and install it. I wrote a small python test program to test iconv with and without the patch.

https://mfabian.fedorapeople.org/misc/iconv-test.py

The patch made all the code points mentioned in the tables in

https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf

work correctly for the 2022 version of the standard.

Without the patch, only the two-byte GB18030 code points in the Annex A table work; the four-byte GB18030 code points work only with the patch.

The patch is also needed to make the conversions in the Annex B and Annex C tables work.
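
For readers who want to try a similar check, here is a rough sketch of the idea. It is not the script linked above: it assumes an iconv program on PATH that uses the installed glibc converters, and the names iconv_cli, gb18030_bytes and round_trips are hypothetical. It prints whatever the installed iconv produces and makes no claim about the expected bytes.

```python
import subprocess
from typing import Callable

# A converter takes (data, source charset, target charset) and returns bytes.
Converter = Callable[[bytes, str, str], bytes]


def iconv_cli(data: bytes, src: str, dst: str) -> bytes:
    """Convert data with the system iconv program (assumed to be on PATH)."""
    result = subprocess.run(
        ["iconv", "-f", src, "-t", dst],
        input=data, capture_output=True, check=True,
    )
    return result.stdout


def gb18030_bytes(code_point: int, convert: Converter = iconv_cli) -> str:
    """Return the GB18030 encoding of a code point as upper-case hex."""
    utf8 = chr(code_point).encode("utf-8")
    return convert(utf8, "UTF-8", "GB18030").hex().upper()


def round_trips(code_point: int, convert: Converter = iconv_cli) -> bool:
    """Check that UTF-8 -> GB18030 -> UTF-8 gives back the same character."""
    utf8 = chr(code_point).encode("utf-8")
    gb = convert(utf8, "UTF-8", "GB18030")
    return convert(gb, "GB18030", "UTF-8") == utf8


if __name__ == "__main__":
    # Code points taken from this bug report; results depend on the glibc in use.
    for cp in (0x8612, 0x20087, 0xE816):
        try:
            print(f"U+{cp:04X} -> {gb18030_bytes(cp)} round-trip={round_trips(cp)}")
        except (OSError, subprocess.CalledProcessError) as exc:
            print(f"U+{cp:04X}: conversion failed ({exc})")
```

Passing the converter as a parameter makes it possible to compare a patched and an unpatched build, for example by pointing a second converter at a different iconv binary.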

By the way, the Annex C table in the above PDF contains a typo in the first column, last row: it should be U+8612, not U+8162.
Comment 16 Carlos O'Donell 2023-08-16 20:28:08 UTC
(In reply to Mike FABIAN from comment #15)
> I tested v3 a while ago (2023-07-26). I think it is good. 
> By the way, the Annex C table in the above PDF contains a typo in the first
> column, last row: it should be U+8612, not U+8162.

Does v3 correct this problem?
Comment 17 liqingqing 2023-08-17 01:55:00 UTC
(In reply to Carlos O'Donell from comment #16)
> (In reply to Mike FABIAN from comment #15)
> > I tested v3 a while ago (2023-07-26). I think it is good. 
> > By the way, the Annex C table in the above PDF contains a typo in the first
> > column, last row: it should be U+8612, not U+8162.
> 
> Does v3 correct this problem?

(In reply to Mike FABIAN from comment #15)
> I tested v3 a while ago (2023-07-26). I think it is good. 
> 
> 
> I could compile and install it. I wrote a small python test program to test
> iconv with and without the patch.
> 
> https://mfabian.fedorapeople.org/misc/iconv-test.py
> 
>  The patch made all the codepoints mentioned in the tables in
> 
> https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
> 
> work correctly for the 2022 version of the standard.
> 
> Without the patch, only the two byte GB18030 code points in the Annex A
> table work, the 4 byte GB18030 codepoints work only with the patch. 
> 
> The patch is also needed to make the conversions in the Annex B and Annex C
> tables work.
> 
> By the way, the Annex C table in the above PDF contains a typo in the first
> column, last row: it should be U+8612, not U+8162.

Yes, you are right. I checked the GB18030-2022 standard:
http://c.gb688.cn/bzgk/gb/showGb?type=online&hcno=A1931A578FE14957104988029B0833D3

Page 67 lists the following code point:
Unicode code point U+8612 <--> GB18030 code point: CC55

So the above PDF is wrong; it should be U+8612.
Comment 18 Mike FABIAN 2023-08-17 10:25:38 UTC
(In reply to Carlos O'Donell from comment #16)
> (In reply to Mike FABIAN from comment #15)
> > I tested v3 a while ago (2023-07-26). I think it is good. 
> > By the way, the Annex C table in the above PDF contains a typo in the first
> > column, last row: it should be U+8612, not U+8162.
> 
> Does v3 correct this problem?

Yes, v3 is correct.
Comment 19 Mike FABIAN 2023-08-29 17:04:24 UTC
Patch v3 pushed to git master.
Comment 20 Carlos O'Donell 2023-08-29 17:31:13 UTC
(In reply to Mike FABIAN from comment #19)
> Patch v3 pushed to git master.

Thank you!

I set Target Milestone to 2.39 so it shows up in the 2.39 NEWS bug list.