Summary: | Add full support for GB18030-2022 | ||
---|---|---|---|
Product: | glibc | Reporter: | starcold14 <starcold14> |
Component: | locale | Assignee: | Mike FABIAN <maiku.fabian> |
Status: | UNCONFIRMED --- | ||
Severity: | critical | CC: | bruno, carlos, jamborm, lijianglin2, liqingqing3, maiku.fabian, matz |
Priority: | P1 | ||
Version: | 2.39 | ||
Target Milestone: | 2.39 | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Attachments: | mapping tables; Part 1 of a draft proposed fix; Part 2 of a draft proposed fix |
Description
starcold14
2023-03-18 17:05:59 UTC
https://bugs.openjdk.org/browse/JDK-8301119?attachmentViewMode=gallery

JDK has updated the standard. Following the above link gives the complete charmap.

Created attachment 14890
mapping tables

The official GB18030-2022 mapping table can be downloaded from http://www.nits.org.cn/index/article/4034 (two data files).

The difference between GB18030-2005 and GB18030-2022, regarding the mapping tables, is that GB18030-2022 gets rid of a PUA (private use area) mapping of some characters that were not part of Unicode in 2005 but are in Unicode nowadays. In other words, these PUA mappings are considered obsolete.

Find attached a tar file with
1) The current mapping tables from glibc (extracted from glibc 2.35, but they haven't changed since then),
2) The mapping tables from GNU libiconv.

The <encoding>.TXT files describe the multibyte to Unicode conversion direction; the <encoding>.INVERSE.TXT files describe the Unicode to multibyte conversion direction.

What needs to be done in glibc?

* For the multibyte to Unicode conversion direction:
  Look at "diff -u glibc-2.35-iconv/GB18030.TXT libiconv/GB18030-2022.TXT"
  - Mappings for 0x82359037..0x82359134 and 0x84318236..0x84318335 need to be added.
  - The mappings of 0xFE51, 0xFE52, 0xFE53, 0xFE6C, 0xFE76, 0xFE91 need to be changed.
* For the Unicode to multibyte conversion direction:
  Look at "diff -u glibc-2.35-iconv/GB18030.INVERSE.TXT libiconv/GB18030-2022.INVERSE.TXT"
  - Mappings for U+E81E, U+E826, U+E82B, U+E82C, U+E832, U+E843, U+E854, U+E864, U+E78D..U+E796 need to be added.
  - The mappings of U+20087, U+20089, U+200CC, U+215D7, U+2298F, U+241FE need to be changed.
  - Mappings for U+E816, U+E817, U+E818, U+E831, U+E83B, U+E855 need to be added.

Feel free to propose a patch.

We have finished a patch for this; please review, thanks!
https://patchwork.sourceware.org/project/glibc/patch/20230615113800.2174-1-lijianglin2@huawei.com/

Created attachment 14930
Part 1 of a draft proposed fix

Created attachment 14931
Part 2 of a draft proposed fix
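
As a quick sanity check on the four-byte ranges listed in the description above, the sketch below computes how many codes each range contains. It relies on the general structure of four-byte GB18030 codes (first and third bytes in 0x81..0xFE, second and fourth bytes in 0x30..0x39); the helper names are hypothetical and are not taken from glibc or libiconv.

```python
def gb18030_four_byte_index(code: int) -> int:
    """Return the linear index of a four-byte GB18030 code.

    `code` is the four bytes read as one big-endian integer,
    for example 0x82359037.
    """
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError(f"not a four-byte GB18030 code: {code:#010x}")
    # Mixed-radix number with digit bases 126, 10, 126, 10.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def range_size(first: int, last: int) -> int:
    """Number of four-byte codes in the inclusive range first..last."""
    return gb18030_four_byte_index(last) - gb18030_four_byte_index(first) + 1


print(range_size(0x82359037, 0x82359134))  # 8
print(range_size(0x84318236, 0x84318335))  # 10
```

The sizes 8 and 10 equal the number of code points in the lists U+E81E..U+E864 (eight entries) and U+E78D..U+E796 (ten entries) given for the opposite direction. That is only a consistency check; the actual pairing of codes and code points has to be read from the attached mapping tables.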
I think
1) It would be useful to change the unit tests to test the entire GB18030 charmap, not only the BMP part. Find attached a draft patch to that effect. I'm not happy with that; I intend to simplify it more.
2) My patch (part 2) also removes a few lines of "/* Handle a few special cases. */" in iconvdata/gb18030.c. Yours doesn't.
3) The comment "The newest GB 18030-2005 standard still uses some private use area code points. ..." in localedata/charmaps/GB18030 should be removed, since it does not reflect reality any more. GB 18030-2005 is no longer the newest one. The newest one, from 2022, dropped the particular use of private use area code points.

(In reply to Bruno Haible from comment #7)
> I think
> 1) It would be useful to change the unit tests to test the entire GB18030
> charmap, not only the BMP part. Find attached a draft patch to that effect.
> I'm not happy with that; I intend to simplify it more.
> 2) My patch (part 2) also removes a few lines of "/* Handle a few special
> cases. */" in iconvdata/gb18030.c. Yours doesn't.
> 3) The comment "The newest GB 18030-2005 standard still uses some private
> use area code points. ..." in localedata/charmaps/GB18030 should be removed,
> since it does not reflect reality any more. GB 18030-2005 is no longer the
> newest one. The newest one, from 2022, dropped the particular use of private
> use area code points.

Thank you for your suggestion. I will include these in my patch.

We have updated the patch; please review, thanks!
https://patchwork.sourceware.org/project/glibc/patch/20230627034706.3053-1-lijianglin2@huawei.com/

(In reply to lijianglin from comment #9)
> We have updated the patch; please review, thanks!
> https://patchwork.sourceware.org/project/glibc/patch/20230627034706.3053-1-lijianglin2@huawei.com/

We have adjusted localedata/charmaps/GB18030; the latest patch (v3) is as follows:
https://patchwork.sourceware.org/project/glibc/patch/20230627121549.3431-1-lijianglin2@huawei.com/

(In reply to Bruno Haible from comment #7)
> I think
> 1) It would be useful to change the unit tests to test the entire GB18030
> charmap, not only the BMP part. Find attached a draft patch to that effect.
> I'm not happy with that; I intend to simplify it more.
> 2) My patch (part 2) also removes a few lines of "/* Handle a few special
> cases. */" in iconvdata/gb18030.c. Yours doesn't.
> 3) The comment "The newest GB 18030-2005 standard still uses some private
> use area code points. ..." in localedata/charmaps/GB18030 should be removed,
> since it does not reflect reality any more. GB 18030-2005 is no longer the
> newest one. The newest one, from 2022, dropped the particular use of private
> use area code points.

OK, thanks!

(In reply to lijianglin from comment #10)
> (In reply to lijianglin from comment #9)
> > We have updated the patch; please review, thanks!
> > https://patchwork.sourceware.org/project/glibc/patch/20230627034706.3053-1-lijianglin2@huawei.com/
>
> We have adjusted localedata/charmaps/GB18030; the latest patch (v3) is as follows:
> https://patchwork.sourceware.org/project/glibc/patch/20230627121549.3431-1-lijianglin2@huawei.com/

Could anyone take a look at this patch?

Could this be given some attention please? AFAICS the last version of the patch didn't receive further suggestions, so maybe it's good to go? (Background: people here are starting to worry about gb18030-2022 conformance, and ideally we want to include/backport only something that upstream also has)

(In reply to Michael Matz from comment #13)
> Could this be given some attention please? AFAICS the last version of the
> patch didn't receive further suggestions, so maybe it's good to go?
> (Background: people here are starting to worry about gb18030-2022
> conformance, and ideally we want to include/backport only something that
> upstream also has)

I agree completely that this needs review. I reviewed v1, Andreas reviewed v2. The patch is currently at v3, and that version needs review:
https://patchwork.sourceware.org/project/glibc/patch/20230627121549.3431-1-lijianglin2@huawei.com/

I tested v3 a while ago (2023-07-26). I think it is good.

I could compile and install it. I wrote a small Python test program to test iconv with and without the patch.

https://mfabian.fedorapeople.org/misc/iconv-test.py

The patch made all the code points mentioned in the tables in

https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf

work correctly for the 2022 version of the standard.

Without the patch, only the two-byte GB18030 code points in the Annex A table work; the 4-byte GB18030 code points work only with the patch.

The patch is also needed to make the conversions in the Annex B and Annex C tables work.

By the way, the Annex C table in the above PDF contains a typo in the first column, last row: it should be U+8612, not U+8162.

(In reply to Mike FABIAN from comment #15)
> I tested v3 a while ago (2023-07-26). I think it is good.
> By the way, the Annex C table in the above PDF contains a typo in the first
> column, last row: it should be U+8612, not U+8162.

Does v3 correct this problem?

(In reply to Carlos O'Donell from comment #16)
> (In reply to Mike FABIAN from comment #15)
> > I tested v3 a while ago (2023-07-26). I think it is good.
> > By the way, the Annex C table in the above PDF contains a typo in the first
> > column, last row: it should be U+8612, not U+8162.
>
> Does v3 correct this problem?

(In reply to Mike FABIAN from comment #15)
> I tested v3 a while ago (2023-07-26). I think it is good.
>
> I could compile and install it. I wrote a small Python test program to test
> iconv with and without the patch.
>
> https://mfabian.fedorapeople.org/misc/iconv-test.py
>
> The patch made all the code points mentioned in the tables in
>
> https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
>
> work correctly for the 2022 version of the standard.
>
> Without the patch, only the two-byte GB18030 code points in the Annex A
> table work; the 4-byte GB18030 code points work only with the patch.
>
> The patch is also needed to make the conversions in the Annex B and Annex C
> tables work.
>
> By the way, the Annex C table in the above PDF contains a typo in the first
> column, last row: it should be U+8612, not U+8162.

Yes, you are right. I checked the GB18030-2022 standard,
http://c.gb688.cn/bzgk/gb/showGb?type=online&hcno=A1931A578FE14957104988029B0833D3
Page 67 lists the following code point:
Unicode code point U+8612 <--> GB18030 code point CC55
So the above PDF is wrong; it should be U+8612.

(In reply to Carlos O'Donell from comment #16)
> (In reply to Mike FABIAN from comment #15)
> > I tested v3 a while ago (2023-07-26). I think it is good.
> > By the way, the Annex C table in the above PDF contains a typo in the first
> > column, last row: it should be U+8612, not U+8162.
>
> Does v3 correct this problem?

Yes, v3 is correct.

Patch v3 pushed to git master.

(In reply to Mike FABIAN from comment #19)
> Patch v3 pushed to git master.

Thank you! I set Target Milestone to 2.39 so it shows up in the 2.39 NEWS bug list.
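
The kind of check described in comment #15 (running iconv on known byte sequences and comparing the result with the expected code points) could look roughly like the sketch below. This is not the linked iconv-test.py script, whose contents are not reproduced in this report; the function names and the single test case are illustrative assumptions. The only mapping used is the one confirmed in this thread, U+8612 <--> CC55, and the sketch assumes an `iconv` program on PATH that accepts the usual `-f` and `-t` options.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def iconv_convert(data: bytes, from_code: str = "GB18030",
                  to_code: str = "UTF-8") -> bytes:
    """Convert bytes by piping them through the system iconv program."""
    result = subprocess.run(
        ["iconv", "-f", from_code, "-t", to_code],
        input=data, capture_output=True, check=True,
    )
    return result.stdout


def failed_cases(
    cases: Iterable[Tuple[bytes, str]],
    convert: Callable[[bytes], bytes] = iconv_convert,
) -> List[Tuple[bytes, str, str]]:
    """Return (input, expected, actual-or-error) for every failing case."""
    failures = []
    for raw, expected in cases:
        try:
            actual = convert(raw).decode("utf-8")
        except (subprocess.CalledProcessError, UnicodeDecodeError) as exc:
            # iconv rejected the input or produced invalid UTF-8.
            failures.append((raw, expected, repr(exc)))
            continue
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


# Illustrative case list: only the pair confirmed in this thread.
CASES = [(bytes.fromhex("CC55"), "\u8612")]
```

Calling `failed_cases(CASES)` once against an unpatched glibc and once against a patched one, with the case list extended from the official mapping tables, would show which conversions change. The `convert` parameter exists so that the comparison logic can be exercised without a particular iconv installation.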