[PATCH v3] Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]

Thu Jun 25 08:32:06 GMT 2020

Carlos O'Donell <carlos@redhat.com> wrote:

> On 6/23/20 5:30 AM, Mike FABIAN via Libc-alpha wrote:
>> I skipped unassigned characters and ended the range at U+D7FF even
>> though U+D7FC .. U+D7FF are currently unassigned. But because
>> the script now skips the unassigned characters it is OK to end the range
>> for the Hangul Jamo at U+D7FF, if these characters ever happen to get
>> assigned in future, they will probably be Hangul Jamo because of
>> Blocks.txt.
>> 
>> After each Unicode update, manual checking is good anyway, but ending
>> the range in the script at U+D7FF seems more likely to do the right
>> thing already if these characters ever get assigned.
>> 
>
> You change the generator but all the files that are generated by the
> generator do not appear regenerated in your patch.
>
> Can you please post exactly what you plan to commit, that way we can
> review the results?

The patch did contain everything.

> I'm expecting:
> - generator change.

This part is the generator change:

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 17b99ee88d..11c906b92f 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -258,7 +258,13 @@ def process_width(outfile, ulines, elines, plines):
         if key in width_dict:
             del width_dict[key] # default width is 1
     for key in list(range(0x1160, 0x1200)):
-        width_dict[key] = 0
+        # Hangul jungseong and jongseong:
+        if key in unicode_utils.UNICODE_ATTRIBUTES:
+            width_dict[key] = 0
+    for key in list(range(0xD7B0, 0xD800)):
+        # Hangul jungseong and jongseong:
+        if key in unicode_utils.UNICODE_ATTRIBUTES:
+            width_dict[key] = 0
     for key in list(range(0x3248, 0x3250)):
         # These are “A” which means we can decide whether to treat them
         # as “W” or “N” based on context:
@@ -327,6 +333,7 @@ if __name__ == "__main__":
         help='The Unicode version of the input files used.')
     ARGS = PARSER.parse_args()
 
+    unicode_utils.fill_attributes(ARGS.unicode_data_file)
     with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE:
         UNICODE_DATA_LINES = UNIDATA_FILE.readlines()
     with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE:
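
To illustrate the effect of the new `if key in ...` check, here is a minimal standalone sketch. It is not the code from utf8_gen.py: the set `assigned` is a hypothetical stand-in for unicode_utils.UNICODE_ATTRIBUTES, filled by hand from the two ranges that show up in the charmap diff in this mail, and `zero_width_keys` is a helper name made up for this example.

```python
# Hypothetical stand-in for unicode_utils.UNICODE_ATTRIBUTES: the code
# points between U+D7B0 and U+D7FF that are currently assigned, taken
# from the ranges in the charmap diff (D7B0..D7C6 and D7CB..D7FB).
assigned = set(range(0xD7B0, 0xD7C7)) | set(range(0xD7CB, 0xD7FC))


def zero_width_keys(start, stop, assigned):
    """Return {code point: 0} for every assigned code point in [start, stop)."""
    return {key: 0 for key in range(start, stop) if key in assigned}


# The loop range ends at U+D7FF (stop is exclusive), but only assigned
# code points receive a width entry.
width_dict = zero_width_keys(0xD7B0, 0xD800, assigned)

print(len(width_dict))       # number of code points set to width 0
print(0xD7C7 in width_dict)  # unassigned gap, so no entry
print(0xD7FC in width_dict)  # unassigned tail of the block, so no entry
```

If U+D7FC .. U+D7FF are assigned in a later Unicode version, they would be picked up by the same range without touching the script.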

> - all files updated with date changes.

The UTF-8 file in charmaps is the only generated file that changed, and
only in its WIDTH section:

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0
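
The reason two lines are added instead of one is that U+D7C7 .. U+D7CA are unassigned and therefore skipped, which splits the run. The following sketch shows how consecutive code points with the same width can be collapsed into such range lines. It is an assumption-laden illustration, not the actual output code of utf8_gen.py: `format_width_ranges` is a made-up name, the separator is assumed to be a tab, and only 4-digit (BMP) code points are handled.

```python
def format_width_ranges(width_dict):
    """Collapse runs of consecutive code points that share a width into
    charmap-style WIDTH lines (sketch; BMP code points only)."""
    lines = []
    keys = sorted(width_dict)
    i = 0
    while i < len(keys):
        width = width_dict[keys[i]]
        j = i
        # Extend the run while the next key is adjacent and has the same width.
        while (j + 1 < len(keys)
               and keys[j + 1] == keys[j] + 1
               and width_dict[keys[j + 1]] == width):
            j += 1
        if i == j:
            lines.append('<U{:04X}>\t{}'.format(keys[i], width))
        else:
            lines.append('<U{:04X}>...<U{:04X}>\t{}'.format(
                keys[i], keys[j], width))
        i = j + 1
    return lines


# Width entries for the assigned code points only (hand-built from the
# ranges in the charmap diff; U+D7C7..U+D7CA are left out as unassigned).
example = {key: 0 for key in range(0xD7B0, 0xD7C7)}
example.update({key: 0 for key in range(0xD7CB, 0xD7FC)})

for line in format_width_ranges(example):
    print(line)
```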


> - some files have more than date changes.

No other files are changed and the UTF-8 file in charmaps does not
contain a generation date.

> This way we keep the generated files consistent.

-- 
Mike FABIAN <mfabian@redhat.com>
Lack of sleep is the enemy of good work.