[PATCH v3] Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]

Carlos O'Donell carlos@redhat.com
Thu Jun 25 12:35:00 GMT 2020


On 6/25/20 4:32 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> さんはかきました:
> 
>> On 6/23/20 5:30 AM, Mike FABIAN via Libc-alpha wrote:
>>> I skipped unassigned characters and ended the range at U+D7FF even
>>> though U+D7FC .. U+D7FF are currently unassigned. But because
>>> the script now skips the unassigned characters it is OK to end the range
>>> for the Hangul Jamo at U+D7FF, if these characters ever happen to get
>>> assigned in future, they will probably be Hangul Jamo because of
>>> Block.txt.
>>>
>>> After each Unicode update, manual checking is good anyway, but ending
>>> the range in the script at U+D7FF seems more likely to do the right
>>> thing already if these characters ever get assigned.
>>>
>>
>> You change the generator but all the files that are generated by the
>> generator do not appear regenerated in your patch.
> 
>> Can you please post exactly what you plan to commit, that way we can
>> review the results?
> 
> The patch did contain everything.

My expectation was that we do this:

(a) Make all your source changes.
(b) Regenerate.
cd ~/src/glibc/localedata/unicode-gen
make
make-intstall

When you do it this way *all* files get updates.

You're cheating here because you know apriori which files need updating
and you appear to have carried out that updated by hand. I dislike doing
this kind of "by hand" process and was expecting the more general update
as I describe above. This ensures that all files are at the same "regenerated"
state and the date for all of them in their datestamps matches the date
you regenerated everything that is generated by unicode-gen. That means a
reviewer doesn't have to verify that no other files might have been changed
by your code change.

I'm requesting that we follow a simpler procedure that regenerates the files
using the expected process, and post patches based on that.

Does that clarify my position?

Please feel free to justify a different position.
 
>> I'm expecting:
>> - generator change.
> 
> This part is the generator change:
> 
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 17b99ee88d..11c906b92f 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -258,7 +258,13 @@ def process_width(outfile, ulines, elines, plines):
>          if key in width_dict:
>              del width_dict[key] # default width is 1
>      for key in list(range(0x1160, 0x1200)):
> -        width_dict[key] = 0
> +        # Hangul jungseong and jongseong:
> +        if key in unicode_utils.UNICODE_ATTRIBUTES:
> +            width_dict[key] = 0
> +    for key in list(range(0xD7B0, 0xD800)):
> +        # Hangul jungseong and jongseong:
> +        if key in unicode_utils.UNICODE_ATTRIBUTES:
> +            width_dict[key] = 0

OK.

>      for key in list(range(0x3248, 0x3250)):
>          # These are “A” which means we can decide whether to treat them
>          # as “W” or “N” based on context:
> @@ -327,6 +333,7 @@ if __name__ == "__main__":
>          help='The Unicode version of the input files used.')
>      ARGS = PARSER.parse_args()
>  
> +    unicode_utils.fill_attributes(ARGS.unicode_data_file)

OK.

>      with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE:
>          UNICODE_DATA_LINES = UNIDATA_FILE.readlines()
>      with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE:
> 
>> - all files updated with date changes.
> 
> And the UTF-8 file in charmaps is the only file which changed, only in
> the WIDTH section:

If you ran make; make install; from unicode-gen all the other files would have
changed their timestamps to show that they were updated and had no functional
change. Those patches are missing. Without them I don't know that the generator
actually generated "zero functional changes" for those files.

> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index 14c5d4fa33..8cce47cd97 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -48920,6 +48920,8 @@ WIDTH
>  <UABE8>        0
>  <UABED>        0
>  <UAC00>...<UD7A3>      2
> +<UD7B0>...<UD7C6>      0
> +<UD7CB>...<UD7FB>      0

OK.

>  <UF900>...<UFA6D>      2
>  <UFA70>...<UFAD9>      2
>  <UFB1E>        0
> 
> 
>> - some files have more than date changes.
> 
> No other files are changed and the UTF-8 file in charmaps does not
> contain a generation date.
> 
>> This way we keep the generated files consistent.
> 


-- 
Cheers,
Carlos.



More information about the Libc-alpha mailing list