This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Locales: Use CLDR matching thousands separator


On 10/8/18 2:51 PM, Florian Weimer wrote:
> * Marko Myllynen:
> 
>> One perhaps related thing I noticed recently was that neither U+00A0 nor
>> U+202F are classified as whitespace characters. locales/i18n_ctype has
>> this definition (based on ISO/IEC 30112, see
>> http://www.open-std.org/jtc1/sc35/wg5/docs/30112d10.pdf document page 30):
>>
>> space /
>>    <U0009>..<U000D>;<U0020>;<U1680>;<U2000>..<U2006>;<U2008>..<U200A>;/
>>    <U2028>..<U2029>;<U205F>;<U3000>
>>
>> Looking at pages about whitespace characters
>> (https://en.wikipedia.org/wiki/Whitespace_character) and Unicode spaces
>> (http://jkorpela.fi/chars/spaces.html) it seems that a couple of other
>> Unicode space characters are also omitted from that list.
>>
>> Does anyone know whether there is a particular reason to omit U+00A0 and
>> U+202F and a few others from the above classification?
> 
> I think it is deliberate to get the right behavior from line-breaking
> algorithms.
> 

I forgot that we implement this ourselves in unicode_utils.py...

def is_space(code_point):
    '''Checks whether the character with this code point is a space'''
    # Don’t make U+00A0 a space. Non-breaking space means that all programs
    # should treat it like a punctuation character, not like a space.
    return (code_point == 0x0020 # ' '
            or code_point == 0x000C # '\f'
            or code_point == 0x000A # '\n'
            or code_point == 0x000D # '\r'
            or code_point == 0x0009 # '\t'
            or code_point == 0x000B # '\v'
            # Categories Zl, Zp, and Zs without mention of "<noBreak>"
            or (UNICODE_ATTRIBUTES[code_point]['name']
                and
                (UNICODE_ATTRIBUTES[code_point]['category'] in ['Zl', 'Zp']
                 or
                 (UNICODE_ATTRIBUTES[code_point]['category'] in ['Zs']
                  and
                  '<noBreak>' not in
                  UNICODE_ATTRIBUTES[code_point]['decomposition']))))

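For anyone who wants to try that rule without a glibc checkout, here is a
minimal sketch that applies the same test using Python's standard unicodedata
module in place of the UNICODE_ATTRIBUTES table. The names is_space_sketch,
ASCII_SPACES and I18N_CTYPE_SPACE are made up for this illustration (the last
one is transcribed from the i18n_ctype definition quoted above), and the
results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Code points from the "space" class in locales/i18n_ctype, as quoted above.
I18N_CTYPE_SPACE = (
    set(range(0x0009, 0x000D + 1))
    | {0x0020, 0x1680}
    | set(range(0x2000, 0x2006 + 1))
    | set(range(0x2008, 0x200A + 1))
    | {0x2028, 0x2029, 0x205F, 0x3000}
)

# The six ASCII characters that is_space() accepts unconditionally.
ASCII_SPACES = {0x0020, 0x000C, 0x000A, 0x000D, 0x0009, 0x000B}


def is_space_sketch(code_point):
    '''Approximate the is_space() rule using the stdlib Unicode database.'''
    if code_point in ASCII_SPACES:
        return True
    char = chr(code_point)
    category = unicodedata.category(char)
    if category in ('Zl', 'Zp'):
        return True
    # Zs characters count only if their decomposition is not tagged <noBreak>.
    return (category == 'Zs'
            and '<noBreak>' not in unicodedata.decomposition(char))


if __name__ == '__main__':
    for cp in (0x0020, 0x00A0, 0x2007, 0x202F, 0x3000):
        print('U+%04X %-22s %s'
              % (cp, unicodedata.name(chr(cp)), is_space_sketch(cp)))
```

With a reasonably recent Python this should report the no-break characters
U+00A0, U+2007 and U+202F as non-spaces, which lines up with the gaps in the
i18n_ctype list.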

And the properties support this:

DerivedCoreProperties.txt:00A0          ; Grapheme_Base # Zs       NO-BREAK SPACE
EastAsianWidth.txt:00A0;N           # Zs         NO-BREAK SPACE
PropList.txt:00A0          ; White_Space # Zs       NO-BREAK SPACE
UnicodeData.txt:00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
UnicodeData.txt:100A0;LINEAR B IDEOGRAM B151 HORN;Lo;0;L;;;;;N;;;;;

So it's a <noBreak> space.
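
To make the field layout explicit, here is a small sketch that splits the
UnicodeData.txt record for U+00A0 into its semicolon-separated fields and pulls
out the two that is_space() looks at. The helper name parse_record is
hypothetical; the field positions follow the UnicodeData.txt layout, where
field 2 is the general category and field 5 is the decomposition.

```python
# The U+00A0 record exactly as it appears in UnicodeData.txt.
RECORD = ('00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;'
          'NON-BREAKING SPACE;;;;')


def parse_record(line):
    '''Return the fields of one UnicodeData.txt line relevant to is_space().'''
    fields = line.split(';')
    return {
        'code_point': int(fields[0], 16),
        'name': fields[1],
        'category': fields[2],
        'decomposition': fields[5],
    }


attrs = parse_record(RECORD)
# Zs category, but the decomposition carries the <noBreak> tag.
print(attrs['category'], repr(attrs['decomposition']))
print('<noBreak>' in attrs['decomposition'])
```

The category alone (Zs) would make it a space; it is the <noBreak> tag in the
decomposition field that takes it out of the class.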

-- 
Cheers,
Carlos.

