This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Locales: Use CLDR matching thousands separator


Hi,

On 2018-10-08 21:55, Carlos O'Donell wrote:
> On 10/8/18 2:51 PM, Florian Weimer wrote:
>> * Marko Myllynen:
>>
>>> One perhaps related thing I noticed recently was that neither U+00A0 nor
>>> U+202F is classified as a whitespace character. locales/i18n_ctype has
>>> this definition (based on ISO/IEC 30112, see
>>> http://www.open-std.org/jtc1/sc35/wg5/docs/30112d10.pdf document page 30):
>>>
>>> space /
>>>    <U0009>..<U000D>;<U0020>;<U1680>;<U2000>..<U2006>;<U2008>..<U200A>;/
>>>    <U2028>..<U2029>;<U205F>;<U3000>
>>>
>>> Looking at pages about whitespace characters
>>> (https://en.wikipedia.org/wiki/Whitespace_character) and Unicode spaces
>>> (http://jkorpela.fi/chars/spaces.html) it seems that a couple of other
>>> Unicode space characters are also omitted from that list.
>>>
>>> Does anyone know whether there is a particular reason to omit U+00A0,
>>> U+202F, and a few others from the above classification?
>>
>> I think it is deliberate to get the right behavior from line-breaking
>> algorithms.
> 
> I forgot we implement unicode_utils.py...
> 
> def is_space(code_point):
>     '''Checks whether the character with this code point is a space'''
>     # Don’t make U+00A0 a space. Non-breaking space means that all programs
>     # should treat it like a punctuation character, not like a space.
> 
> And the properties support this:
> 
> DerivedCoreProperties.txt:00A0          ; Grapheme_Base # Zs       NO-BREAK SPACE
> EastAsianWidth.txt:00A0;N           # Zs         NO-BREAK SPACE
> PropList.txt:00A0          ; White_Space # Zs       NO-BREAK SPACE
> UnicodeData.txt:00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
> UnicodeData.txt:100A0;LINEAR B IDEOGRAM B151 HORN;Lo;0;L;;;;;N;;;;;

Thanks, so the above conforms to ISO/IEC 30112 and deliberately omits
no-break spaces, at least for the benefit of line-breaking algorithms.
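
To double-check which code points fall outside the class, here is a rough
sketch that transcribes the "space" ranges quoted above into Python. The
names SPACE_RANGES and in_space_class are made up for illustration; this is
not how glibc or unicode_utils.py implements the check.

```python
# Ranges transcribed from the "space" class in locales/i18n_ctype as quoted
# in this thread. SPACE_RANGES and in_space_class are illustrative names only.
SPACE_RANGES = [
    (0x0009, 0x000D),
    (0x0020, 0x0020),
    (0x1680, 0x1680),
    (0x2000, 0x2006),
    (0x2008, 0x200A),
    (0x2028, 0x2029),
    (0x205F, 0x205F),
    (0x3000, 0x3000),
]


def in_space_class(code_point):
    """Return True if code_point falls inside one of the quoted ranges."""
    return any(low <= code_point <= high for low, high in SPACE_RANGES)


if __name__ == "__main__":
    # U+00A0, U+2007 and U+202F are the characters whose UnicodeData.txt
    # decomposition is "<noBreak> 0020"; U+0020 is listed for comparison.
    for cp in (0x0020, 0x00A0, 0x2007, 0x202F):
        print("U+%04X in space class: %s" % (cp, in_space_class(cp)))
```

With the ranges as quoted, U+2007 (FIGURE SPACE) is left out along with
U+00A0 and U+202F, which would be consistent with the omission being about
no-break behavior.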

I wonder whether it would make sense to add a function that recognizes
non-breaking spaces, so that applications could treat something like
"123 456" as a single number, regardless of which no-break space variant
separates the digit groups.
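
To make the idea a bit more concrete, here is a minimal sketch of what such
a helper could look like at the application level, in Python. Both function
names are hypothetical and nothing like this exists in glibc. It assumes
that "no-break space" means a character whose UnicodeData.txt decomposition
is "<noBreak> 0020" (U+00A0, U+2007, U+202F), and it only handles
non-negative integers made of ASCII digits.

```python
import unicodedata


def is_no_break_space(ch):
    """Hypothetical helper: True if ch is a no-break variant of U+0020.

    Assumption: such characters are identified by the "<noBreak> 0020"
    compatibility decomposition recorded in UnicodeData.txt.
    """
    return unicodedata.decomposition(ch) == "<noBreak> 0020"


def parse_grouped_int(text):
    """Hypothetical helper: parse ASCII digit groups joined by no-break spaces.

    Returns the integer value, or None if text is not of that form.
    """
    # Map every no-break space variant to U+00A0, then split on it.
    normalized = "".join(
        "\u00a0" if is_no_break_space(ch) else ch for ch in text
    )
    groups = normalized.split("\u00a0")
    # Every group must be a non-empty run of ASCII digits.
    if not all(group.isascii() and group.isdigit() for group in groups):
        return None
    return int("".join(groups))


if __name__ == "__main__":
    for sample in ("123\u00a0456", "123\u202f456", "123 456"):
        print(repr(sample), "->", parse_grouped_int(sample))
```

In this sketch an ordinary U+0020 between the groups is rejected on purpose,
since a regular space does not signal that the groups belong together.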

I'm not sure at what level such an addition should ideally be made
(standards, GNU/glibc, a higher-level library, perhaps something else),
so I won't champion such a change any further.

Thanks,

-- 
Marko Myllynen

