This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Is it OK to write ASCII strings directly into locale source files?


* Carlos O'Donell:

> On 07/24/2017 01:05 PM, Florian Weimer wrote:
>> * Andreas Schwab:
>> 
>>> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>>>> the start of a code point and > the end of the code point.
>>>
>>> POSIX says "character in the portable character set" if you want to keep
>>> it portable.
>> 
>> But our locales only have to be compatible with our localedef, right?
>
> Should developers be able to write tools to the POSIX locale spec and parse
> our source locale definitions? Supporting more than just GNU/Linux? Do the
> BSDs share our locale definitions?

No, they don't.  For one thing, they have partially implemented %OB
(without fixing all the locales, creating inconsistencies).

> My only technical objection with writing straight UTF-8 is that it could
> lead to more mistakes, and Mike just found one in CLDR where an Arabic
> Farsi character was used incorrectly because it displayed the same glyph.
> It was caught when harmonizing with glibc where you have to write out the
> code points (Mike filed a bug upstream with CLDR).

Wasn't it caught by locale testing which revealed that the locale
wasn't compatible with ISO-8859-6?  That sanity check would still
apply to locale definitions written in UTF-8.
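
A rough sketch of that kind of check, in Python rather than in the
actual glibc test suite: it reports every character in a string that
has no encoding in a given charset.  The function name is made up, and
the YEH / FARSI YEH pair below is only one example of look-alike
characters, not necessarily the one from the CLDR bug.

```python
def not_in_charset(text, charset="iso8859_6"):
    """Return (index, char) pairs that cannot be encoded in charset.

    Hypothetical helper for illustration; this is not how the glibc
    locale tests are implemented.
    """
    bad = []
    for i, ch in enumerate(text):
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append((i, ch))
    return bad


# U+064A ARABIC LETTER YEH is part of ISO-8859-6.
print(not_in_charset("\u064a"))
# U+06CC ARABIC LETTER FARSI YEH is not, so it is reported.
print(not_in_charset("\u06cc"))
```

The check works the same way whether the string was read from a file
written in UTF-8 or decoded from <U…> notation first.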

If we are worried about this kind of problem, I think web browsers
have multi-script detection logic to deal with cross-script homographs
in IDNA labels.  I don't know how hard it would be to extract that
logic from there and run it on locale strings, for cross-verification.
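
To give an idea of what such a check does, here is a very crude
approximation in Python.  It is not the browser logic: Python's
unicodedata module has no script property, so this sketch guesses the
script from the first word of the Unicode character name, which is
only a heuristic.  The function names are made up.

```python
import unicodedata


def scripts_used(text):
    """Guess the set of scripts in text from Unicode character names.

    Heuristic only: takes the first word of each letter's name
    (e.g. "LATIN", "CYRILLIC", "ARABIC") as its script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed(text):
    """True if letters from more than one guessed script appear."""
    return len(scripts_used(text)) > 1


# Latin "p" followed by U+0430 CYRILLIC SMALL LETTER A.
print(looks_mixed("p\u0430ypal"))
```

Note that a check like this would not flag an Arabic/Farsi mix-up,
because both characters belong to the Arabic script; it only helps
with homographs that cross script boundaries.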

> My preference would be to start small, start using the POSIX portable
> character set to its maximum extent for all Latin-based languages,

I would still prefer the <U…> encoding for control characters which
are in the portable character set.  So I have to object to the
“maximum” part. :)
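
As an illustration of the rule I have in mind (printable ASCII written
directly, everything else, including '<', '>' and control characters,
written as a code point), a small Python sketch follows.  The function
name is made up, and the sketch ignores quoting and the locale file's
escape character, which a real converter would have to handle.  Using
eight hex digits for code points above U+FFFF is my assumption here.

```python
def to_locale_source(text):
    """Encode text as 'ASCII - [<>]' plus <U…> code points.

    Hypothetical helper: printable ASCII except '<' and '>' is copied
    through; control characters and everything outside ASCII become
    <UXXXX> (or <UXXXXXXXX> above U+FFFF, an assumption of this sketch).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F and ch not in "<>":
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("<U%04X>" % cp)
        else:
            out.append("<U%08X>" % cp)
    return "".join(out)


print(to_locale_source("a<b\t"))   # a<U003C>b<U0009>
print(to_locale_source("Mai \u00e4"))  # Mai <U00E4>
```

The tab in the first example is in the portable character set, but it
still comes out as <U0009>, which is the exception I am asking for.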

