This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Is it OK to write ASCII strings directly into locale source files?

From: Florian Weimer <fw at deneb dot enyo dot de>
To: Carlos O'Donell <carlos at redhat dot com>
Cc: Andreas Schwab <schwab at suse dot de>, Mike FABIAN <mfabian at redhat dot com>, libc-alpha at sourceware dot org
Date: Mon, 24 Jul 2017 23:13:13 +0200
Subject: Re: Is it OK to write ASCII strings directly into locale source files?
References: <s9d8tje9e1k.fsf@redhat.com> <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com> <mvmy3rdx577.fsf@suse.de> <87h8y13gvb.fsf@mid.deneb.enyo.de> <e43a088a-cb33-c322-7587-c20d993e7fa6@redhat.com>

* Carlos O'Donell:

> On 07/24/2017 01:05 PM, Florian Weimer wrote:
>> * Andreas Schwab:
>> 
>>> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>>>> the start of a code point and > the end of the code point.
>>>
>>> POSIX says "character in the portable character set" if you want to keep
>>> it portable.
>> 
>> But our locales only have to be compatible with our localedef, right?
>
> Should developers be able to write tools to the POSIX locale spec and parse
> our source locale definitions? Supporting more than just GNU/Linux? Do the
> BSDs share our locale definitions?

No, they don't.  For one thing, they have partially implemented %OB
(without fixing all the locales, creating inconsistencies).

> My only technical objection with writing straight UTF-8 is that it could
> lead to more mistakes, and Mike just found one in CLDR where an Arabic
> Farsi character was used incorrectly because it displayed the same glyph.
> It was caught when harmonizing with glibc where you have to write out the
> code points (Mike filed a bug upstream with CLDR).

Wasn't it caught by locale testing which revealed that the locale
wasn't compatible with ISO-8859-6?  That sanity check would still
apply to locale definitions written in UTF-8.
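A minimal sketch of that kind of sanity check, assuming the locale strings are available as decoded text. The helper name `chars_outside_charset` is hypothetical, and the sample characters are only an illustration of an Arabic/Farsi pair; the thread does not say which character was involved, and this is not the actual glibc test.

```python
def chars_outside_charset(text, charset="iso8859_6"):
    """Return the characters in `text` that `charset` cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# Illustrative pair (an assumption, not taken from the bug report):
# U+064A ARABIC LETTER YEH is in ISO-8859-6,
# U+06CC ARABIC LETTER FARSI YEH is not.
sample = "\u064a\u06cc"
for ch in chars_outside_charset(sample):
    print(f"not representable in ISO-8859-6: U+{ord(ch):04X}")
```

The check works on code points, so it gives the same answer whether the source file spells the string in UTF-8 or as code point symbols.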

If we are worried about this kind of problem, I think web browsers
have multi-script detection logic to deal with cross-script homographs
in IDNA labels.  I don't know how hard it would be to extract that
logic from there and run it on locale strings, for cross-verification.
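As a rough illustration only, and not the logic browsers actually use, a crude mixed-script check can be sketched with Python's standard `unicodedata` module. That module has no script property, so this sketch assumes the first word of a character's Unicode name is a usable stand-in for its script; the function names are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Approximate the set of scripts in `text` from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


def looks_mixed(text):
    """True if letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1


# Latin letters with one Cyrillic "a" (U+0430) mixed in.
print(looks_mixed("p\u0430ypal"))
print(looks_mixed("paypal"))
```

A check like this would not flag an Arabic letter swapped for its Farsi look-alike, since both belong to the Arabic script; it only helps with homographs that cross scripts.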

> My preference would be to start small, start using the POSIX portable
> character set to its maximum extent for all Latin-based languages,

I would still prefer the <U…> encoding for control characters which
are in the portable character set.  So I have to object to the
“maximum” part. :)
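To make the two positions concrete, here is a small sketch of a converter that writes printable ASCII directly, except for < and >, and spells everything else, control characters included, as a code point symbol. The function name `to_locale_source` is hypothetical, the 4-digit and 8-digit symbol forms are an assumption about what localedef accepts, and the quote and escape characters of a real locale source string are not handled.

```python
def to_locale_source(text):
    """Spell `text` with literal printable ASCII and <U....> symbols otherwise."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E and ch not in "<>":
            # Printable ASCII other than the code point delimiters.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Control characters, < and >, and non-ASCII in the BMP.
            out.append(f"<U{cp:04X}>")
        else:
            # Assumed 8-digit form for code points beyond the BMP.
            out.append(f"<U{cp:08X}>")
    return "".join(out)


print(to_locale_source("a\tb"))
print(to_locale_source("<caf\u00e9>"))
```

Under this rule a tab stays visible in the source as <U0009> instead of appearing as invisible whitespace.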

Follow-Ups:
- Re: Is it OK to write ASCII strings directly into locale source files?
  - From: Rafal Luzynski
- Re: Is it OK to write ASCII strings directly into locale source files?
  - From: Carlos O'Donell

References:
- Is it OK to write ASCII strings directly into locale source files?
  - From: Mike FABIAN
- Re: Is it OK to write ASCII strings directly into locale source files?
  - From: Carlos O'Donell
- Re: Is it OK to write ASCII strings directly into locale source files?
  - From: Andreas Schwab
- Re: Is it OK to write ASCII strings directly into locale source files?
  - From: Florian Weimer
- Re: Is it OK to write ASCII strings directly into locale source files?
  - From: Carlos O'Donell
