This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Is it OK to write ASCII strings directly into locale source files?


On 07/25/2017 10:12 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 07/25/2017 02:20 AM, Mike FABIAN wrote:
>>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> My only argument is that when you are forced to use <Uxxx> encoding it
>>>> is empirically less likely you'll make a mistake. Like reading a sentence
>>>> backwards to catch errors since it prevents your brain from filling in
>>>> the missing information.
>>>
>>> But there are also many mistakes because somebody mistyped code points.
>>> Several weird typos in things like month names look as if somebody
>>> mistyped code points.
>>
>> Ultimately I defer to your judgement as localedata maintainer to create
>> a workflow that is easy for you and benefits your work.
>>
>> However, I caution against throwing away the compatibility of our locales
>> with POSIX, which doesn't seem to allow UTF-8 in the specification.
> 
> It does, to some extent:
> 
> | A character in the portable character set can be represented by the
> | character itself, in which case the value of the character is
> | implementation-defined. (Implementations may allow other characters
> | to be represented as themselves, but such locale definitions are not
> | portable.)
> 
> You'll need a very hostile interpretation to say that this doesn't
> allow multi-byte character sequences in localedef input.

I see what you're saying, which is that we are *still* POSIX compliant,
but not portable?

I assume we are focusing on the "()" text, which allows some kind of escape
hatch outside of the portable character set and allows us to use UTF-8?

> But I found this in the guts of localedef:
> 
> 	      /* The standards leave it up to the implementation to decide
> 		 what to do with character which stand for themself.  We
> 		 could jump through hoops to find out the value relative to
> 		 the charmap and the repertoire map, but instead we leave
> 		 it up to the locale definition author to write a better
> 		 definition.  We assume here that every character which
> 		 stands for itself is encoded using ISO 8859-1.  Using the
> 		 escape character is allowed.  */
> 
> So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
> bootstrapping problem.
 
We could just assume UTF-8, but yes, it looks like this needs a little bit
more looking into.
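
To make the difference concrete, here is a minimal sketch of what the two
assumptions mean for a non-ASCII character written literally in a locale
source file. This is not localedef's actual code. The helper names
decode_latin1 and decode_utf8_short are hypothetical, and the UTF-8 helper
only handles 1- and 2-byte sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* ISO 8859-1 assumption: every input byte is the code point with the
   same value.  Returns the number of code points written to OUT.  */
size_t
decode_latin1 (const unsigned char *in, size_t len, uint32_t *out)
{
  for (size_t i = 0; i < len; i++)
    out[i] = in[i];
  return len;
}

/* UTF-8 assumption, restricted to 1- and 2-byte sequences (illustrative
   only: no 3- or 4-byte sequences, no overlong-encoding checks).
   Returns the number of code points written to OUT, or (size_t) -1 if
   the input is malformed or outside what this sketch supports.  */
size_t
decode_utf8_short (const unsigned char *in, size_t len, uint32_t *out)
{
  size_t n = 0;
  for (size_t i = 0; i < len; )
    {
      if (in[i] < 0x80)
        {
          /* Plain ASCII byte: identical under both assumptions.  */
          out[n++] = in[i++];
        }
      else if ((in[i] & 0xE0) == 0xC0
               && i + 1 < len
               && (in[i + 1] & 0xC0) == 0x80)
        {
          /* Two-byte sequence: 110xxxxx 10xxxxxx.  */
          out[n++] = ((uint32_t) (in[i] & 0x1F) << 6) | (in[i + 1] & 0x3F);
          i += 2;
        }
      else
        return (size_t) -1;
    }
  return n;
}
```

For the bytes 0xC3 0xA9 (the UTF-8 encoding of U+00E9), decode_latin1
yields two code points, U+00C3 and U+00A9, while decode_utf8_short yields
the single code point U+00E9. ASCII-only input decodes identically under
both, which is why sticking to the portable character set sidesteps the
question.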

Either way, I support using the portable character set today, and that's
a step forward.

-- 
Cheers,
Carlos.

