This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.


Re: [Bug localedata/22387] Replace unicode sequences <Uxxxx> for characters inside the ASCII printable range

From: Keld Simonsen <keld at keldix dot com>
To: carlos at redhat dot com <sourceware-bugzilla at sourceware dot org>
Cc: libc-locales at sourceware dot org
Date: Wed, 8 Nov 2017 01:54:04 +0200
Subject: Re: [Bug localedata/22387] Replace unicode sequences <Uxxxx> for characters inside the ASCII printable range
Authentication-results: sourceware.org; auth=none
References: <bug-22387-716@http.sourceware.org/bugzilla/> <bug-22387-716-zQzEl5jTq7@http.sourceware.org/bugzilla/>

Hi

I am not sure if I can write a very strong case against using ASCII in strings.
I have no practical experience with problems, but I see a number of possible
conflicts. You could also say that because of the design used until now,
where all our locales have been character coding independent, we have
not seen any problems!

I am the editor of ISO 14652 and ISO 30112, and of the 100-page annex in
the POSIX standard that originally introduced the codeset independent locales
for POSIX and thus Linux. 14652 and 30112 are the standards that define many
of the extensions from POSIX that we use in glibc for i18n and l10n.
From an architectural point of view I would really like us to keep glibc locales
character coding independent, so our locales can be used without change
on all systems that adhere to those standards.

But even if we restrict ourselves to only look at glibc implementations,
using coding dependent locales may cause problems. Not on everyday Linux
systems, where we mostly operate in UTF-8, and sometimes in other coded
character sets, but on other systems.

gcc and glibc are probably the most widely ported C compiler and C library in the world.
Some of the platforms they have been ported to run in non-ASCII-compatible
environments. I think this includes

    MS Windows, which uses UTF-16, and which now includes an Ubuntu system
    Mac OS / iOS, which use UTF-16, and where gcc/glibc ports exist
    EBCDIC machines, where gcc/glibc ports exist - they run many banking and aviation systems
    Embedded systems, where many kinds of character sets are used
    Older systems in East Asia, still using older East Asian 14-bit character sets
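
To illustrate what "coding dependent" means here, the following is a minimal
Python sketch (not glibc code). It encodes the same literal character under a
few coded character sets. The codec list is an assumption chosen for
illustration; cp037 stands in for EBCDIC, which has many other variants.

```python
# Encode one character under several coded character sets and show the bytes.
# The codec names below are Python codec names picked for illustration only;
# cp037 is just one EBCDIC variant.
CODECS = ["ascii", "utf-8", "utf-16-le", "cp037"]

def byte_values(ch):
    """Return a mapping of codec name -> hex string of the encoded bytes."""
    return {codec: ch.encode(codec).hex() for codec in CODECS}

for codec, value in byte_values("a").items():
    print(codec, value)
# ascii 61
# utf-8 61
# utf-16-le 6100
# cp037 81
```

A literal "a" stored in a locale source file is one of these byte sequences,
whereas a symbolic name such as <U0061> or <a> names the character without
committing to any of them.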

I do see the need for better-looking locales. They would be easier to write and debug.
Therefore I propose that we use the mnemonics defined in ISO 14652/ISO 30112, at least
for the ASCII characters. These were also used in the original locales that I wrote
and that Ulrich Drepper used for his initial work on glibc. At some point Ulrich decided
to use Uxxxx mnemonics, which made locales less readable. I do agree that using Uxxxx
is a good solution for the characters that are not known to everybody, such as Chinese,
Korean and Japanese characters. This gives everybody in the world a chance to
work on locales using these characters, which in our modern world actually means
all locales in the world, as we all may use full UTF-8 or the like.
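
As a rough sketch of what such a rewrite could look like (Python, not part of
glibc), the hypothetical helper below turns <Uxxxx> into a letter mnemonic such
as <a>. It handles only ASCII letters, because <a> is the form named in this
thread. Mnemonics for digits and punctuation are deliberately left out, since
their exact names would have to be taken from the standards. Everything else,
including CJK code points, stays as <Uxxxx>.

```python
import re
import string

# Matches the <Uxxxx> notation with exactly four hex digits.
UXXXX = re.compile(r"<U([0-9A-Fa-f]{4})>")

def to_letter_mnemonics(text):
    """Rewrite <Uxxxx> as <letter> for ASCII letters; leave all else unchanged.

    Hypothetical helper for illustration; the function name and the
    letters-only scope are assumptions, not an existing glibc tool.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in string.ascii_letters:
            return "<" + ch + ">"
        return match.group(0)  # keep non-letters and non-ASCII as <Uxxxx>
    return UXXXX.sub(repl, text)

print(to_letter_mnemonics("<U0061><U0062><U4E2D>"))
# <a><b><U4E2D>
```

The result is still resolved through a charmap, so the locale source stays
independent of the character coding while the ASCII letters become readable.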

best regards
Keld

On Fri, Nov 03, 2017 at 03:12:25AM +0000, carlos at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22387
> 
> Carlos O'Donell <carlos at redhat dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |carlos at redhat dot com
> 
> --- Comment #8 from Carlos O'Donell <carlos at redhat dot com> ---
> (In reply to keld@keldix.com from comment #7)
> > I think we should not do this, as it would make locales unusable
> > with ebcdic encodings. I am also unsure how it will work with utf-16.
> 
> Please provide a justification for this requirement to support EBCDIC and
> UTF-16, included systems that would be impacted today by this change.
> 
> I spoke with Ulrich Drepper directly, and he did point out that the design idea
> behind using <Uxxxx> sequences was indeed to support the locales on systems
> that had other encodings like EBCDIC, but with the rise of UTF-8 as the defacto
> standard, no such systems have really materialized.
> 
> > I propose you use better mnemonics for the ascii range, such as <a> for a,
> > etc.  That is, use the mnemonics defined in the POSIX standard for the ascii
> > range.
> 
> I disagree strongly with this, why use '<a>' instead of 'a'? Please provide
> strong rationale for why we should keep using the <Uxxxx> format.
> 
> -- 
> You are receiving this mail because:
> You are on the CC list for the bug.
