This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Should glibc provide a builtin C.UTF-8 locale?


On Tue, Oct 27, 2015 at 01:22:34PM +0100, Mike FABIAN wrote:
> keld@keldix.com wrote:
> 
> > On Wed, Oct 21, 2015 at 02:34:22PM -0400, Carlos O'Donell wrote:
> >> On 10/21/2015 01:49 PM, Mike Frysinger wrote:
> >>
> >> Looks good to me.
> >> 
> >> Fedora support is here:
> >> http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?id=0457f649e3fe6299efe384da13dfc923bbe65707
> >> 
> >> Patch for C.UTF-8:
> >> https://lists.fedoraproject.org/pipermail/glibc/2015-September/000081.html
> >> 
> >> The patch lists some of the differences between debian and fedora.
> >> 
> >> We are testing C.UTF-8 in rawhide and I expect Mike FABIAN wants to
> >> submit this upstream at some point to become the official C.UTF-8
> >> locale, but we'd also like to harmonize with the distros if there
> >> is anything we aren't doing right.
> >
> > What is the intended difference between this locale and the i18n locale
> > of ISO TR 30112?
> 
> Do you mean the difference between C.UTF-8 and the "i18n.UTF-8" locale
> defined in glibc/localedata/locales/i18n?
> 
> (i.e. an i18n.UTF-8 locale created with
> 
>     localedef --no-archive -ci i18n -f UTF-8 /usr/lib/locale/i18n.utf8
> )
> 
> "C.UTF-8" tries to be the same as C/POSIX wherever possible,
> it only uses UTF-8 encoding and extends the supported
> character range to all of Unicode. "i18n.UTF-8" has more
> differences to C/POSIX than that.
> 
> Differences in detail are:
> 
> LC_CTYPE
>    almost the same
>    - C.UTF-8 just copies the LC_CTYPE from "i18n" (which is kept
>      in sync with the latest Unicode release using some scripts) and
>      adds "translit_combining".

So C.UTF-8 will have the full character-class data? I'm in favor of
that but just want to clarify, since omitting it would also be
possible.

> LC_COLLATE
>    - C.UTF-8 sorts via Unicode code point order.
>      For the ASCII range that is the same order as the C/POSIX locale
>      so this gives the traditional sorting for the ASCII range.
>    - i18n sorts according to ISO 14651 which is default Unicode
>      collation order. That happens to be the same in the ASCII range.
>      Locales like en_GB and en_US which just copy the ISO/IEC 14651
>      template
>      
>         LC_COLLATE
>         % Copy the template from ISO/IEC 14651
>         copy "iso14651_t1"
>         END LC_COLLATE
> 
>      sort lower case letters before upper case letters. But
>      i18n.UTF-8 does some extra stuff before copying the template
>      which fixes this. So i18n.UTF-8 sorts the same way in the
>      ASCII range.
>      
>      Do we care how a C.UTF-8 locale sorts outside of the ASCII range?
>      If we do not care much, Unicode code point order is an easy
>      way to get a consistent order. On the other hand this order
>      is sometimes not really useful. Would it be better to sort
>      according to the default Unicode collation order for characters
>      outside of the ASCII range???

I strongly prefer codepoint ordering for anything that's nominally a
"C" locale, since I expect that's what users want. And since we need
an ordering for illegal sequences too, it makes sense to just do code
unit ordering (i.e. plain strcmp) because that's simultaneously very
fast, compatible with codepoint ordering, and yields a total order on
illegal sequences too.
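
To illustrate why plain strcmp is compatible with codepoint ordering, here is a minimal sketch in C. The helper names (utf8_encode, byte_order, codepoint_order) are made up for this example and are not glibc interfaces; the encoder rejects NUL and surrogates only to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one Unicode scalar value as NUL-terminated UTF-8 into buf
   (at least 5 bytes). Returns the number of bytes written before the
   terminator; on error returns 0 and leaves buf as an empty string. */
static size_t utf8_encode(uint32_t cp, char *buf)
{
    unsigned char *p = (unsigned char *)buf;
    size_t n;

    assert(buf != NULL);
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        p[0] = 0;               /* NUL, out of range, or surrogate */
        return 0;
    }
    if (cp < 0x80) {
        p[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {
        p[0] = (unsigned char)(0xC0 | (cp >> 6));
        p[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        p[0] = (unsigned char)(0xE0 | (cp >> 12));
        p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        p[0] = (unsigned char)(0xF0 | (cp >> 18));
        p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        p[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    p[n] = 0;
    return n;
}

/* Sign (-1, 0, 1) of strcmp applied to the UTF-8 encodings of a and b.
   strcmp compares bytes as unsigned char, which is what makes this work. */
static int byte_order(uint32_t a, uint32_t b)
{
    char ea[5], eb[5];
    int r;

    utf8_encode(a, ea);
    utf8_encode(b, eb);
    r = strcmp(ea, eb);
    return (r > 0) - (r < 0);
}

/* Sign (-1, 0, 1) of the numeric comparison of two code points. */
static int codepoint_order(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}
```

For valid code points the two functions agree; for example, U+FFFF (EF BF BF) sorts before U+10000 (F0 90 80 80) both ways. Byte strings that are not valid UTF-8 still compare consistently under strcmp, which is the total order on illegal sequences mentioned above.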

> LC_MONETARY
>    - C.UTF-8 tries to agree with C/POSIX as much as possible
>      and thus uses "USD" for int_curr_symbol, "$" for currency_symbol,
>      and "." for mon_decimal_point.

This is incorrect, at least based on the spec. C requires the values
for int_curr_symbol and currency_symbol to be "" in the C locale (7.11
Localization <locale.h>, paragraph 2). I think the values you cited
are from en_US.
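
For reference, this is easy to check with localeconv(). A small sketch, assuming only standard C; the function name c_locale_currency_is_empty is made up for this example. Note that the struct lconv comments in the standard also give "" for mon_decimal_point in the C locale.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return 1 if the "C" locale reports empty monetary strings, as the
   struct lconv comments in the C standard specify, and 0 otherwise.
   Side effect: switches the whole program to the "C" locale. */
static int c_locale_currency_is_empty(void)
{
    const char *name = setlocale(LC_ALL, "C");
    const struct lconv *lc;

    assert(name != NULL);       /* the "C" locale is always available */
    (void)name;
    lc = localeconv();
    return strcmp(lc->int_curr_symbol, "") == 0
        && strcmp(lc->currency_symbol, "") == 0
        && strcmp(lc->mon_decimal_point, "") == 0;
}
```

A C.UTF-8 locale that wants to match C/POSIX would be expected to return 1 from the same checks after setlocale(LC_ALL, "C.UTF-8"), on systems where that locale is installed.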

>    - i18n.UTF-8 uses "XDR" for int_curr_symbol
>      (special drawing right issued by the International Monetary Fund,
>      see https://en.wikipedia.org/wiki/ISO_4217), "¤" for currency_symbol,
>      and "," for mon_decimal_point
> 
> LC_NUMERIC
>    - C.UTF-8 uses "." for decimal_point (like C/POSIX)
>    - i18n.UTF-8 uses "," for decimal_point

Ugh. Let's not go there though. Of course C.UTF-8 needs to match the C
locale in this regard.
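
To make the practical impact concrete: number parsing depends on decimal_point, so a "," there would break any program that reads "3.5" with strtod. A rough sketch, where parses_fully_in_c_locale is a made-up helper name:

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Return 1 if strtod, running in the "C" locale, consumes the whole
   string s as a number, and 0 if it stops early or consumes nothing.
   Side effect: sets LC_NUMERIC to "C". */
static int parses_fully_in_c_locale(const char *s)
{
    char *end;

    assert(s != NULL);
    setlocale(LC_NUMERIC, "C");
    (void)strtod(s, &end);
    return end != s && *end == '\0';
}
```

With decimal_point equal to ".", "3.5" is consumed completely while "3,5" stops at the comma; a locale with "," as decimal_point would reverse that.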

> LC_TIME
>    - C.UTF-8 uses the same as C/POSIX
>    - i18n.UTF-8 uses something "more international, less English"
>      for example the month and day names are just numbers in
>      i18n.UTF-8:
>         $ LC_ALL=C.UTF-8 date
>         Tue Oct 27 10:28:44 CET 2015
>         $ LC_ALL=i18n.UTF-8 date
>         3 10 27 10:29:02 CET 2015
>         $ 
> 
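
The C/POSIX half of the LC_TIME comparison above can be reproduced with strftime. This is only a sketch: example_date_in_c_locale is a made-up name, the time zone is left out so the result does not depend on TZ, and tm_wday is filled in by hand rather than computed.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <time.h>

/* Format 2015-10-27 10:28:44 with the "C" locale's day and month names.
   Returns a pointer to a static buffer (empty string on failure).
   Side effect: sets LC_TIME to "C". */
static const char *example_date_in_c_locale(void)
{
    static char buf[64];
    struct tm tm;

    memset(&tm, 0, sizeof tm);
    tm.tm_year = 2015 - 1900;
    tm.tm_mon  = 9;             /* October; months count from 0 */
    tm.tm_mday = 27;
    tm.tm_wday = 2;             /* Tuesday; days count from Sunday = 0 */
    tm.tm_hour = 10;
    tm.tm_min  = 28;
    tm.tm_sec  = 44;

    setlocale(LC_TIME, "C");
    if (strftime(buf, sizeof buf, "%a %b %d %H:%M:%S %Y", &tm) == 0)
        buf[0] = '\0';
    assert(strlen(buf) < sizeof buf);
    return buf;
}
```

In the C locale this yields "Tue Oct 27 10:28:44 2015"; the %a and %b conversions are the ones that i18n.UTF-8 replaces with numbers.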
> LC_MESSAGES
>    - C.UTF-8 uses the same as C/POSIX
>      (for example yesexpr "^[yY]" and noexpr "^[nN]")
>    - i18n.UTF-8 apparently tries to avoid English
>      (for example yesexpr  "^[+1]" and noexpr "^[-0]")
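
For anyone who has not used yesexpr/noexpr directly, a program can consume them roughly as follows. This is a sketch using POSIX nl_langinfo and regcomp; classify_reply and classify_reply_in_c_locale are made-up names.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <regex.h>

/* Return 1 if reply matches the regular expression pattern, else 0
   (a pattern that fails to compile counts as no match). */
static int matches(const char *pattern, const char *reply)
{
    regex_t re;
    int hit;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    hit = regexec(&re, reply, 0, NULL, 0) == 0;
    regfree(&re);
    return hit;
}

/* Classify reply using the current locale's LC_MESSAGES expressions:
   1 = affirmative, 0 = negative, -1 = neither. */
static int classify_reply(const char *reply)
{
    assert(reply != NULL);
    if (matches(nl_langinfo(YESEXPR), reply))
        return 1;
    if (matches(nl_langinfo(NOEXPR), reply))
        return 0;
    return -1;
}

/* Same as classify_reply, but forces the "C" locale first. */
static int classify_reply_in_c_locale(const char *reply)
{
    setlocale(LC_ALL, "C");
    return classify_reply(reply);
}
```

With the C/POSIX expressions quoted above, "yes" is affirmative and "No" is negative, while "+1" matches neither; under i18n.UTF-8 the situation would be the other way around.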

What about error messages? This is probably off-topic, but it might be
nice if i18n used the actual errno macro names as strings ("ENOENT",
etc.) if it doesn't already.

Rich

