This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Should glibc provide a builtin C.UTF-8 locale?


Hi Mike

Yes, the ISO TR 30112 i18n locale and the glibc i18n locale are essentially
the same, as ISO 30112 builds on a somewhat older copy of the glibc i18n locale.
In turn, the glibc i18n locale was built on the ISO TR 14652 i18n
locale, so this is a fruitful relationship. ISO 30112 is the follow-up
spec to ISO 14652, and ISO 30112 has caught up with some glibc development.

Thanks for the documentation. Is it available somewhere? Do you have a link?

Best regards
keld

On Tue, Oct 27, 2015 at 01:22:34PM +0100, Mike FABIAN wrote:
> keld@keldix.com wrote:
> 
> > On Wed, Oct 21, 2015 at 02:34:22PM -0400, Carlos O'Donell wrote:
> >> On 10/21/2015 01:49 PM, Mike Frysinger wrote:
> >>
> >> Looks good to me.
> >> 
> >> Fedora support is here:
> >> http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?id=0457f649e3fe6299efe384da13dfc923bbe65707
> >> 
> >> Patch for C.UTF-8:
> >> https://lists.fedoraproject.org/pipermail/glibc/2015-September/000081.html
> >> 
> >> The patch lists some of the differences between Debian and Fedora.
> >> 
> >> We are testing C.UTF-8 in rawhide and I expect Mike FABIAN wants to
> >> submit this upstream at some point to become the official C.UTF-8
> >> locale, but we'd also like to harmonize with the distros if there
> >> is anything we aren't doing right.
> >
> > What is the intended difference between this locale and the i18n locale
> > of ISO TR 30112?
> 
> Do you mean the difference between C.UTF-8 and the "i18n.UTF-8" locale
> defined in glibc/localedata/locales/i18n?
> 
> (i.e. an i18n.UTF-8 locale created with
> 
>     localedef --no-archive -ci i18n -f UTF-8 /usr/lib/locale/i18n.utf8
> )
> 
> "C.UTF-8" tries to be the same as C/POSIX wherever possible,
> and differs only in using the UTF-8 encoding and extending the
> supported character range to all of Unicode. "i18n.UTF-8" differs
> from C/POSIX in more ways than that.
> 
> Differences in detail are:
> 
> LC_CTYPE
>    almost the same
>    - C.UTF-8 just copies the LC_CTYPE from "i18n" (which is kept
>      in sync with the latest Unicode release using some scripts) and
>      adds "translit_combining".
> 
> LC_COLLATE
>    - C.UTF-8 sorts via Unicode code point order.
>      For the ASCII range that is the same order as the C/POSIX locale
>      so this gives the traditional sorting for the ASCII range.
>    - i18n sorts according to ISO 14651, which is the default Unicode
>      collation order. That happens to be the same in the ASCII range.
>      Locales like en_GB and en_US, which just copy the ISO/IEC 14651
>      template
>      
>         LC_COLLATE
>         % Copy the template from ISO/IEC 14651
>         copy "iso14651_t1"
>         END LC_COLLATE
> 
>      sort lower case letters before upper case letters. But
>      i18n.UTF-8 does some extra stuff before copying the template
>      which fixes this. So i18n.UTF-8 sorts the same way in the
>      ASCII range.
>      
>      Do we care how a C.UTF-8 locale sorts outside of the ASCII range?
>      If we do not care much, Unicode code point order is an easy
>      way to get a consistent order. On the other hand this order
>      is sometimes not really useful. Would it be better to sort
>      according to the default Unicode collation order for characters
>      outside of the ASCII range?
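
To make the code point ordering described above concrete, here is a minimal Python sketch. It is not glibc's implementation: it assumes Python's built-in string comparison (which compares Unicode code points) as a stand-in for collation under C.UTF-8, and the sample strings are made up for illustration.

```python
# Illustration only: Python compares str values by Unicode code point,
# which is used here as a stand-in for C.UTF-8 collation.
words = ["b", "A", "a", "B", "\u00e9", "z", "Z"]  # "\u00e9" is e with acute

# Code point order: ASCII upper case, then ASCII lower case, then non-ASCII.
by_code_point = sorted(words)
print(by_code_point)  # ['A', 'B', 'Z', 'a', 'b', 'z', 'é']

# Comparing the UTF-8 encoded bytes gives the same order, because UTF-8
# preserves code point order under bytewise comparison.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
assert by_utf8_bytes == by_code_point
```

The accented letter ends up after "z", which is an instance of the "sometimes not really useful" ordering outside the ASCII range.
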
> 
> LC_MONETARY
>    - C.UTF-8 tries to agree with C/POSIX as much as possible
>      and thus uses "USD" for int_curr_symbol, "$" for currency_symbol,
>      and "." for mon_decimal_point.
>    - i18n.UTF-8 uses "XDR" for int_curr_symbol
>      (special drawing right issued by the International Monetary Fund,
>      see https://en.wikipedia.org/wiki/ISO_4217), "¤" for currency_symbol,
>      and "," for mon_decimal_point
> 
> LC_NUMERIC
>    - C.UTF-8 uses "." for decimal_point (like C/POSIX)
>    - i18n.UTF-8 uses "," for decimal_point
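
The decimal_point difference above can be sketched in Python. The first part asks the running C library for the "C" locale's decimal point through the standard locale module. The second part only emulates a "," decimal point with a hypothetical helper, format_decimal, since it cannot be assumed that an i18n.UTF-8 locale is installed.

```python
import locale

# The "C" locale is always available; ask the C library for its
# decimal point via the standard localeconv() interface.
locale.setlocale(locale.LC_NUMERIC, "C")
c_decimal_point = locale.localeconv()["decimal_point"]
print(c_decimal_point)  # .


def format_decimal(value, decimal_point, digits=2):
    """Format value with a chosen decimal point (hypothetical helper)."""
    return f"{value:.{digits}f}".replace(".", decimal_point)


print(format_decimal(1234.5, c_decimal_point))  # 1234.50
print(format_decimal(1234.5, ","))              # 1234,50
```
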
> 
> LC_TIME
>    - C.UTF-8 uses the same as C/POSIX
>    - i18n.UTF-8 uses something "more international, less English"
>      for example the month and day names are just numbers in
>      i18n.UTF-8:
>         $ LC_ALL=C.UTF-8 date
>         Tue Oct 27 10:28:44 CET 2015
>         $ LC_ALL=i18n.UTF-8 date
>         3 10 27 10:29:02 CET 2015
>         $ 
> 
> LC_MESSAGES
>    - C.UTF-8 uses the same as C/POSIX
>      (for example yesexpr "^[yY]" and noexpr "^[nN]")
>    - i18n.UTF-8 apparently tries to avoid English
>      (for example yesexpr "^[+1]" and noexpr "^[-0]")
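
A small Python sketch of how the yesexpr and noexpr patterns quoted above classify an answer. The patterns are POSIX extended regular expressions in the locale source; these simple bracket expressions mean the same thing in Python's re module, which is used here as a stand-in. The classify helper and the sample answers are hypothetical.

```python
import re

# Patterns as quoted in the message above.
YESEXPR = {"C.UTF-8": r"^[yY]", "i18n.UTF-8": r"^[+1]"}
NOEXPR = {"C.UTF-8": r"^[nN]", "i18n.UTF-8": r"^[-0]"}


def classify(answer, locale_name):
    """Return 'yes', 'no', or 'unknown' for an answer (hypothetical helper)."""
    if re.search(YESEXPR[locale_name], answer):
        return "yes"
    if re.search(NOEXPR[locale_name], answer):
        return "no"
    return "unknown"


print(classify("Yes", "C.UTF-8"))     # yes
print(classify("Yes", "i18n.UTF-8"))  # unknown
print(classify("+", "i18n.UTF-8"))    # yes
print(classify("0", "i18n.UTF-8"))    # no
```
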
> 
> LC_PAPER
>    No difference between C.UTF-8 and i18n.UTF-8, both use
>    A4 paper, just like C/POSIX.
> 
> LC_NAME
>    No difference between C.UTF-8, i18n.UTF-8, and C/POSIX
> 
> LC_ADDRESS
>    No difference between C.UTF-8, i18n.UTF-8, and C/POSIX
> 
> LC_TELEPHONE
>    C.UTF-8 has tel_int_fmt "+%c %a %l" (same as C/POSIX)
>    i18n.UTF-8 has tel_int_fmt "+%c +a +l" <-- that looks like a bug, doesn't it?
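
To illustrate why the second format string looks suspicious, here is a rough Python sketch of expanding tel_int_fmt. It assumes only the %c (country code), %a (area code) and %l (local number) escapes; expand_tel_fmt is a hypothetical helper, not a glibc interface, and the number parts are made up.

```python
def expand_tel_fmt(fmt, country, area, local):
    """Expand a subset of tel_int_fmt escapes (hypothetical helper).

    Only %c, %a and %l are handled; all other text is copied unchanged.
    """
    values = {"c": country, "a": area, "l": local}
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in values:
            out.append(values[fmt[i + 1]])
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)


# Made-up number parts, for illustration only.
print(expand_tel_fmt("+%c %a %l", "49", "30", "1234567"))  # +49 30 1234567
print(expand_tel_fmt("+%c +a +l", "49", "30", "1234567"))  # +49 +a +l
```

With "+a" and "+l" in place of "%a" and "%l", the area code and local number are never substituted and the literal text is emitted instead.
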
> 
> LC_MEASUREMENT
>    No difference between C.UTF-8, i18n.UTF-8, and C/POSIX,
>    all use metric measurements.
> 
> -- 
> Mike FABIAN <mfabian@redhat.com>

