Bug 16621 - C.UTF-8 locales should be regarded like C w.r.t. $LANGUAGE precedence
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.35
Importance: P2 normal
Target Milestone: 2.39
Assignee: Florian Weimer
URL:
Keywords:
Duplicates: 29777
Depends on: 17318
Blocks:
Reported: 2014-02-21 12:47 UTC by Vincent Lefèvre
Modified: 2024-09-06 14:57 UTC (History)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Vincent Lefèvre 2014-02-21 12:47:41 UTC
Scripts tend to use LC_ALL=C.UTF-8 instead of LC_ALL=C to get UTF-8 support while still behaving in a locale-independent manner. However, $LANGUAGE is still taken into account by glibc:

xvii% LANGUAGE=fr_FR LC_ALL=C.UTF-8 cp
cp: opérande de fichier manquant
Saisissez « cp --help » pour plus d'informations.
xvii% LANGUAGE=fr_FR LC_ALL=C cp
cp: missing file operand
Try 'cp --help' for more information.

Both should have output in English.

Glibc should apply the same rules with C.UTF-8 as with C locales.

Also reported in Debian:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719590
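
The difference between the two runs above can be modelled in a few lines of C. This is only a simplified sketch and not glibc's actual code: the function name effective_message_locale is hypothetical, and it assumes that LANGUAGE overrides the locale name unless the name is exactly "C".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the precedence rule described in this report.
   `locale` is the name selected through LC_ALL/LC_MESSAGES/LANG and
   `language` is the value of $LANGUAGE (NULL if unset).  */
static const char *
effective_message_locale (const char *locale, const char *language)
{
  /* Only the exact name "C" suppresses LANGUAGE in this model.  */
  if (strcmp (locale, "C") == 0)
    return locale;

  /* For any other name, including "C.UTF-8", a non-empty LANGUAGE wins.  */
  if (language != NULL && language[0] != '\0')
    return language;

  return locale;
}
```

Under this model, effective_message_locale ("C", "fr_FR") yields "C", whereas effective_message_locale ("C.UTF-8", "fr_FR") yields "fr_FR", which matches the French output shown above.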
Comment 1 Andreas Schwab 2014-02-21 12:58:11 UTC
There is no C.UTF-8 locale in glibc.
Comment 2 Vincent Lefèvre 2014-02-21 13:41:23 UTC
(In reply to Andreas Schwab from comment #1)
> There is no C.UTF-8 locale in glibc.

That's strange, because on the Subversion mailing list, it was regarded as standard. Subversion works well only in UTF-8 locales, and the suggested solution was to use C.UTF-8: http://mail-archives.apache.org/mod_mbox/subversion-users/201307.mbox/%3C51DC54AD.7010601@wandisco.com%3E
Comment 3 Nick Coghlan 2014-08-27 12:59:06 UTC
I have filed bug #17318 requesting the inclusion of a C.UTF-8 locale in upstream glibc (actually prompted by https://bugzilla.redhat.com/show_bug.cgi?id=902094, but I found this bug while looking to see if anyone else had already made the request)
Comment 4 Mike Frysinger 2015-08-29 20:41:37 UTC
glibc doesn't provide a C.UTF-8, so any bug report about it makes no sense
Comment 5 Nick Coghlan 2015-08-30 04:38:16 UTC
While it's true that glibc itself doesn't provide a C.UTF-8 locale, does that really make this bug report invalid?

The Debian-derived family of distros default to adding a C.UTF-8 locale at the distro level, but it doesn't quite work as expected, as it's missing some of the special casing afforded the default C locale. The specific one covered by this BZ is the fact that LC_ALL=C will make glibc ignore the LANGUAGE setting, but LC_ALL=C.UTF-8 doesn't.

Another possible way of phrasing the request would be for all "C.*" locales to ignore the LANGUAGE setting the same way the unmodified "C" locale does, rather than special casing "C.UTF-8". I'm not *personally* aware of any such locales in widespread use other than "C.UTF-8", but that doesn't mean there aren't any.
Comment 6 Mike Frysinger 2015-08-30 05:59:32 UTC
(In reply to Nick Coghlan from comment #5)

bugs in distros aren't really the domain of glibc upstream.  if you think the proposal in bug 17318 has limitations or you have concerns, you should post it there or the mailing list thread on the topic.
Comment 7 Nick Coghlan 2015-08-30 06:19:55 UTC
I filed #17318 because Fedora doesn't want to add C.UTF-8 independently of upstream glibc (at least in part to avoid inconsistencies like the one reported here).

However, I also interpret the current bug closure as categorically rejecting the notion of treating C.UTF-8 the same as the C locale when it comes to the LANGUAGE variable, which doesn't seem like the correct outcome.

If I've misunderstood what "CLOSED INVALID" means and the intent is for bug #17318 to include the behaviour requested here, then yes, I would consider that a reasonable way to resolve this issue.
Comment 8 nl6720 2022-08-02 10:36:56 UTC
glibc 2.35 has C.UTF-8 now, so it would make sense to reopen this.
Comment 9 Vincent Lefèvre 2022-08-02 11:38:39 UTC
Reopening as glibc 2.35 has C.UTF-8 (comment 8).
Comment 10 Florian Weimer 2022-08-02 13:40:30 UTC
C is special-cased here:

  /* Ignore LANGUAGE and its system-dependent analogon if the locale is set
     to "C" because
     1. "C" locale usually uses the ASCII encoding, and most international
        messages use non-ASCII characters. These characters get displayed
        as question marks (if using glibc's iconv()) or as invalid 8-bit
        characters (because other iconv()s refuse to convert most non-ASCII
        characters to ASCII). In any case, the output is ugly.
     2. The precise output of some programs in the "C" locale is specified
        by POSIX and should not depend on environment variables like
        "LANGUAGE" or system-dependent information.  We allow such programs
        to use gettext().  */
  if (strcmp (locale, "C") == 0)
    return locale;

It looks like the locale name is not embedded in the locale data itself, so identifying C.UTF-8 based on its name might not be so simple here.
Comment 11 Vincent Lefèvre 2022-08-02 14:17:45 UTC
(In reply to Florian Weimer from comment #10)
> C is special-cased here:
[...]
>   if (strcmp (locale, "C") == 0)
>     return locale;
> 
> It looks like the locale name is not embedded in the locale data itself, so
> identifying C.UTF-8 based on its name might not be so simple here.

Do you mean that locale is not the string "C.UTF-8" (while setlocale() returns the expected "C.UTF-8")?
Comment 12 Florian Weimer 2022-08-02 14:30:57 UTC
(In reply to Vincent Lefèvre from comment #11)
> (In reply to Florian Weimer from comment #10)
> > C is special-cased here:
> [...]
> >   if (strcmp (locale, "C") == 0)
> >     return locale;
> > 
> > It looks like the locale name is not embedded in the locale data itself, so
> > identifying C.UTF-8 based on its name might not be so simple here.
> 
> Do you mean that locale is not the string "C.UTF-8" (while setlocale()
> returns the expected "C.UTF-8")?

There are aliases such as "C.utf8", which we would have to recognize as well. Not doing that would make things worse, I think.
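
One name-based way to cover such aliases without enumerating them is to match on the "C" prefix rather than on a complete name. The following is a minimal sketch of that idea; the helper name is_c_like_locale is hypothetical, and the sketch assumes that every name of the form "C" or "C.<anything>" should be treated alike. It is not claimed to be the code used in glibc.

```c
#include <stdbool.h>

/* Hypothetical helper: true for "C" and for any "C.<encoding>" spelling,
   such as "C.UTF-8" or "C.utf8".  Names that merely start with the
   letter C (for example "Czech") do not match.  */
static bool
is_c_like_locale (const char *name)
{
  return name[0] == 'C' && (name[1] == '\0' || name[1] == '.');
}
```

Because the check only looks at the first two characters, it does not depend on how the codeset part is spelled, but it also cannot catch an alias that does not begin with "C.".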
Comment 13 Vincent Lefèvre 2022-08-02 15:15:59 UTC
(In reply to Florian Weimer from comment #12)
> There are aliases such as "C.utf8", which we would have to recognize as
> well. Not doing that would make things worse, I think.

OK, though I would say that this is mainly useful for scripts, which could always use "C.UTF-8" for better portability. BTW, Debian currently doesn't support aliases for "C.UTF-8" (at least by default).
Comment 14 Florian Weimer 2022-11-14 13:01:30 UTC
*** Bug 29777 has been marked as a duplicate of this bug. ***
Comment 15 Florian Weimer 2023-09-04 13:32:52 UTC
Fix pushed for 2.39:

commit 2897b231a6b71ee17d47d3d63f1112b2641a476c
Author: Bruno Haible <bruno@clisp.org>
Date:   Mon Sep 4 15:31:36 2023 +0200

    intl: Treat C.UTF-8 locale like C locale (BZ# 16621)
    
    The wiki page https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
    says that "Setting LC_ALL=C.UTF-8 will ignore LANGUAGE just like it
    does with LC_ALL=C." This patch implements it.
    
    * intl/dcigettext.c (guess_category_value): Treat C.<encoding> locale
    like the C locale.
    
    Reviewed-by: Florian Weimer <fweimer@redhat.com>

I'm going to post my test, too.
Comment 16 Sourceware Commits 2023-11-20 15:03:52 UTC
The master branch has been updated by Florian Weimer <fw@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c52c2c32db15aba8bbe1a0b4d3235f97d9c1a525

commit c52c2c32db15aba8bbe1a0b4d3235f97d9c1a525
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Nov 20 16:03:11 2023 +0100

    intl: Add test case for bug 16621
    
    Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Comment 17 Florian Weimer 2024-09-06 14:57:13 UTC
Additional fix for 2.39:

commit d0aefec49941cf6d97e2244d6aa20bafc26d5942
Author: Bruno Haible <bruno@clisp.org>
Date:   Tue Dec 12 09:45:16 2023 +0100

    intl: Treat C.UTF-8 locale like C locale, part 2 (BZ# 16621)
    
    The previous commit was incomplete: gettext() still returns a translation
    if the file /usr/share/locale/C/LC_MESSAGES/<domain>.mo exists. This patch
    prohibits the translation also in this case.
    
    * gettext-runtime/intl/dcigettext.c (DCIGETTEXT): Treat C.<encoding> locale
    like the C locale.
    
    Reviewed-by: Florian Weimer <fweimer@redhat.com>
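
The behaviour described in this second commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the patched DCIGETTEXT code: the function lookup_message and its catalog_entry parameter are hypothetical stand-ins for the result of a catalog search.

```c
#include <stddef.h>

/* Hypothetical model: `catalog_entry` is the translation found in a
   message catalog for `msgid`, or NULL if none was found.  */
static const char *
lookup_message (const char *locale, const char *msgid,
                const char *catalog_entry)
{
  /* For "C" and "C.<encoding>", return the untranslated msgid even if
     a catalog entry exists.  */
  if (locale[0] == 'C' && (locale[1] == '\0' || locale[1] == '.'))
    return msgid;

  /* Otherwise prefer the catalog entry and fall back to the msgid.  */
  return catalog_entry != NULL ? catalog_entry : msgid;
}
```

In this model a catalog hit is ignored for "C.UTF-8" exactly as it is for "C", which is the case the commit message says the first patch missed.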