Bug 18412 - 'locale -a' outputs encoding errors
Summary: 'locale -a' outputs encoding errors
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: libc
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2015-05-13 18:50 UTC by Paul Eggert
Modified: 2015-05-22 22:29 UTC
fweimer: security-


Attachments
Remove badly-encoded lines from locale.aliases (1010 bytes, application/octet-stream)
2015-05-14 01:38 UTC, Paul Eggert

Description Paul Eggert 2015-05-13 18:50:39 UTC
Here are the symptoms, on Fedora 21:

$ locale | grep LC_ALL
LC_ALL=en_US.UTF-8
$ locale -a | grep en_US
Binary file (standard input) matches

The problem is that 'locale -a' is attempting to output the string "bokmål", but it does so using a Latin-1 encoding, which is an encoding error in a UTF-8 locale.  'locale -a' should always output properly-encoded text.
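
The encoding error can be reproduced without glibc. The following is a minimal Python sketch, not part of the original report; the function name badly_encoded_lines and the sample bytes are illustrative assumptions. It shows that the Latin-1 byte 0xE5 for 'å' is not valid UTF-8 on its own, which is presumably why grep reports a binary file.

```python
def badly_encoded_lines(data: bytes, encoding: str = "utf-8") -> list:
    """Return the lines of `data` that cannot be decoded with `encoding`."""
    bad = []
    for line in data.splitlines():
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append(line)
    return bad


# Hypothetical excerpt of 'locale -a' output; the last alias is Latin-1 encoded.
sample = b"en_US\nen_US.utf8\nbokm\xe5l\n"

print(badly_encoded_lines(sample))             # lines that are invalid as UTF-8
print(badly_encoded_lines(sample, "latin-1"))  # every byte sequence is valid Latin-1
```

Running the same check over the real output of 'locale -a' (captured as bytes) would list any alias that breaks tools expecting UTF-8.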
Comment 1 joseph@codesourcery.com 2015-05-13 23:01:42 UTC
The libc-alpha discussion starts at
<https://sourceware.org/ml/libc-alpha/2015-01/msg00379.html>; my comments are at
<https://sourceware.org/ml/libc-alpha/2015-01/msg00382.html>.
Comment 2 Paul Eggert 2015-05-14 01:38:01 UTC
Created attachment 8315
Remove badly-encoded lines from locale.aliases
Comment 3 Paul Eggert 2015-05-14 01:40:01 UTC
Thanks for reminding me about the old discussion; I'd forgotten it, and filed this bug report only because I personally ran into the bug again.  Let's bite the bullet and fix it; I created a proposed patch.  We should stamp out those poorly-encoded locale aliases anyway.
Comment 4 Carlos O'Donell 2015-05-14 05:06:56 UTC
(In reply to Paul Eggert from comment #3)
> Thanks for reminding me about the old discussion; I'd forgotten it, and
> filed this bug report only because I personally ran into the bug again. 
> Let's bite the bullet and fix it; I created a proposed patch.  We should
> stamp out those poorly-encoded locale aliases anyway.

I wanted to have some kind of compatibility for the removed entries, but the more I think about it the more work it will be. We'll need another configuration file to handle the compat entries, and to allow users to remove them, and that's a terrible solution.

Could we skip printing non-ASCII aliases? Document that as the expected behaviour? Then add comments in locale.alias saying these two aliases are not printed, but can be used for old program compatibility?
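
The idea of skipping non-ASCII aliases could be sketched as follows. This is Python for illustration only, not the C code of the locale program; printable_aliases and the sample alias list are hypothetical.

```python
def printable_aliases(aliases):
    """Keep only the aliases that consist entirely of ASCII bytes."""
    return [alias for alias in aliases if all(byte < 0x80 for byte in alias)]


# Hypothetical alias names as raw bytes; two of them contain Latin-1 bytes.
aliases = [b"bokmal", b"bokm\xe5l", b"french", b"fran\xe7ais"]

for alias in printable_aliases(aliases):
    print(alias.decode("ascii"))
```

Under this approach the Latin-1 aliases would stay usable as locale names but would never appear in the listing.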
Comment 5 Paul Eggert 2015-05-14 06:01:48 UTC
> Could we skip printing non-ASCII aliases? Document that as the expected
> behaviour?

We could, if someone wanted to implement all that.  But it'll be better simply to remove the two bad aliases, as that'll be easier to document and so will be easier on the users.  We survived just fine when we removed obsolete Norse aliases before (LC_ALL='no@bokmal' -- remember that?) and we'll survive just fine removing these two bad aliases too.
Comment 6 Carlos O'Donell 2015-05-14 06:27:52 UTC
(In reply to Paul Eggert from comment #5)
> > Could we skip printing non-ASCII aliases? Document that as the expected
> > behaviour?
> 
> We could, if someone wanted to implement all that.  But it'll be better
> simply to remove the two bad aliases, as that'll be easier to document and
> so will be easier on the users.  We survived just fine when we removed
> obsolete Norse aliases before (LC_ALL='no@bokmal' -- remember that?) and
> we'll survive just fine removing these two bad aliases too.

I guess you're right in that respect, it is simpler and easier to explain.

OK to commit from my end.

I'd post to libc-alpha and look for one more person to OK this change and then I'd consider it consensus to remove the non-ASCII alias.

Bonus if someone already gave consent in the original discussion, in which case I think we should fix it and move forward.
Comment 7 cvs-commit@gcc.gnu.org 2015-05-22 22:12:41 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  333e1ba4e53456a603621274177ae9393b9d5385 (commit)
      from  60dce8b9044155bb04eb310fb0fc5e9607b7d2e6 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=333e1ba4e53456a603621274177ae9393b9d5385

commit 333e1ba4e53456a603621274177ae9393b9d5385
Author: Paul Eggert <eggert@cs.ucla.edu>
Date:   Fri May 22 14:57:11 2015 -0700

    Remove obsolete aliases that broke 'locale -a'
    
    [BZ #18412]
    * intl/locale.alias: Remove obsolete aliases "bokmål" and "français"
    which caused 'locale -a' to output Latin-1 data in UTF-8 locales,
    breaking some applications that use 'locale -a' output.
    Change the encoding of this file from Latin-1 to ASCII to avoid
    other potential problems with people grepping this file.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog         |   10 ++++++++++
 NEWS              |    4 ++--
 intl/locale.alias |   13 +++++++++++--
 3 files changed, 23 insertions(+), 4 deletions(-)
Comment 8 Paul Eggert 2015-05-22 22:29:51 UTC
(In reply to Carlos O'Donell from comment #6)

> Bonus if someone already gave consent in the original discussion, in which
> case I think we should fix it and move forward.

I think that we have consensus enough, so I changed the new comments to use ASCII only (a bit safer if people are grepping locale.alias directly), installed it as commit 333e1ba4e53456a603621274177ae9393b9d5385, and am marking this bug as fixed.

Hmm, I see that <https://sourceware.org/bugzilla/show_bug.cgi?id=18412> got confused by the mixed-encoding patch and the web page is transliterating perfectly good characters like 'å' (U+00E5 LATIN SMALL LETTER A WITH RING ABOVE) in the patch to a blotch (U+FFFD REPLACEMENT CHARACTER) on the web. Oh well, the patch itself should be good.
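
The substitution described here is what a lenient UTF-8 decoder typically does with Latin-1 input. A small Python sketch, given as an illustration rather than a description of how the Bugzilla page is actually generated:

```python
# 'å' is U+00E5; in Latin-1 it is the single byte 0xE5.
latin1_bytes = "bokm\u00e5l".encode("latin-1")

# Decoding those bytes as UTF-8 fails on 0xE5; with errors="replace"
# the offending byte becomes U+FFFD REPLACEMENT CHARACTER.
shown = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)                    # raw bytes, including \xe5
print(shown == "bokm\ufffdl")          # whether the 'å' was replaced by U+FFFD
print(latin1_bytes.decode("latin-1"))  # decoding with the right encoding recovers the text
```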