This is the mail archive of the
mailing list for the glibc project.
Re: Output of `locale -a` could be in mixed encodings?
- From: Joseph Myers <joseph at codesourcery dot com>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, <libc-locales at sourceware dot org>
- Date: Wed, 21 Jan 2015 02:18:10 +0000
- Subject: Re: Output of `locale -a` could be in mixed encodings?
- References: <54BF0329 dot 5050604 at redhat dot com>
On Tue, 20 Jan 2015, Carlos O'Donell wrote:
> The problem then is that if you took that UTF8 converted name of
> `bokmål` and tried to call setlocale with that, it would fail.
> It fails because the name in UTF8 doesn't match the name in
> ISO-8859-1 that's stored as the alias or official locale name.
This could be a bug in setlocale.
POSIX says the locale name is a "character string", which is defined as a
sequence of multibyte characters. So arguably it should be interpreted in
the current locale's character set (and so work if the LC_CTYPE before
setlocale is that of a UTF-8 locale, fail if it's ASCII or ISO-8859-1).
Except that the statement about being a character string is not CX-shaded,
so it should not be taken as intending any semantics beyond those in ISO C,
and I don't see ISO C requiring any such thing. (That said, I think
interpreting the locale name in the current locale makes sense anyway, and
is at least consistent with ISO C, even if not required.)
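
To illustrate what "interpreting the locale name in the current locale" could mean, here is a minimal sketch, not what setlocale does today. The helper name locale_name_matches is hypothetical; it assumes the caller knows both the character set of the requested name and the one the alias was stored in, and it uses the standard iconv interface to convert before comparing.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of glibc: report whether REQUESTED
   (encoded in REQUESTED_CHARSET) names the same locale as STORED
   (encoded in STORED_CHARSET).  Returns 1 on a match, 0 on a mismatch,
   and -1 if the conversion is unavailable or fails.  */
int
locale_name_matches (const char *requested, const char *requested_charset,
                     const char *stored, const char *stored_charset)
{
  char buf[256];
  char *in = (char *) requested;   /* iconv does not modify the input */
  char *out = buf;
  size_t in_left = strlen (requested);
  size_t out_left = sizeof buf - 1; /* keep room for the terminator */

  /* iconv_open takes the target charset first, then the source.  */
  iconv_t cd = iconv_open (stored_charset, requested_charset);
  if (cd == (iconv_t) -1)
    return -1;

  size_t rc = iconv (cd, &in, &in_left, &out, &out_left);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return -1;

  *out = '\0';
  return strcmp (buf, stored) == 0;
}
```

In a real implementation the requested charset would presumably come from nl_langinfo (CODESET) under the LC_CTYPE in effect before the setlocale call. With that, the UTF-8 bytes "bokm\xc3\xa5l" would match an alias stored as the ISO-8859-1 bytes "bokm\xe5l", whereas a plain strcmp of the two byte strings reports a mismatch.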
Now, we should also probably say that all non-ASCII locale names are
deprecated (so this would just be a matter of adding a few more aliases
for this locale using different encodings). And then we could say that
the locale utility doesn't output any non-ASCII locale names - as long as
each locale has a valid ASCII name, I think that's conforming to POSIX.
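
As a rough sketch of that filtering rule (the function name is hypothetical and this is not the locale utility's actual code), a name could be printed only if every byte is 7-bit ASCII:

```c
#include <stdbool.h>

/* Hypothetical helper: return true if every byte of NAME is 7-bit
   ASCII, i.e. the name reads the same in any ASCII-compatible
   character set.  */
bool
locale_name_is_ascii (const char *name)
{
  for (const unsigned char *p = (const unsigned char *) name; *p != '\0'; p++)
    if (*p > 0x7f)
      return false;
  return true;
}
```

Under this check a name such as "nb_NO.utf8" would be listed, while the ISO-8859-1 alias "bokm\xe5l" would be skipped.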
In fact, these aliases are already deprecated (locale.alias says "This
file is obsolete ... Nobody should rely on the names defined here").
It's also the case that there's an existing weak deprecation of non-UTF-8
locales (in the sense that every locale with a non-UTF-8 character set is
supposed to have a corresponding locale with UTF-8 character set - if any
don't, that's a bug unless there's some other reason for the locale to be
deprecated whatever the character set - and the threshold for adding any
new non-UTF-8 locales should be higher than for adding new UTF-8 locales).
> language | Norwegian, Bokm<E5>l
That part of the output, however, should clearly be output in the user's
locale character set - not in the character set of the locale in question.
Joseph S. Myers