This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
- From: "bruno at clisp dot org" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Sat, 09 Apr 2016 16:07:25 +0000
- Subject: [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
- Auto-submitted: auto-generated
- References: <bug-19932-131 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
Bruno Haible <bruno at clisp dot org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bruno at clisp dot org
--- Comment #2 from Bruno Haible <bruno at clisp dot org> ---
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal
Assume this is indeed to be considered a bug. Then we need to change the
character encoding that glibc associates with the C locale - because the
mbrtowc behaviour depends on (and must remain consistent with) the character
encoding of the locale. This character encoding, nl_langinfo(CODESET) or
equivalently $(locale charmap), is currently defined as
  $ LC_ALL=C locale charmap
  ANSI_X3.4-1968
ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding.
To fix this bug, this encoding would need to be changed to an 8-bit encoding.
The question is: Which encoding?
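For reference, the current state can be checked from C with a minimal sketch
like the one below. The helper names c_locale_codeset and convert_one_byte are
made up for illustration; only setlocale, nl_langinfo and mbrtowc are real
library calls. On a glibc that exhibits the reported behaviour,
convert_one_byte (0x80, &wc) should return (size_t) -1, while ASCII bytes
convert normally.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: switch to the C locale and return the name of
   its character encoding, i.e. the same string that
   "LC_ALL=C locale charmap" prints.  */
const char *
c_locale_codeset (void)
{
  setlocale (LC_ALL, "C");
  return nl_langinfo (CODESET);
}

/* Hypothetical helper: convert a single byte with mbrtowc in the C
   locale, starting from the initial conversion state.  Returns
   mbrtowc's result unchanged: 1 for a converted non-NUL byte, 0 for
   the NUL byte, (size_t) -1 for an encoding error.  */
size_t
convert_one_byte (unsigned char byte, wchar_t *out)
{
  char buf[1] = { (char) byte };
  mbstate_t state;

  memset (&state, 0, sizeof state);
  setlocale (LC_ALL, "C");
  return mbrtowc (out, buf, 1, &state);
}
```

Calling convert_one_byte for every value from 0x80 to 0xFF and comparing the
result against (size_t) -1 shows whether a given libc treats the high bytes
as encoding errors.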
> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale
On the other hand, it was always the intent of the glibc i18n design (around
1999-2001) that users would use UTF-8 locales and that all plain text would be
encoded in UTF-8. This has come true (around 2005).
The C locale is still used in scripts that need to handle text in unknown
encodings. It is important here that no byte value >= 128 is considered to
have special character properties (per <ctype.h>), because this would have
undesired effects when processing byte sequences in UTF-8 encoding - which,
as said above, is the vast majority of text on current systems.
Therefore, when changing the value of nl_langinfo(CODESET) and
$(locale charmap), it is essential that we preserve the property that
no byte value >= 128 has special character properties. Otherwise we introduce
trouble in user scripts that have been working fine for the last 10 years.
In particular, this excludes the ISO-8859-* encodings.
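The property to preserve can be written down as a small check. This is only a
sketch: count_classified_high_bytes is a hypothetical name, and the set of
<ctype.h> classes tested is my choice of what "special character properties"
covers. Any candidate encoding for the C locale would have to keep the result
at zero.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: count the byte values in 128..255 that belong
   to any of the tested <ctype.h> classes in the C locale.  */
int
count_classified_high_bytes (void)
{
  int count = 0;
  int c;

  setlocale (LC_ALL, "C");
  for (c = 128; c <= 255; c++)
    {
      /* Values in 0..UCHAR_MAX are valid arguments to these functions.  */
      if (isalpha (c) || isdigit (c) || isspace (c) || ispunct (c)
          || isprint (c) || iscntrl (c))
        count++;
    }
  return count;
}
```

With an ISO-8859-* charmap behind the locale, bytes such as 0xE9 would be
classified as alphabetic and the count would become nonzero.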
We need an encoding that formally has 256 characters, but the characters
>= 128 are to be considered non-graphic (and therefore also non-printing).
And the mapping done by mbrtowc should not map these characters to defined
Unicode characters; I think they would best map into Private Use Areas of
Unicode. Thus the mapping table would
- map x (0 <= x <= 0x7F) to Unicode x,
- map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).
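As a sketch, the table above amounts to the following pair of functions. The
names are hypothetical and this is not code from glibc; the inverse direction
is my addition, assuming that wcrtomb would have to undo the mapping. With the
offset 0xDF80, bytes 0x80..0xFF land on U+E000..U+E07F, the start of the
Private Use Area in the Basic Multilingual Plane.

```c
#include <stdint.h>

/* Hypothetical helper: map a byte to a code point as in the proposed
   table.  ASCII bytes map to themselves, bytes 0x80..0xFF map to
   0xDF80 + byte, i.e. U+E000..U+E07F.  */
uint32_t
proposed_byte_to_ucs (unsigned char byte)
{
  if (byte <= 0x7F)
    return byte;
  return UINT32_C (0xDF80) + byte;
}

/* Hypothetical inverse: returns the byte for a code point produced by
   proposed_byte_to_ucs, or -1 if the code point has no byte in this
   encoding.  */
int
proposed_ucs_to_byte (uint32_t ucs)
{
  if (ucs <= 0x7F)
    return (int) ucs;
  if (ucs >= UINT32_C (0xE000) && ucs <= UINT32_C (0xE07F))
    return (int) (ucs - UINT32_C (0xDF80));
  return -1;
}
```

Every byte round-trips through the two functions, and no byte >= 0x80 is
mapped to a Unicode character with defined properties.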
There is no such encoding in the lists of known encodings, neither in
$(locale -m) nor at
http://www.haible.de/bruno/charsets/conversion-tables/index.html.
Should we create a new encoding with this property?
Or change the mapping tables of ANSI_X3.4-1968?
Either approach will cause trouble for user programs:
- If we create a new encoding, software like telnet or ssh passes the
encoding to different machines, which will not recognize it.
- If we change the mapping tables of ANSI_X3.4-1968, existing uses of
"iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.