This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale

From: "bruno at clisp dot org" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Sat, 09 Apr 2016 16:07:25 +0000
Subject: [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
Auto-submitted: auto-generated
References: <bug-19932-131 at http dot sourceware dot org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Bruno Haible <bruno at clisp dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bruno at clisp dot org

--- Comment #2 from Bruno Haible <bruno at clisp dot org> ---
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal

Assume this is indeed to be considered a bug. Then we need to change the
character encoding that glibc associates with the C locale - because the
mbrtowc behaviour depends on (and must remain consistent with) the character
encoding of the locale. This character encoding, nl_langinfo(CODESET) or
equivalently $(locale charmap), currently is defined as

$ LC_ALL=C locale charmap
ANSI_X3.4-1968

ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding.
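
For reference, the two facts discussed here (the codeset name and the
mbrtowc result for a single byte) can be probed from C. This is only a
minimal sketch: the helper names decode_byte_in_c_locale and
c_locale_codeset are made up for illustration, and what comes back for
bytes in the range 128-255 depends on the glibc version in use.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode a single byte with mbrtowc in the "C" locale.
   Returns whatever mbrtowc returns; on success *wc holds the character.
   Note that it switches the whole process to the "C" locale. */
size_t decode_byte_in_c_locale(unsigned char byte, wchar_t *wc)
{
    mbstate_t state;
    char buf[1];

    setlocale(LC_ALL, "C");
    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    buf[0] = (char) byte;
    return mbrtowc(wc, buf, 1, &state);
}

/* Hypothetical helper: the character encoding name of the "C" locale,
   i.e. the same string that "LC_ALL=C locale charmap" prints. */
const char *c_locale_codeset(void)
{
    setlocale(LC_ALL, "C");
    return nl_langinfo(CODESET);
}
```

For an ASCII byte such as 'A' the first helper returns 1. For a byte such
as 0x80, the behaviour reported in this bug is a return value of
(size_t) -1.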

To fix this bug, this encoding would need to be changed to an 8-bit encoding.

The question is: Which encoding?

> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale

On the other hand, it was always the intent of the glibc i18n design (around
1999-2001) that users would use UTF-8 locales and that all plain text would be
encoded in UTF-8. This has come true (around 2005).

The C locale is still used in scripts that need to handle text in unknown
encodings. It is important here that no byte value >= 128 is considered to
have special character properties (per <ctype.h>), because this would have
undesired effects when processing byte sequences in UTF-8 encoding - which,
as said above, is the vast majority of text on current systems.

Therefore, when changing the value of nl_langinfo(CODESET) and
$(locale charmap), it is essential that we preserve the property that
no byte value >= 128 has special character properties. Otherwise we introduce
trouble in user scripts that have been working fine for the last 10 years.

In particular, this excludes the ISO-8859-* encodings.
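
The property "no byte value >= 128 has special character properties" can
be checked mechanically. The following is a rough sketch, not a
definitive test: the function name count_classified_high_bytes is
invented here, and the choice of <ctype.h> classes to test (alnum, punct,
space, print) is my assumption about what "special" should cover.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical check: count the byte values 128..255 that the given
   locale places in the alnum, punct, space or print class.
   Returns -1 if the locale cannot be selected. */
int count_classified_high_bytes(const char *locale_name)
{
    int count = 0;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    for (int c = 128; c <= 255; c++) {
        /* Values in 0..UCHAR_MAX are valid arguments to the is* functions. */
        if (isalnum(c) || ispunct(c) || isspace(c) || isprint(c))
            count++;
    }
    return count;
}
```

A result of 0 for "C" is the property to preserve. A locale based on an
ISO-8859-* encoding, where one is installed, would be expected to give a
nonzero count, which is why those encodings are ruled out.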

We need an encoding that formally has 256 characters, but in which the
characters in the range 128-255 are considered non-graphic (and therefore
also non-printing). And the mapping done by mbrtowc should not map these
characters to defined Unicode characters; I think they would best map into
the Private Use Areas of Unicode. Thus the mapping table would
- map x (0 <= x <= 0x7F) to Unicode x,
- map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).
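
Spelled out in C, the two rules above look as follows. This is only an
illustration of the proposed table with made-up function names
(proposed_c_locale_map and proposed_c_locale_unmap), not code from glibc.
With the offset 0xDF80, the bytes 0x80..0xFF land on U+E000..U+E07F, the
start of the Private Use Area in the Basic Multilingual Plane.

```c
#include <stdint.h>

/* Hypothetical forward mapping, as proposed in the comment:
   0x00..0x7F map to themselves, 0x80..0xFF map to 0xDF80 + x. */
uint32_t proposed_c_locale_map(unsigned char x)
{
    if (x <= 0x7F)
        return (uint32_t) x;
    return UINT32_C(0xDF80) + x;        /* 0x80 -> 0xE000, 0xFF -> 0xE07F */
}

/* Hypothetical inverse mapping. Returns the byte value 0..255, or -1 if
   the code point is not in the image of the forward mapping. */
int proposed_c_locale_unmap(uint32_t cp)
{
    if (cp <= 0x7F)
        return (int) cp;
    if (cp >= UINT32_C(0xE000) && cp <= UINT32_C(0xE07F))
        return (int) (cp - UINT32_C(0xDF80));
    return -1;
}
```

The mapping is injective, so every byte sequence survives a round trip
through wide characters unchanged.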

There is no such encoding among the existing encodings, neither in the
output of $(locale -m) nor in the list at
http://www.haible.de/bruno/charsets/conversion-tables/index.html.
Should we create a new encoding with this property?
Or change the mapping tables of ANSI_X3.4-1968?
Either approach will create trouble for user programs:
- If we create a new encoding, software like telnet or ssh passes the
  encoding to different machines, which will not recognize it.
- If we change the mapping tables of ANSI_X3.4-1968, existing uses of
  "iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.


References:
- [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale
  - From: eggert at gnu dot org
