Bug 19932 - C locale: mbrtowc returns (size_t) -1
Summary: C locale: mbrtowc returns (size_t) -1
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.22
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-09 08:14 UTC by Paul Eggert
Modified: 2023-06-28 20:12 UTC
CC List: 6 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
test mbrtowc in the C locale (448 bytes, text/x-csrc)
2016-04-09 08:14 UTC, Paul Eggert

Description Paul Eggert 2016-04-09 08:14:46 UTC
Created attachment 9173
test mbrtowc in the C locale

This follows up on a bug reported by Björn Jacke against GNU grep 2.23; see <http://bugs.gnu.org/23234>. The bug occurs because GNU grep uses mbrtowc to detect encoding errors, and because glibc mbrtowc reports an encoding error in the C locale when given a byte in the range 128-255 decimal.

It was always the intent of POSIX that all 256 bytes be valid characters in the C locale, as that was the traditional behavior. This wasn't clearly stated in the standard, but this is a bug that is planned to be fixed in a future version of POSIX; see <http://austingroupbugs.net/view.php?id=663#c2738> (2015-07-02). Glibc should be fixed to conform to this, i.e., mbrtowc should never return (size_t) -1 in the C locale.

I plan to work around this bug in the gnulib mbrtowc module, which should fix the grep bug; but this is a hack and will slow grep down a bit. The bug should be fixed in glibc.

Please see the attached program for an illustration of the bug. The program should output nothing and exit with status 0, but on glibc it outputs lines like the following:

byte 0x80 (0200) encoding error
byte 0x81 (0201) encoding error
...
byte 0xff (0377) encoding error

and exits with status 1.
Comment 1 jim@meyering.net 2016-04-09 15:50:09 UTC
Thank you for driving this.
Comment 2 Bruno Haible 2016-04-09 16:07:25 UTC
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal

Assume this is indeed to be considered a bug. Then we need to change the
character encoding that glibc associates with the C locale - because the
mbrtowc behaviour depends on (and must remain consistent with) the character
encoding of the locale. This character encoding, nl_langinfo(CODESET) or
equivalently $(locale charmap), currently is defined as

$ LC_ALL=C locale charmap
ANSI_X3.4-1968

ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding.

To fix this bug, this encoding would need to be changed to an 8-bit encoding.

The question is: Which encoding?

> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale

On the other hand, it was always the intent of the glibc i18n design (around
1999-2001) that users would use UTF-8 locales and that all plain text would be
encoded in UTF-8. This has come true (around 2005).

The C locale is still used in scripts that need to handle text in unknown
encodings. It is important here that no byte value >= 128 is considered to
have special character properties (per <ctype.h>), because this would have
undesired effects when processing byte sequences in UTF-8 encoding - which,
as said above, is the vast majority of text on current systems.

Therefore, when changing the value of nl_langinfo(CODESET) and
$(locale charmap), it is essential that we preserve the property that
no byte value >= 128 has special character properties. Otherwise we introduce
trouble in user scripts that have been working fine for the last 10 years.

In particular, this excludes the ISO-8859-* encodings.

We need an encoding that formally has 256 characters, but the characters
>= 128 are to be considered non-graphic (and therefore also non-printing).
And the mapping done by mbrtowc should not map these characters to defined
Unicode characters; I think they would best map into Private Use Areas of
Unicode. Thus the mapping table would
- map x (0 <= x <= 0x7F) to Unicode x,
- map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).

There is no such encoding among the list of encodings - $(locale -m) or
http://www.haible.de/bruno/charsets/conversion-tables/index.html.
Should we create a new encoding with this property?
Or change the mapping tables of ANSI_X3.4-1968?
Either approach will cause trouble for user programs:
- If we create a new encoding, software like telnet or ssh passes the
  encoding to different machines, which will not recognize it.
- If we change the mapping tables of ANSI_X3.4-1968, existing uses of
  "iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.
Comment 3 Paul Eggert 2016-04-09 17:56:51 UTC
(In reply to Bruno Haible from comment #2)
> Thus the mapping table would
> - map x (0 <= x <= 0x7F) to Unicode x,
> - map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).

Emacs maps the latter to 0x3FFF00+x (that is, 0x3FFF80 through 0x3FFFFF), I suppose under the theory that these integers are not Unicode code points, and thus won't be conflated with private-use Unicode characters. I suppose we could be "compatible" with Emacs. Are there other examples in the wild of this sort of thing, or is the Emacs precedent good enough?

> Should we create a new encoding with this property?
> Or change the mapping tables of ANSI_X3.4-1968?

It is a bit of a dilemma. Would it make sense to change iconv so that it recognizes values like 0x3FFF80 as corresponding to encoding-error bytes? iconv could then behave the same way as before, even if we change the mapping tables of ANSI_X3.4-1968.
Comment 4 Bruno Haible 2023-03-29 09:42:00 UTC
The POSIX bug has been fixed:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mbrtowc.html
now says
"[EILSEQ]
    An invalid character sequence is detected. [CX] [Option Start]  In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters. [Option End]"