Created attachment 9173 [details]
test mbrtowc in the C locale

This follows up on a bug reported by Björn Jacke against GNU grep 2.23; see <http://bugs.gnu.org/23234>. The bug occurs because GNU grep uses mbrtowc to detect encoding errors, and because glibc mbrtowc reports an encoding error in the C locale when given a byte in the range 128-255 decimal.

It was always the intent of POSIX that all 256 bytes be valid characters in the C locale, as that was the traditional behavior. This wasn't clearly stated in the standard, but this is a bug that is planned to be fixed in a future version of POSIX; see <http://austingroupbugs.net/view.php?id=663#c2738> (2015-07-02). Glibc should be fixed to conform to this, i.e., mbrtowc should never return (size_t) -1 in the C locale.

I plan to work around this bug in the gnulib mbrtowc module, which should fix the grep bug; but this is a hack and will slow grep down a bit. The bug should be fixed in glibc.

Please see the attached program for an illustration of the bug. The program should output nothing and exit with status 0, but on glibc it outputs lines like the following:

byte 0x80 (0200) encoding error
byte 0x81 (0201) encoding error
...
byte 0xff (0377) encoding error

and exits with status 1.
Thank you for driving this.
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal

Assume this is indeed to be considered a bug. Then we need to change the character encoding that glibc associates with the C locale, because the mbrtowc behaviour depends on (and must remain consistent with) the character encoding of the locale.

This character encoding, nl_langinfo(CODESET) or equivalently $(locale charmap), currently is defined as

$ LC_ALL=C locale charmap
ANSI_X3.4-1968

ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding. To fix this bug, this encoding would need to be changed to an 8-bit encoding. The question is: Which encoding?

> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale

On the other hand, it was always the intent of the glibc i18n design (around 1999-2001) that users would use UTF-8 locales and that all plain text would be encoded in UTF-8. This has come true (around 2005).

The C locale is still used in scripts that need to handle text in unknown encodings. It is important here that no byte value >= 128 is considered to have special character properties (per <ctype.h>), because this would have undesired effects when processing byte sequences in UTF-8 encoding, which, as said above, is the vast majority of text on current systems.

Therefore, when changing the value of nl_langinfo(CODESET) and $(locale charmap), it is essential that we preserve the property that no byte value >= 128 has special character properties. Otherwise we introduce trouble in user scripts that have been working fine for the last 10 years.

In particular, this excludes the ISO-8859-* encodings. We need an encoding that formally has 256 characters, but in which the characters >= 128 are considered non-graphic (and therefore also non-printing). And the mapping done by mbrtowc should not map these characters to defined Unicode characters; I think they would best map into Private Use Areas of Unicode.

Thus the mapping table would
- map x (0 <= x <= 0x7F) to Unicode x,
- map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).

There is no such encoding among the list of encodings - $(locale -m) or http://www.haible.de/bruno/charsets/conversion-tables/index.html.

Should we create a new encoding with this property? Or change the mapping tables of ANSI_X3.4-1968?

Either approach will create trouble for user programs:
- If we create a new encoding, software like telnet or ssh passes the encoding to different machines, which will not recognize it.
- If we change the mapping tables of ANSI_X3.4-1968, existing uses of "iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.
(In reply to Bruno Haible from comment #2)
> Thus the mapping table would
> - map x (0 <= x <= 0x7F) to Unicode x,
> - map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).

Emacs maps the latter to 0x3FFF00+x (that is, to the range 0x3FFF80..0x3FFFFF), I suppose under the theory that these integers are not Unicode code points, and thus won't be conflated with private-use Unicode characters. I suppose we could be "compatible" with Emacs. Are there other examples in the wild of this sort of thing, or is the Emacs precedent good enough?

> Should we create a new encoding with this property?
> Or change the mapping tables of ANSI_X3.4-1968?

It is a bit of a dilemma. Would it make sense to change iconv so that it recognizes values like 0x3FFF80 as corresponding to encoding-error bytes? iconv could then behave the same way as before, even if we change the mapping tables of ANSI_X3.4-1968.
The POSIX bug has been fixed: https://pubs.opengroup.org/onlinepubs/9699919799/functions/mbrtowc.html now says "[EILSEQ] An invalid character sequence is detected. [CX] [Option Start] In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters. [Option End]"