Bug 26034 - mbrtowc(&x, "\xfa\xbd\x83\x96\x80", 5, NULL) returns 5 instead of -1 with UTF-8 locale
Status: RESOLVED DUPLICATE of bug 2373
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2020-05-24 19:33 UTC by Johannes Berg
Modified: 2020-06-02 11:39 UTC
fweimer: security-


Attachments
simple test program (384 bytes, text/x-csrc)
2020-05-24 19:33 UTC, Johannes Berg

Description Johannes Berg 2020-05-24 19:33:31 UTC
Created attachment 12566
simple test program

It seems that, according to RFC 3629, mbrtowc() should return -1 here, since the input encodes an invalid character: U+2F43580, which is not in the range [U+0, U+10FFFF].
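
To see where U+2F43580 comes from, the bytes can be decoded by hand using the obsolete five-octet form (111110xx followed by four 10xxxxxx continuation octets), which the RFC 3629 table no longer contains. This is only a sketch for checking the arithmetic; the helper name decode_5byte is hypothetical and is not part of glibc or of the attached test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one obsolete 5-octet sequence
   (111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx).
   Returns the decoded value, or UINT32_MAX if the bytes
   do not have that shape. */
uint32_t decode_5byte(const unsigned char *s)
{
    assert(s != NULL);

    /* The lead octet must be 111110xx; it contributes 2 payload bits. */
    if ((s[0] & 0xFC) != 0xF8)
        return UINT32_MAX;
    uint32_t cp = s[0] & 0x03;

    /* Each continuation octet must be 10xxxxxx and contributes 6 bits. */
    for (int i = 1; i < 5; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return UINT32_MAX;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return cp;
}
```

Applied to the bytes from the bug title, fa bd 83 96 80, the helper yields 0x2F43580, which is larger than 0x10FFFF.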

This came up because Python does this conversion using mbstowcs() and/or mbrtowc(), and only later checks that valid characters were returned.
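
A caller that cannot rely on the library rejecting such input has to validate the result itself. The following is a minimal sketch of that pattern, not Python's actual code; the names is_valid_scalar and decode_checked are hypothetical, and the surrogate check reflects RFC 3629's prohibition of U+D800..U+DFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical: 1 if cp is in [U+0, U+10FFFF] and not a surrogate. */
int is_valid_scalar(unsigned long cp)
{
    return cp <= 0x10FFFFUL && !(cp >= 0xD800UL && cp <= 0xDFFFUL);
}

/* Hypothetical wrapper: decode one character with mbrtowc() in the
   current locale, then reject results outside the valid range.
   Returns the number of bytes consumed, or (size_t)-1 on error,
   on an incomplete sequence, or on an out-of-range result. */
size_t decode_checked(const char *s, size_t n, wchar_t *out)
{
    assert(s != NULL && out != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);   /* start in the initial conversion state */

    size_t r = mbrtowc(out, s, n, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return (size_t)-1;       /* invalid or incomplete sequence */
    if (!is_valid_scalar((unsigned long)*out))
        return (size_t)-1;       /* decoded, but not a valid character */
    return r;
}
```

With this wrapper, a value such as 0x2F43580 is rejected even if mbrtowc() itself reports success.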

The python discussion is here:

https://bugs.python.org/issue35883

but given the language in RFC 3629, it seems like an issue in glibc:


3.  UTF-8 definition

[...]

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.

[...]

      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

[...]

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.

[...]
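
Read against the table above, a decoder can already reject the sequence from its first octet: RFC 3629 notes that the octet values C0, C1 and F5 to FF never appear in UTF-8, and 0xFA falls in that last group. A minimal sketch of such a lead-octet check follows; the name utf8_seq_len is hypothetical, and the check covers only the first octet, not the further restrictions on continuation octets.

```c
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead octet
   under RFC 3629, or 0 if the octet cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F)
        return 1;                 /* 0xxxxxxx */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                 /* 110xxxxx; C0 and C1 would be overlong */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;                 /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                 /* 11110xxx; F5..F7 would exceed U+10FFFF */
    return 0;                     /* continuation octet, C0, C1, or F5..FF */
}
```

For the input in this report, utf8_seq_len(0xFA) returns 0, so a decoder built this way would fail before reading any continuation octets.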
Comment 1 jsm-csl@polyomino.org.uk 2020-06-01 22:38:04 UTC
Probably the same issue as bug 2373.
Comment 2 Johannes Berg 2020-06-02 11:06:49 UTC
Hm, yeah, that sounds the same, I had only searched for the specific function(s), not the broader issue. I guess I won't hold my breath for this to get fixed then ...
Comment 3 Florian Weimer 2020-06-02 11:33:09 UTC
Marking as duplicate per comment 2.

*** This bug has been marked as a duplicate of bug 2373 ***
Comment 4 Johannes Berg 2020-06-02 11:35:13 UTC
Hm, one note though. I was just mentioning this to somebody: bug 2373 talks about *encoding*, while this one is mostly about *decoding*. So they're related, but not exactly the same; they're two sides of the same coin. Up to you whether or not you want to treat it as a duplicate. An argument could be made, for example, for allowing *encoding* such a value (why did the application store something > 0x10ffff in a wchar_t to start with? That was already invalid) but not *decoding* it, even if that breaks the round-trip property.
Comment 5 Florian Weimer 2020-06-02 11:39:20 UTC
Fair point. I have retitled bug 2373 accordingly.