26034 – mbrtowc(&x, "\xfa\xbd\x83\x96\x80", 5, NULL) return 5, instead of -1 with UTF-8 locale

Status:	RESOLVED DUPLICATE of bug 2373

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	locale
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2020-05-24 19:33 UTC by Johannes Berg
Modified:	2020-06-02 11:39 UTC
CC List:	1 user

Flags:	fweimer: security-

Attachments
simple test program (384 bytes, text/x-csrc) 2020-05-24 19:33 UTC, Johannes Berg

Description Johannes Berg 2020-05-24 19:33:31 UTC

Created attachment 12566
simple test program

According to RFC 3629, it seems that -1 should be returned here, since the sequence encodes an invalid character: U+2F43580, which is outside the range [U+0, U+10FFFF].
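
The value U+2F43580 can be checked by hand. The sketch below decodes one sequence using the older 1-to-6-octet UTF-8 layout, with no range check. The helper name decode_legacy_utf8 is hypothetical and this is not glibc's code; it only illustrates the bit arithmetic behind the number quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a glibc function.
   Decodes one sequence in the older 1-to-6-octet UTF-8 layout.
   It does NOT reject overlong forms, surrogates, or values above
   0x10FFFF, which is exactly the gap this report is about.
   Returns the number of bytes consumed, or 0 if the input is
   malformed or truncated. */
static size_t decode_legacy_utf8(const unsigned char *s, size_t n,
                                 uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    unsigned char b = s[0];
    size_t len;
    uint32_t cp;

    /* The lead byte selects the length and carries the top payload bits. */
    if (b < 0x80)                { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else if ((b & 0xFC) == 0xF8) { len = 5; cp = b & 0x03; }
    else if ((b & 0xFE) == 0xFC) { len = 6; cp = b & 0x01; }
    else
        return 0; /* stray continuation byte, or 0xFE/0xFF */

    if (n < len)
        return 0;

    /* Each continuation byte (10xxxxxx) contributes six payload bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

For the bytes fa bd 83 96 80, the lead byte 0xfa matches the 5-octet pattern 111110xx, and the 26 payload bits combine to 0x2F43580, well above 0x10FFFF.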

This came up because Python does this conversion using mbstowcs() and/or mbrtowc(), and only later checks that the returned characters are valid.

The Python discussion is here:

https://bugs.python.org/issue35883

but given the language in RFC 3629, it seems like an issue in glibc:


3.  UTF-8 definition

[...]

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.

[...]

      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

[...]

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.

[...]
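
As an illustration of what "protect against decoding invalid sequences" could mean in practice, here is a minimal sketch of a strict decoder that follows the RFC 3629 table quoted above. The name decode_strict_utf8 and its return convention are assumptions made for this example; it is not glibc's implementation. Unlike mbrtowc(), which reports an incomplete sequence as (size_t)-2, this sketch returns -1 for truncated input as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict decoder, written against the RFC 3629 table.
   Returns the number of bytes consumed (1 to 4), or -1 if the input
   is invalid or truncated. */
static int decode_strict_utf8(const unsigned char *s, size_t n,
                              uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0)
        return -1;

    unsigned char b = s[0];
    int len;
    uint32_t cp;
    uint32_t min; /* smallest value allowed for this sequence length */

    if (b < 0x80)                { len = 1; cp = b;        min = 0; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
    else
        return -1; /* 5- and 6-octet lead bytes are rejected outright */

    if (n < (size_t)len)
        return -1;

    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1; /* not a 10xxxxxx continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min)
        return -1; /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1; /* UTF-16 surrogate range */
    if (cp > 0x10FFFF)
        return -1; /* beyond the last Unicode code point */

    *out = cp;
    return len;
}
```

With this sketch, the sequence from this report is rejected at the lead byte 0xfa, while f4 8f bf bf (U+10FFFF) is accepted and f4 90 80 80 (which would be 0x110000) is rejected by the final range check.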

Comment 1 jsm-csl@polyomino.org.uk 2020-06-01 22:38:04 UTC

Probably the same issue as bug 2373.

Comment 2 Johannes Berg 2020-06-02 11:06:49 UTC

Hm, yeah, that sounds the same; I had only searched for the specific function(s), not the broader issue. I guess I won't hold my breath for this to get fixed then ...

Comment 3 Florian Weimer 2020-06-02 11:33:09 UTC

Marking as duplicate per comment 2.

*** This bug has been marked as a duplicate of bug 2373 ***

Comment 4 Johannes Berg 2020-06-02 11:35:13 UTC

Hm, one note though. I was just mentioning this to somebody, and bug 2373 talks about *encoding* while this one is mostly about *decoding*. So it's related, but not exactly the same. It's up to you whether or not to treat it as a duplicate; they are two sides of the same coin. An argument could be made, for example, for allowing *encoding* such a value (why did the application store something >0x10ffff in a wchar_t to start with? That was already invalid) but not *decoding* it, even if that breaks the round-trip property.

Comment 5 Florian Weimer 2020-06-02 11:39:20 UTC

Fair point. I have retitled bug 2373 accordingly.