Summary: | Restrict UTF-8 to 17 planes, as required by RFC 3629 | ||
---|---|---|---|
Product: | glibc | Reporter: | Joe Wells <jbwells> |
Component: | locale | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | bruno, fw, fweimer, glibc-bugs, johannes, roman.zilka |
Priority: | P2 | Flags: | fweimer: security- |
Version: | 2.3.6 | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Joe Wells
2006-02-21 01:07:39 UTC
I don't agree at all. There is no reason to possibly break someone's code. Nobody has ever shown any evidence why this is a bad idea.

Current consensus appears to be that this is indeed a glibc bug.

I'm marking this particular encoding bug as security- because due to various factors (lack of wchar_t use, lack of defined character properties in astral planes), a security impact seems unlikely.

Re comment 1:
> Nobody has ever shown any evidence why this is a bad idea.

Validating input is rule #1 among the "secure coding practices", e.g. https://www.securecoding.cert.org/confluence/display/seccode/Top+10+Secure+Coding+Practices
Also frequently mentioned on http://cwe.mitre.org/top25/
Related attacks exist, e.g. https://capec.mitre.org/data/definitions/80.html

The related bug #19727 has been fixed.

*** Bug 26034 has been marked as a duplicate of this bug. ***

I was looking around for the ISO, but only found this: https://unicode.org/L2/L2010/10038-fcd10646-main.pdf which does in fact also specify only up to 0x10ffff. So maybe that *did* get settled, which the original report mentioned.

https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf seems to say the same, and the unicode website says: "This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2020, sixth edition."

RFC 2044 defines UTF-8 as a 1-6 octet encoding, referencing ISO/IEC 10646-1:1993 as the source. This was eventually updated by RFC 3629, which introduced the U+10FFFF limit, but citing ISO/IEC 10646-1:2000 as without that limit.

Oh, ok. The original comment here seemed to imply that ISO was the last one to hold out for more space than the others.

To carry over some discussion from the bug I originally filed (which was since closed as duplicate in favour of this one): This came up because Python does this conversion using mbstowcs() and/or mbrtowc(), but then later goes to check that valid characters were returned.
The python discussion is here: https://bugs.python.org/issue35883

Note that this isn't just about the range; the RFC also prohibits encoding the code points reserved for surrogate pairs. RFC 3629:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

(Python internally may actually allow using this in a UTF-8-like encoded string [that they call utf-8b] to carry arbitrary bytes around.)

(In reply to Andreas Schwab from comment #7)
> RFC 2044 defines UTF-8 as a 1-6 octet encoding, referencing ISO/IEC
> 10646-1:1993 as the source. This was eventually updated by RFC 3629, which
> introduced the U+10FFFF limit, but citing ISO/IEC 10646-1:2000 as without
> that limit.

Where? I think RFC 3629 still claims that the six byte limit per codepoint does not exist, in section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.

How does that disagree with what I wrote?

(In reply to Andreas Schwab from comment #7)
> RFC 2044 defines UTF-8 as a 1-6 octet encoding, referencing ISO/IEC
> 10646-1:1993 as the source. This was eventually updated by RFC 3629, which
> introduced the U+10FFFF limit, but citing ISO/IEC 10646-1:2000 as without
> that limit.

This is very misleading. I have a copy of ISO/IEC 10646-1 : 2000(E), bought directly from ISO, and Annex D (which is normative) still specifies 7FFF FFFF as the maximum UCS-4 value. If UTF-8 is restricted to 17 planes in ISO 10646, this restriction has been introduced in a later version of the standard.
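
For readers following the buffer-sizing point quoted from RFC 3629 section 10, here is a minimal sketch of how the encoded length depends on which range is accepted. The function names `utf8_len_legacy` and `utf8_len_rfc3629` are hypothetical and are not glibc interfaces. The byte-count boundaries follow the table in RFC 2044; the second function applies the U+10FFFF limit and the surrogate exclusion from RFC 3629.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of glibc: number of bytes the original
   1-6 octet scheme (RFC 2044) uses for code point c, or 0 if c is
   larger than 0x7FFFFFFF and cannot be encoded at all. */
size_t utf8_len_legacy(uint32_t c)
{
    if (c <= 0x7F)       return 1;
    if (c <= 0x7FF)      return 2;
    if (c <= 0xFFFF)     return 3;
    if (c <= 0x1FFFFF)   return 4;
    if (c <= 0x3FFFFFF)  return 5;
    if (c <= 0x7FFFFFFF) return 6;
    return 0;
}

/* Hypothetical helper: same question under RFC 3629. Values above
   U+10FFFF and the surrogate range U+D800..U+DFFF are rejected (0),
   so the result is never larger than 4. */
size_t utf8_len_rfc3629(uint32_t c)
{
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    return utf8_len_legacy(c);
}
```

A caller that sizes output buffers at 4 bytes per character is only safe if it uses the second check (or an equivalent one) before encoding; with the first, a 6-byte worst case has to be assumed.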
I don't think this matters because RFC 3629 is a publicly accessible standard, so I think this is what we should follow anyway.

> This is very misleading. I have a copy of ISO/IEC 10646-1 : 2000(E), bought
> directly from ISO, and Annex D (which is normative) still specifies 7FFF
> FFFF as the maximum UCS-4 value.

In which way does that differ from what I wrote?

The limit was in ISO 10646 in the 2011 edition but not in the 2003 edition. https://sourceware.org/legacy-ml/libc-alpha/2012-09/msg00112.html has my notes on a previous investigation of when the limit was introduced.
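
As an illustration of the kind of post-conversion check the thread says Python performs after mbstowcs()/mbrtowc(), a caller-side workaround might look roughly like the sketch below. The names `is_unicode_scalar` and `checked_mbrtowc` are hypothetical and not part of glibc or Python. The sketch assumes wchar_t holds an ISO 10646 code point (as glibc indicates by defining `__STDC_ISO_10646__`); it is not a description of how a fix inside glibc would be implemented.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: true if c is a code point that RFC 3629 allows
   UTF-8 to encode, i.e. at most U+10FFFF and not a surrogate. */
bool is_unicode_scalar(uint32_t c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

/* Hypothetical wrapper around mbrtowc(): behaves like mbrtowc(), but
   reports EILSEQ when the decoded value is outside the RFC 3629 range.
   Assumption: wchar_t values are ISO 10646 code points. */
size_t checked_mbrtowc(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
    wchar_t wc = 0;
    size_t r = mbrtowc(&wc, s, n, ps);

    /* (size_t)-1 is an encoding error, (size_t)-2 an incomplete sequence. */
    if (r == (size_t)-1 || r == (size_t)-2)
        return r;

    /* The cast maps negative wchar_t values to large unsigned ones,
       which the range check then rejects. */
    if (!is_unicode_scalar((uint32_t)wc)) {
        errno = EILSEQ;
        return (size_t)-1;
    }

    if (pwc != NULL)
        *pwc = wc;
    return r;
}
```

After such a rejection the conversion state in `*ps` has already consumed the offending bytes, so a caller that wants to resynchronise would need to decide for itself how to continue.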