Bug 2373

Summary: Restrict UTF-8 to 17 planes, as required by RFC 3629
Product: glibc
Component: locale
Version: 2.3.6
Status: NEW ---
Severity: normal
Priority: P2
Target Milestone: ---
Reporter: Joe Wells <jbwells>
Assignee: Not yet assigned to anyone <unassigned>
CC: bruno, fw, fweimer, glibc-bugs, johannes, roman.zilka
Flags: fweimer: security-
Host: Target:
Build: Last reconfirmed:

Description Joe Wells 2006-02-21 01:07:39 UTC
Two of the three standards that define "UTF-8" restrict it to encoding
characters whose code points are from 0 to 0x10FFFF.  Only the ISO
standard allows larger code points.  The Unicode standard and the RFC
both insist on only allowing code points up to 0x10FFFF to be encoded
in UTF-8.  The Unicode consortium has pledged never to assign a
character to a code point above 0x10FFFF.  People have predicted that
the ISO will come into agreement on this point soon although it seems
not to have happened yet.

The experts seem to be in agreement that for security reasons it is a
good idea to impose the restriction to a maximum code point of
0x10FFFF when encoding/decoding UTF-8.

The iconv program does not impose this restriction.  It is pretty
clear from reading the code in iconv/gconv_simple.c that it allows up
to six bytes for the UTF-8 encoding and does not check whether the
value is above 0x10FFFF.  I have encountered this in practice using
iconv to filter questionable data where iconv has let through illegal
code points.  For example, I have right now a file resulting from
"iconv -c -f UTF-8 -t UTF-8" which exhibits the illegal character
U+176DF8.
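
For illustration, the requested restriction can be sketched in C roughly as follows. This is only a minimal sketch under stated assumptions, not the code in iconv/gconv_simple.c; the name decode_utf8_strict is hypothetical. It refuses 5- and 6-byte lead bytes, overlong forms, surrogates, and any value above 0x10FFFF, so the byte sequence F5 B6 B7 B8 (which the unrestricted scheme decodes as U+176DF8) is rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of glibc: decode one UTF-8 sequence from
   s[0..n) into *cp.  Returns the number of bytes consumed, or 0 if the
   sequence is truncated, malformed, overlong, a surrogate, or encodes a
   value above U+10FFFF.  */
static size_t
decode_utf8_strict (const unsigned char *s, size_t n, uint32_t *cp)
{
  /* Smallest code point that legitimately needs a sequence of each length.  */
  static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t len, i;
  uint32_t c;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    { len = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0)
    { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0)
    { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0)
    { len = 4; c = s[0] & 0x07; }
  else
    return 0;   /* Stray continuation byte, or a 5-/6-byte lead byte.  */

  if (n < len)
    return 0;   /* Truncated input.  */
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }

  if (c < min_for_len[len])          /* Overlong form.  */
    return 0;
  if (c >= 0xD800 && c <= 0xDFFF)    /* UTF-16 surrogate.  */
    return 0;
  if (c > 0x10FFFF)                  /* Beyond the 17 planes.  */
    return 0;

  *cp = c;
  return len;
}
```

The 4-byte form can still express values up to 0x1FFFFF, which is why the explicit comparison against 0x10FFFF is needed even after 5- and 6-byte sequences are refused.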

Can you please make iconv impose this restriction for UTF-8?

If needed, it might be okay to make "UTF-8" mean UTF-8 according to
Unicode and the RFC and allow another name (perhaps "UTF-8(ISO)"?)  to
mean the current unrestricted version, just in case someone
desperately needs the current UTF-8 support and (for some bizarre
reason!) its ability to encode values above 0x10FFFF.
Comment 1 Ulrich Drepper 2006-04-26 06:31:16 UTC
I don't agree at all.  There is no reason to possibly break someone's code. 
Nobody has ever shown any evidence why this is a bad idea.
Comment 2 Florian Weimer 2016-05-08 13:56:40 UTC
Current consensus appears to be that this is indeed a glibc bug.

I'm marking this particular encoding bug as security- because, given various factors (lack of wchar_t use, lack of defined character properties in astral planes), a security impact seems unlikely.
Comment 3 Bruno Haible 2016-11-17 22:40:42 UTC
Re comment 1:
> Nobody has ever shown any evidence why this is a bad idea.
Validating input is rule #1 among the "secure coding practices", e.g.
https://www.securecoding.cert.org/confluence/display/seccode/Top+10+Secure+Coding+Practices
Also frequently mentioned on http://cwe.mitre.org/top25/

Related attacks exist, e.g.
https://capec.mitre.org/data/definitions/80.html
Comment 4 Bruno Haible 2016-11-17 22:55:17 UTC
The related bug #19727 has been fixed.
Comment 5 Florian Weimer 2020-06-02 11:33:09 UTC
*** Bug 26034 has been marked as a duplicate of this bug. ***
Comment 6 Johannes Berg 2020-06-05 20:42:56 UTC
I was looking around for the ISO standard, but only found this:

https://unicode.org/L2/L2010/10038-fcd10646-main.pdf

which does in fact also specify code points only up to 0x10ffff. So maybe the ISO question that the original report mentioned *did* get settled.


https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf

seems to say the same, and the unicode website says:

"This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2020, sixth edition."
Comment 7 Andreas Schwab 2020-06-05 21:39:19 UTC
RFC 2044 defines UTF-8 as a 1-6 octet encoding, referencing ISO/IEC 10646-1:1993 as the source.  This was eventually updated by RFC 3629, which introduced the U+10FFFF limit, but citing ISO/IEC 10646-1:2000 as without that limit.
Comment 8 Johannes Berg 2020-06-06 21:53:42 UTC
Oh, ok. The original comment here seemed to imply that ISO was the last one to hold out for more space than the others.


To carry over some discussion from the bug I originally filed (which has since been closed as a duplicate of this one):

This came up because Python does this conversion using mbstowcs() and/or mbrtowc(), and then later checks that valid characters were returned.

The python discussion is here:

https://bugs.python.org/issue35883


Note that this isn't just about the range; the RFC also prohibits encoding the code points reserved for surrogate pairs:


RFC 3629:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.


(Python internally may actually allow using this in a UTF-8-like encoded string [that they call utf-8b] to carry arbitrary bytes around.)
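
As a rough illustration of the kind of after-the-fact check described above, a caller that does not trust the result of mbstowcs() could scan the wide characters itself. This is a minimal sketch, not Python's actual code; the name first_invalid_wchar is hypothetical, and it assumes a 32-bit wchar_t holding UCS-4 values, as on glibc.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: return the index of the first element of ws[0..n)
   that is not a Unicode scalar value (above U+10FFFF or in the surrogate
   range U+D800..U+DFFF), or n if every element is acceptable.
   Assumes wchar_t holds UCS-4 values, as it does on glibc.  */
static size_t
first_invalid_wchar (const wchar_t *ws, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      /* Negative wchar_t values become huge here and are rejected too.  */
      unsigned long c = (unsigned long) ws[i];

      if (c > 0x10FFFFUL || (c >= 0xD800UL && c <= 0xDFFFUL))
        return i;
    }
  return n;
}
```

A caller would pass the buffer and the count returned by mbstowcs(), and treat any result smaller than that count as a decoding error.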
Comment 9 Florian Weimer 2020-06-07 05:37:58 UTC
(In reply to Andreas Schwab from comment #7)
> RFC 2044 defines UTF-8 as a 1-6 octet encoding, referencing ISO/IEC
> 10646-1:1993 as the source.  This was eventually updated by RFC 3629, which
> introduced the U+10FFFF limit, but citing ISO/IEC 10646-1:2000 as without
> that limit.

Where? I think RFC 3629 still claims that the six byte limit per codepoint does not exist, in section 10:

   Another security issue occurs when encoding to UTF-8: the ISO/IEC
   10646 description of UTF-8 allows encoding character numbers up to
   U+7FFFFFFF, yielding sequences of up to 6 bytes.  There is therefore
   a risk of buffer overflow if the range of character numbers is not
   explicitly limited to U+10FFFF or if buffer sizing doesn't take into
   account the possibility of 5- and 6-byte sequences.
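
To make the buffer-sizing point in that passage concrete, here is a small sketch in C of how many bytes the unrestricted 1-6 octet scheme needs per code point. The function and constant names are hypothetical and it is not taken from glibc; it only shows that limiting code points to 0x10FFFF caps the length at 4 bytes, while the unrestricted scheme needs up to 6.

```c
#include <stdint.h>

/* Worst-case bytes per code point under each definition (names are
   illustrative only).  */
enum { UTF8_MAX_LEN_RFC3629 = 4, UTF8_MAX_LEN_UNRESTRICTED = 6 };

/* Hypothetical helper: number of bytes the original 1-6 octet UTF-8
   scheme needs to encode c, or 0 if c is above 0x7FFFFFFF and cannot be
   encoded at all.  */
static int
utf8_len_unrestricted (uint32_t c)
{
  if (c < 0x80)
    return 1;
  if (c < 0x800)
    return 2;
  if (c < 0x10000)
    return 3;
  if (c < 0x200000)
    return 4;
  if (c < 0x4000000)
    return 5;
  if (c <= 0x7FFFFFFF)
    return 6;
  return 0;
}
```

An encoder that sizes its output buffer at 4 bytes per character is therefore only safe if it also refuses input above 0x10FFFF.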
Comment 10 Andreas Schwab 2020-06-07 06:17:48 UTC
How does that disagree with what I wrote?
Comment 11 Florian Weimer 2020-06-30 13:05:56 UTC
(In reply to Andreas Schwab from comment #7)
> RFC 2044 defines UTF-8 as a 1-6 octet encoding, referencing ISO/IEC
> 10646-1:1993 as the source.  This was eventually updated by RFC 3629, which
> introduced the U+10FFFF limit, but citing ISO/IEC 10646-1:2000 as without
> that limit.

This is very misleading. I have a copy of ISO/IEC 10646-1 : 2000(E), bought directly from ISO, and Annex D (which is normative) still specifies 7FFF FFFF as the maximum UCS-4 value.

If UTF-8 is restricted to 17 planes in ISO 10646, this restriction has been introduced in a later version of the standard.

I don't think this matters because RFC 3629 is a publicly accessible standard, so I think this is what we should follow anyway.
Comment 12 Andreas Schwab 2020-06-30 13:18:20 UTC
> This is very misleading. I have a copy of ISO/IEC 10646-1 : 2000(E), bought
> directly from ISO, and Annex D (which is normative) still specifies 7FFF
> FFFF as the maximum UCS-4 value.

In which way does that differ from what I wrote?
Comment 13 jsm-csl@polyomino.org.uk 2020-06-30 14:56:37 UTC
The limit was in ISO 10646 in the 2011 edition but not in the 2003 
edition.  
https://sourceware.org/legacy-ml/libc-alpha/2012-09/msg00112.html has my 
notes on a previous investigation of when the limit was introduced.