Unicode security

Reini Urban reini.urban@gmail.com
Mon Jan 10 13:31:26 GMT 2022


Hi

Just a heads up on Unicode:
Now that gcc has joined the long list of supporters of insecure Unicode
identifiers, identifiers are no longer identifiable: attackers can abuse
UTF-8 homoglyphs, spoofing or even bidi, so the chance of some real-world
attacks is higher. So far only D, clang (since 3.3) and exotic languages
(like nim, crystal) supported binary chunks as names, just like the
typical linux filesystem.

There's no problem with ld and bfd per se. bfd treats its names as
symbols, not identifiers, and symbols are permitted to be unreadable and
unidentifiable binary chunks.
The problem is object files being used as ABI and inherently as API (via
headers, FFIs, linker scripts and .def files).

I outlined it here
https://github.com/rurban/libu8ident/blob/master/c23%2B%2Bproposal.md#12-issues-with-binutils-linkers-exported-identifiers
in my C23++ (and C23) proposal to follow the Unicode security guidelines
for identifiers, TR39. This is not finished yet; I'm still working on some
stats and a better TR31 charset subset for XIDs (identifiers).

See eg. this C file:

#include <assert.h>
int  الناس = 0;

int الإء() {
    return  الناس;
}
int main() {
  int ير = 1;
  assert(ير == 1 );
  return الإء();
}

which can now be compiled with gcc-10, leading to different interpretations
in the C preprocessor:
gcc cpp =>
# 2 "texts/arabic-1.c"
int \U00000627\U00000644\U00000646\U00000627\U00000633 = 0;
int \U00000627\U00000644\U00000625\U00000621() {
    return \U00000627\U00000644\U00000646\U00000627\U00000633;
}

i.e. interpretation as utf-8, converted to extended identifiers with \U
codepoints

in llvm/clang cpp:
# 2 "texts/arabic-1.c" 2
int الناس = 0;

int الإء() {
    return الناس;
}
i.e. the UTF-8 is kept as-is, and its -emit-llvm output is
@"\D8\A7\D9\84\D9\86\D8\A7\D8\B3" = dso_local global i32 0, align 4
@.str = private unnamed_addr constant [10 x i8] c"\D9\8A\D8\B1 == 1\00", align 1
@.str.1 = private unnamed_addr constant [17 x i8] c"texts/arabic-1.c\00", align 1
@__PRETTY_FUNCTION__.main = private unnamed_addr constant [11 x i8] c"int main()\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @"\D8\A7\D9\84\D8\A5\D8\A1"() #0 {
  %1 = load i32, i32* @"\D8\A7\D9\84\D9\86\D8\A7\D8\B3", align 4
  ret i32 %1
}
...

keeping the UTF-8 bytes.

now to binutils:
 nm arabic-1.o
                 U __assert_fail
0000000000000010 T main
0000000000000000 T الإء
0000000000000000 B الناس

Of course, as the UTF-8 chars are kept as-is. The exported functions can
include homoglyphs, and if so all variants will display as-is, and without
Unicode tools you'll have no idea which is which.
But what if the object file was compiled by some compiler using SHIFT-JIS
or KOI8, or even worse, in UTF-8 with Cyrillic homoglyphic letters? An FFI
or linker will have a hard time linking to that.

So sooner or later some ELF/COFF/bla header field will be needed to state
the obvious: the names are UTF-8.
And sooner or later binutils will need to restrict its symbols to be
identifiable, as will linux filesystems.
Therefore I'll provide the utils for Unicode security for identifiers here:
https://github.com/rurban/libu8ident
It mostly restricts the id_start and id_cont characters (to some
recommended scripts), and checks for illegal combining marks, illegal
mixed scripts and normalization issues.

bfd needs to find names, and it could look up names normalized (e.g. NFC).
There is a proposal for the C23++ standard to demand NFC only, so most
combining marks will become illegal; only NFC names are allowed.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html (all
in favor)
What's standardized for C23++/C23 will be good enough for ld also, I
suppose, just that there's no -std=c23 flag or such.
And not even grep can search normalized strings yet. Well, someone has to
start, and it will be C++. In all fairness, first was Java, then my cperl,
then Rust, which did Unicode support properly.

E.g. for binutils an olint will be needed, linting object files and
libraries for un-identifiable names.
With bfd/ld/objdump it could also start as a warning, like the recent
gcc bidi warning.
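
As a rough sketch of what such a warning could look at, the function below
scans a symbol name for the explicit bidi controls U+202A..U+202E
(embeddings and overrides) and U+2066..U+2069 (isolates). The name
symname_has_bidi_control is hypothetical, nothing like it exists in bfd
today, and I am assuming the name is already well-formed UTF-8. I make no
claim that this is the exact set the gcc warning uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check, for illustration only (not a bfd API).
   Assumes s holds well-formed UTF-8.  Returns true if the name
   contains one of the explicit bidi controls:
     U+202A..U+202E  encoded as E2 80 AA..AE
     U+2066..U+2069  encoded as E2 81 A6..A9 */
static bool symname_has_bidi_control(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i + 2 < len; i++) {
        if (s[i] != 0xE2)
            continue;
        if (s[i + 1] == 0x80 && s[i + 2] >= 0xAA && s[i + 2] <= 0xAE)
            return true;
        if (s[i + 1] == 0x81 && s[i + 2] >= 0xA6 && s[i + 2] <= 0xA9)
            return true;
    }
    return false;
}
```

A linter would call this once per symbol and warn on a true result. Plain
Arabic or Hebrew names do not trigger it, since they carry their
direction implicitly and need no such controls.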

I now have the following errors:
ENCODING, XID, SCRIPT, SCRIPTS, COMBINE, optional CONFUS.
ENCODING checks for illegal UTF-8 encodings.
XID checks for violations of the TR31 character sets for identifiers. The
Allowed IdentifierStatus (TR39) is a good set, but for C23 there will be a
different set.
SCRIPT checks for disallowed, uncommon scripts (languages) defined in TR39.
SCRIPTS checks for violations of a TR39 mixed-scripts profile, where the
recommended profile is Moderately Restrictive or a C23 variant C23_4,
which allows Greek (math) letters together with Latin.
COMBINE checks for illegal combining mark sequences, where the mark does
not fit the base char. (TR39)
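
To illustrate just the first of these, ENCODING, here is a minimal sketch of
a strict UTF-8 well-formedness check on a symbol name. This is not the
libu8ident code; symname_is_utf8 is a hypothetical name and the function
only shows the idea. It rejects stray continuation bytes, truncated
sequences, overlong forms, surrogates and anything above U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check, for illustration only (not the libu8ident API).
   Returns true if the len bytes at s are well-formed UTF-8. */
static bool symname_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char c = s[i];
        size_t n;               /* number of continuation bytes */
        unsigned long cp, min;  /* code point, smallest legal value */

        if (c < 0x80) {         /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation or invalid lead */
        }
        if (len - i <= n)
            return false;       /* sequence is truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min)
            return false;       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;       /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;       /* beyond Unicode */
        i += n + 1;
    }
    return true;
}
```

Names written in SHIFT-JIS or KOI8 will mostly fail such a check, because
their high bytes rarely form valid UTF-8 sequences. The other checks (XID,
SCRIPT, SCRIPTS, COMBINE) need the Unicode data tables and are not covered
by this sketch.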

CONFUS is just bikeshedding for corporate language lawyers, but the rest
are real security problems.
-- 
Reini Urban

