Unicode update of width and other character properties

Thomas Wolff towo@towo.net
Thu Aug 17 11:03:00 GMT 2017


On 08.08.2017 at 10:30, Corinna Vinschen wrote:
> On Aug  7 21:18, Thomas Wolff wrote:
>> On 07.08.2017 at 12:30, Corinna Vinschen wrote:
>>> On Aug  6 07:36, Thomas Wolff wrote:
>>>> Hi,
>>>> this is a proposal to update wcwidth and the character properties functions
>>>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>>>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>>>> as well as to simplify automatic generation of respective tables for an
>>>> easier update step.
>>>> Table size is moderate (using ranges for character properties) but there is
>>>> still an option to reduce the two big tables in size.
>>> As per the aforementioned discussion the table sizes are at least
>>> twice as big, so this should be done with all due caution towards
>>> the goals of smaller targets.
>> If I'm going to implement the packed versions, they will be even smaller
>> than the current tables.
>>
>> ...
>> how to produce the desired patch format/series.
> Just as with any other git-based project:
>
>    $ git checkout -b my-stuff
>    [hack, hack, hack]
>    $ git commit [in useful chunks]
>    $ git format-patch -X (X == number of commits)
>
>> And then the patch would be included here by email?
> Yes:
>
> $ git send-email --to="newlib@sourceware.org"
I'm attaching my patches here for assessment.
I have revised the table handling further, using gcc bit-field struct 
packing. The two big tables now have a total size of 14340 bytes for 
Unicode 10.0.
I have fixed locale handling in the isw* and tow* functions, but I have 
not yet changed the JP conversion. Unfortunately, the routines from 
newlib/iconvdata are not as straightforward to employ as I thought, 
because they work on multi-byte representations.
Also, the mapping of the ctype charsets (JIS, SJIS, EUC-JP) to the subsets 
handled in iconvdata (JIS-201/208/212) is a little obscure.
Equally obscure is the relation between newlib/iconvdata and 
newlib/libc/iconv.
To be on the safe side, I’m leaving the actual jp2uc conversion 
untouched for now, and I’ve just added a dummy back-conversion uc2jp 
with a #warning. If the #warning is ignored or removed, the non-Cygwin 
build should work as before, with only the locale handling fixed.
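
To illustrate the kind of packing meant above, here is a minimal sketch. 
It is not the code from the attached patches: the struct names, the field 
names and the 21/11 bit split are assumptions chosen for illustration, and 
the three entries are sample ranges rather than generated data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpacked reference layout: two full 32-bit values per range. */
struct plain_range {
    uint32_t first;
    uint32_t last;
};

/* Hypothetical packed layout (an assumption, not the patch's layout):
   21 bits are enough for any code point up to 0x10FFFF, and the
   remaining 11 bits store the range length minus one.
   __attribute__((packed)) is a gcc extension. */
struct packed_range {
    unsigned int first : 21;
    unsigned int extra : 11;
} __attribute__((packed));

/* Sample entries for illustration only. */
static const struct packed_range demo[] = {
    { 0x0300, 0x6F },   /* U+0300..U+036F */
    { 0x1160, 0x9F },   /* U+1160..U+11FF */
    { 0x200B, 0x00 },   /* U+200B */
};

/* Return 1 if code point c lies inside the packed range r, else 0. */
static int packed_contains(const struct packed_range *r, uint32_t c)
{
    uint32_t first = r->first;
    uint32_t last = first + r->extra;
    return c >= first && c <= last;
}
```

With gcc on common targets each packed entry takes 4 bytes instead of 8. 
The price in this sketch is that any range longer than 2048 code points 
has to be split into several entries.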

I'm attaching the wcwidth part here; all patches are available at 
http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (they don't fit 
within the mailbox size limit).
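
For readers who have not looked at wcwidth.c: the generated .t files are 
C interval arrays (the output of the uniset command "c"), meant to be 
searched by bisection. A rough sketch of such a lookup follows. The 
function names, the tiny sample tables and the return conventions are 
assumptions for illustration, not the actual newlib code; the real tables 
would come from combining.t and wide.t.

```c
#include <assert.h>
#include <stddef.h>

struct interval {
    unsigned long first;
    unsigned long last;
};

/* Tiny stand-in tables; a real build would presumably pull in the
   generated combining.t and wide.t instead (assumption). */
static const struct interval combining[] = {
    { 0x0300, 0x036F }, { 0x1160, 0x11FF }, { 0x200B, 0x200F }
};
static const struct interval wide[] = {
    { 0x1100, 0x115F }, { 0x2E80, 0x303E }, { 0xAC00, 0xD7A3 }
};

/* Binary search over a sorted, non-overlapping interval table. */
static int bisearch(unsigned long c, const struct interval *t, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < t[mid].first)
            hi = mid;
        else if (c > t[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Hypothetical width function: 0 for NUL and combining characters,
   -1 for control characters, 2 for wide characters, 1 otherwise. */
static int demo_width(unsigned long c)
{
    if (c == 0)
        return 0;
    if (c < 0x20 || (c >= 0x7F && c < 0xA0))
        return -1;
    if (bisearch(c, combining, sizeof combining / sizeof combining[0]))
        return 0;
    if (bisearch(c, wide, sizeof wide / sizeof wide[0]))
        return 2;
    return 1;
}
```

Keeping the data as sorted ranges is what makes a Unicode update cheap: 
only the tables are regenerated, the lookup code stays the same.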
Thomas

-------------- next part --------------
From 9c5d6b1adcf949269e3fceeaf31203921745d2c9 Mon Sep 17 00:00:00 2001
From: mintty <mintty@users.noreply.github.com>
Date: Mon, 14 Aug 2017 21:59:25 +0200
Subject: [PATCH 1/4] creation of width data, supporting Unicode updates

---
 newlib/libc/string/Makefile.widthdata |  47 +++
 newlib/libc/string/mkwide             |  49 +++
 newlib/libc/string/mkwidthA           |  20 +
 newlib/libc/string/uniset             | 678 ++++++++++++++++++++++++++++++++++
 4 files changed, 794 insertions(+)
 create mode 100644 newlib/libc/string/Makefile.widthdata
 create mode 100755 newlib/libc/string/mkwide
 create mode 100755 newlib/libc/string/mkwidthA
 create mode 100755 newlib/libc/string/uniset

diff --git a/newlib/libc/string/Makefile.widthdata b/newlib/libc/string/Makefile.widthdata
new file mode 100644
index 0000000..14adab5
--- /dev/null
+++ b/newlib/libc/string/Makefile.widthdata
@@ -0,0 +1,47 @@
+#############################################################################
+# generate Unicode width data for newlib/libc/string/wcwidth.c
+
+
+#############################################################################
+# table sets to be generated
+
+widthdata=combining.t ambiguous.t wide.t
+
+widthdata:	$(widthdata)
+
+
+#############################################################################
+# tools and data
+
+#WGET=wget -N -t 1 --timeout=55
+WGET=curl -R -O --connect-timeout 55
+WGET+=-z $@
+
+%.txt:
+	ln -s /usr/share/unicode/ucd/$@ . || $(WGET) http://unicode.org/Public/UNIDATA/$@
+
+uniset.tar.gz:
+	$(WGET) http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz
+
+uniset:	uniset.tar.gz
+	gzip -dc uniset.tar.gz | tar xvf - uniset
+
+
+#############################################################################
+# width data for libc/string/wcwidth.c
+
+combining.t:	uniset UnicodeData.txt Blocks.txt
+	PATH="${PATH}:." uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B +D7B0-D7C6 +D7CB-D7FB c > combining.t
+
+WIDTH-A:	uniset UnicodeData.txt Blocks.txt EastAsianWidth.txt
+	PATH="${PATH}:." sh ./mkwidthA
+
+ambiguous.t:	uniset WIDTH-A UnicodeData.txt Blocks.txt
+	PATH="${PATH}:." uniset +WIDTH-A -cat=Me -cat=Mn -cat=Cf c > ambiguous.t
+
+wide.t:	uniset UnicodeData.txt Blocks.txt EastAsianWidth.txt
+	PATH="${PATH}:." sh ./mkwide
+
+
+#############################################################################
+# end
diff --git a/newlib/libc/string/mkwide b/newlib/libc/string/mkwide
new file mode 100755
index 0000000..55a0bab
--- /dev/null
+++ b/newlib/libc/string/mkwide
@@ -0,0 +1,49 @@
+#! /bin/sh
+
+# generate list of wide characters, with convex closure
+
+skipcheck=false
+
+if [ ! -r EastAsianWidth.txt ]
+then	ln -s /usr/share/unicode/ucd/EastAsianWidth.txt . || exit 1
+fi
+if [ ! -r UnicodeData.txt ]
+then	ln -s /usr/share/unicode/ucd/UnicodeData.txt . || exit 1
+fi
+if [ ! -r Blocks.txt ]
+then	ln -s /usr/share/unicode/ucd/Blocks.txt . || exit 1
+fi
+
+sed -e "s,^\([^;]*\);[NAH],\1," -e t -e d EastAsianWidth.txt > wide.na
+sed -e "s,^\([^;]*\);[WF],\1," -e t -e d EastAsianWidth.txt > wide.fw
+
+PATH="$PATH:." # for uniset
+
+nrfw=`uniset +wide.fw nr | sed -e 's,.*:,,'`
+echo FW $nrfw
+nrna=`uniset +wide.na nr | sed -e 's,.*:,,'`
+echo NAH $nrna
+
+extrablocks="2E80-303E"
+
+# check all blocks
+includes () {
+	nr=`uniset +wide.$2 -$1 nr | sed -e 's,.*:,,'`
+	test $nr != $3
+}
+echo "adding compact closure of wide ranges, this may take ~10min"
+for b in $extrablocks `sed -e 's,^\([0-9A-F]*\)\.\.\([0-9A-F]*\).*,\1-\2,' -e t -e d Blocks.txt`
+do	range=$b
+	echo checking $range $* >&2
+	if includes $range fw $nrfw && ! includes $range na $nrna
+	then	echo $range
+	fi
+done > wide.blocks
+
+(
+sed -e "s,^,//," -e 1q EastAsianWidth.txt
+sed -e "s,^,//," -e 1q Blocks.txt
+uniset `sed -e 's,^,+,' wide.blocks` +wide.fw c
+) > wide.t
+
+rm -f wide.na wide.fw wide.blocks
diff --git a/newlib/libc/string/mkwidthA b/newlib/libc/string/mkwidthA
new file mode 100755
index 0000000..343ab40
--- /dev/null
+++ b/newlib/libc/string/mkwidthA
@@ -0,0 +1,20 @@
+#! /bin/sh
+
+# generate WIDTH-A file, listing Unicode characters with width property
+# Ambiguous, from EastAsianWidth.txt
+
+if [ ! -r EastAsianWidth.txt ]
+then	ln -s /usr/share/unicode/ucd/EastAsianWidth.txt . || exit 1
+fi
+if [ ! -r UnicodeData.txt ]
+then	ln -s /usr/share/unicode/ucd/UnicodeData.txt . || exit 1
+fi
+if [ ! -r Blocks.txt ]
+then	ln -s /usr/share/unicode/ucd/Blocks.txt . || exit 1
+fi
+
+sed -e "s,^\([^;]*\);A,\1," -e t -e d EastAsianWidth.txt > width-a-new
+rm -f WIDTH-A
+echo "# UAX #11: East Asian Ambiguous" > WIDTH-A
+PATH="$PATH:." uniset +width-a-new compact >> WIDTH-A
+rm -f width-a-new
diff --git a/newlib/libc/string/uniset b/newlib/libc/string/uniset
new file mode 100755
index 0000000..415e219
--- /dev/null
+++ b/newlib/libc/string/uniset
@@ -0,0 +1,678 @@
+#!/usr/bin/perl
+# Uniset -- Unicode subset manager -- Markus Kuhn
+# http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz
+# $Id: uniset,v 1.18 2004-04-10 21:19:39+01 mgk25 Exp mgk25 $
+
+require 5.008;
+use open ':utf8';
+
+binmode(STDOUT, ":utf8");
+binmode(STDIN, ":utf8");
+
+my (%name, %invname, %category, %comment);
+
+print <<End if $#ARGV < 0;
+Uniset -- Unicode subset manager -- Markus Kuhn
+
+Uniset allows to merge and subtract Unicode subsets. It can output and
+analyse the resulting set in various formats.
+
+The following commands can be supplied to uniset on the command line:
+
+Commands to define a set of characters:
+
+  + filename   add the character set described in the file to the set
+  - filename   remove the character set described in the file from the set
+  +: filename  add the characters in the UTF-8 file to the set
+  -: filename  remove the characters in the UTF-8 file from the set
+  +xxxx..yyyy  add the range to the set (xxxx and yyyy are hex numbers)
+  -xxxx..yyyy  remove the range from the set (xxxx and yyyy are hex numbers)
+  +cat=Xx      add all Unicode characters with category code Xx
+  -cat=Xx      remove all Unicode characters with category code Xx
+  -cat!=Xx     remove all Unicode characters without category code Xx
+  clean        remove any elements that do not appear in the Unicode database
+  unknown      remove any elements that do appear in the Unicode database
+
+Command to output descriptions of the constructed set of characters:
+
+  table        write a full table with one line per character
+  compact      output the set in compact MES format
+  c            output the set as C interval array
+  nr           output the number of characters
+  sources      output a table that shows the number of characters contributed
+               by the various combinations of input sets added with +.
+  utf8-list    output a list of all characters encoded in UTF-8
+
+Commands to tailor the following output commands:
+
+  html         write HTML tables instead of plain text
+  ucs          add the unicode character itself to the table (UTF-8 in
+               plain table, numeric character reference in HTML)
+
+Formats of character set input files read by the + and - command:
+
+Empty lines, white space at the start and end of the line and any
+comment text following a \# are ignored. The following formats are
+recognized
+
+xx yyyy             xx is the hex code in an 8-bit character set and yyyy
+                    is the corresponding Unicode value. Both can optionally
+                    be prefixed by 0x. This is the format used in the
+                    files on <ftp://ftp.unicode.org/Public/MAPPINGS/>.
+
+yyyy                yyyy (optionally prefixed with 0x) is a Unicode character
+                    belonging to the specified subset.
+
+yyyy-yyyy           a range of Unicode characters belonging to
+yyyy..yyyy          the specified subset.
+
+xx yy yy yy-yy yy   xx denotes a row (high-byte) and the yy specify
+                    corresponding low bytes or with a hyphen also ranges of
+                    low bytes in the Unicode values that belong to this
+                    subset. This is also the format that is generated by
+                    the compact command.
+End
+exit 1 if $#ARGV < 0;
+
+
+# Subroutine to identify whether the ISO 10646/Unicode character code
+# ucs belongs into the East Asian Wide (W) or East Asian FullWidth
+# (F) category as defined in Unicode Technical Report #11.
+
+sub iswide ($) {
+    my $ucs = shift(@_);
+
+    return ($ucs >= 0x1100 &&
+	    ($ucs <= 0x115f ||                     # Hangul Jamo
+	     $ucs == 0x2329 || $ucs == 0x232a ||
+	     ($ucs >= 0x2e80 && $ucs <= 0xa4cf &&
+	      $ucs != 0x303f) ||                   # CJK .. Yi
+	     ($ucs >= 0xac00 && $ucs <= 0xd7a3) || # Hangul Syllables
+	     ($ucs >= 0xf900 && $ucs <= 0xfaff) || # CJK Comp. Ideographs
+	     ($ucs >= 0xfe30 && $ucs <= 0xfe6f) || # CJK Comp. Forms
+	     ($ucs >= 0xff00 && $ucs <= 0xff60) || # Fullwidth Forms
+	     ($ucs >= 0xffe0 && $ucs <= 0xffe6) ||
+	     ($ucs >= 0x20000 && $ucs <= 0x2fffd) ||
+	     ($ucs >= 0x30000 && $ucs <= 0x3fffd)));
+}
+
+# Return the Unicode name that belongs to a given character code
+
+# Jamo short names, see Unicode 3.0, table 4-4, page 86
+
+my @lname = ('G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '',
+	     'J', 'JJ', 'C', 'K', 'T', 'P', 'H'); # 1100..1112
+my @vname = ('A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O',
+	     'WA', 'WAE', 'OE', 'YO', 'U', 'WEO', 'WE', 'WI', 'YU',
+	     'EU', 'YI', 'I'); # 1161..1175
+my @tname = ('G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM',
+	     'LB', 'LS', 'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS',
+	     'NG', 'J', 'C', 'K', 'T', 'P', 'H'); # 11a8..11c2
+
+sub name {
+    my $ucs = shift(@_);
+    
+    # The intervals used here reflect Unicode Version 3.2
+    if (($ucs >=  0x3400 && $ucs <=  0x4db5) ||
+	($ucs >=  0x4e00 && $ucs <=  0x9fa5) ||
+	($ucs >= 0x20000 && $ucs <= 0x2a6d6)) {
+	return "CJK UNIFIED IDEOGRAPH-" . sprintf("%04X", $ucs);
+    }
+    
+    if ($ucs >= 0xac00 && $ucs <= 0xd7a3) {
+	my $s = $ucs - 0xac00;
+	my $l = 0x1100 + int($s / (21 * 28));
+	my $v = 0x1161 + int(($s % (21 * 28)) / 28);
+	my $t = 0x11a7 + $s % 28;
+	return "HANGUL SYLLABLE " . 
+	    ($lname[int($s / (21 * 28))] .
+	     $vname[int(($s % (21 * 28)) / 28)] .
+	     $tname[$s % 28 - 1]);
+    }
+    
+    return $name{$ucs};
+}
+
+sub is_unicode {
+    my $ucs = shift(@_);
+
+    # The intervals used here reflect Unicode Version 3.2
+    if (($ucs >=  0x3400 && $ucs <=  0x4db5) ||
+	($ucs >=  0x4e00 && $ucs <=  0x9fa5) ||
+	($ucs >=  0xac00 && $ucs <=  0xd7a3) ||
+	($ucs >= 0x20000 && $ucs <= 0x2a6d6)) {
+	return 1;
+    }
+    
+    return exists $name{$ucs};
+}
+
+
+my $html = 0;
+my $image = 0;
+my $adducs = 0;
+my $unicodedata = "UnicodeData.txt";
+my $blockdata = "Blocks.txt";
+my $datadir = "$ENV{HOME}/local/lib/ucs";
+
+# read list of all Unicode names
+if (!open(UDATA, $unicodedata) && !open(UDATA, "$datadir/$unicodedata")) {
+    die ("Can't open Unicode database '$unicodedata':\n$!\n\n" .
+	 "Please make sure that you have downloaded the file\n" .
+	 "ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt\n");
+}
+while (<UDATA>) {
+    if (/^([0-9,A-F]{4,8});([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*)$/) {
+	next if $2 ne '<control>' && substr($2, 0, 1) eq '<';
+	$ucs = hex($1);
+        $name{$ucs} = $2;
+	$invname{$2} = $ucs;
+	$category{$ucs} = $3;
+        $comment{$ucs} = $12;
+    } else {
+        die("Syntax error in line '$_' in file '$unicodedata'");
+    }
+}
+close(UDATA);
+
+# read list of all Unicode blocks
+if (!open(UDATA, $blockdata) && !open(UDATA, "$datadir/$blockdata")) {
+    die ("Can't open Unicode blockname list '$blockdata':\n$!\n\n" .
+	 "Please make sure that you have downloaded the file\n" .
+	 "ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt\n");
+}
+my $blocks = 0;
+my (@blockstart, @blockend, @blockname);
+while (<UDATA>) {
+    if (/^\s*([0-9,A-F]{4,8})\s*\.\.\s*([0-9,A-F]{4,8})\s*;\s*(.*)$/) {
+        $blockstart[$blocks] = hex($1);
+	$blockend  [$blocks] = hex($2);
+        $blockname [$blocks] = $3;
+	$blocks++;
+    } elsif (/^\s*\#/ || /^\s*$/) {
+	# ignore comments and empty lines
+    } else {
+        die("Syntax error in line '$_' in file '$blockdata'");
+    }
+}
+close(UDATA);
+if ($blockend[$blocks-1] < 0x110000) {
+    $blockstart[$blocks] = 0x110000;
+    $blockend  [$blocks] = 0x7FFFFFFF;
+    $blockname [$blocks] = "Beyond Plane 16";
+    $blocks++;
+}
+
+# process command line arguments
+while ($_ = shift(@ARGV)) {
+    if (/^html$/) {
+	$html = 1;
+    } elsif (/^ucs$/) {
+	$adducs = 1;
+    } elsif (/^img$/) {
+	$html = 1;
+	$image = 1;
+    } elsif (/^template$/) {
+	$template = shift(@ARGV);
+	open(TEMPLATE, $template) || die("Can't open template file '$template': '$!'");
+	while (<TEMPLATE>) {
+	    if (/^\#\s*include\s+\"([^\"]*)\"\s*$/) {
+		open(INCLUDE, $1) || die("Can't open template include file '$1': '$!'");
+		while (<INCLUDE>) {
+		    print $_;
+		}
+		close(INCLUDE);
+	    } elsif (/^\#\s*quote\s+\"([^\"]*)\"\s*$/) {
+		open(INCLUDE, $1) || die("Can't open template include file '$1': '$!'");
+		while (<INCLUDE>) {
+		    s/&/&amp;/g;
+		    s/</&lt;/g;
+		    print $_;
+		}
+		close(INCLUDE);
+	    } else {
+		print $_;
+	    }
+	}
+	close(TEMPLATE);
+    } elsif (/^\+cat=(.+)$/) {
+	# add characters with given category
+	$cat = $1;
+	for $i (keys(%category)) {
+	    $used{$i} = "[${cat}]" if $category{$i} eq $cat;
+	}
+    } elsif (/^\-cat=(.+)$/) {
+	# remove characters with given category
+	$cat = $1;
+	for $i (keys(%category)) {
+	    delete $used{$i} if $category{$i} eq $cat;
+	}
+    } elsif (/^\-cat!=(.+)$/) {
+	# remove characters without given category
+	$cat = $1;
+	for $i (keys(%category)) {
+	    delete $used{$i} unless $category{$i} eq $cat;
+	}
+    } elsif (/^([+-]):(.*)/) {
+	$remove = $1 eq "-";
+	$setfile = $2;
+	$setfile = shift(@ARGV) if $setfile eq "";
+	push(@SETS, $setfile);
+	open(SET, $setfile) || die("Can't open set file '$setfile': '$!'");
+	$setname = $setfile;
+	while (<SET>) {
+	    while ($_) {
+		$i = ord($_);
+		$used{$i} .= "[${setname}]" unless $remove;
+		delete $used{$i} if $remove;
+		$_ = substr($_, 1);
+	    }
+	}
+	close SET;
+    } elsif (/^([+-])(.*)/) {
+	$remove = $1 eq "-";
+	$setfile = $2;
+	$setfile = "$setfile..$setfile" if $setfile =~ /^([0-9A-Fa-f]{4,8})$/;
+	if ($setfile =~ /^([0-9A-Fa-f]{4,8})(-|\.\.)([0-9A-Fa-f]{4,8})$/) {
+	    # handle intervall specification on command line
+	    $first = hex($1);
+	    $last = hex($3);
+	    for ($i = $first; $i <= $last; $i++) {
+		$used{$i} .= "[ARG]" unless $remove;
+		delete $used{$i} if $remove;
+	    }
+	    next;
+	}
+	$setfile = shift(@ARGV) if $setfile eq "";
+	push(@SETS, $setfile);
+	open(SET, $setfile) || die("Can't open set file '$setfile': '$!'");
+	$cedf = ($setfile =~ /cedf/); # detect Kosta Kosti's trans CEDF format by path name
+	$setname = $setfile;
+	$setname =~ s/([^.\[\]]*)\..*/$1/;
+	while (<SET>) {
+	    if (/^<code_set_name>/) {
+		# handle ISO 15897 (POSIX registry) charset mapping format
+		undef $comment_char;
+		undef $escape_char;
+		while (<SET>) {
+		    if ($comment_char && /^$comment_char/) {
+			# remove comments
+			$_ = $`;
+		    }
+		    next if (/^\032?\s*$/);                                             # skip empty lines
+		    if (/^<comment_char> (\S)$/) {
+			$comment_char = $1;
+		    } elsif (/^<escape_char> (\S)$/) {
+			$escape_char = $1;
+		    } elsif (/^(END )?CHARMAP$/) {
+			#ignore
+		    } elsif (/^<.*>\s*\/x([0-9A-F]{2})\s*<U([0-9A-F]{4,8})>/) {
+			$used{hex($2)} .= "[${setname}{$1}]" unless $remove;
+			delete $used{hex($2)} if $remove;
+		    } else {
+			die("Syntax error in line $. in file '$setfile':\n'$_'\n");
+		    }
+		}
+		next;
+	    } elsif (/^STARTFONT /) {
+		# handle X11 BDF file
+		while (<SET>) {
+		    if (/^ENCODING\s+([0-9]+)/) { 
+			$used{$1} .= "[${setname}]" unless $remove;
+			delete $used{$1} if $remove;
+		    }
+		}
+		next;
+	    }
+	    tr/a-z/A-Z/;           # make input uppercase
+	    if ($cedf) {
+		if ($. > 4) {
+		    if (/^([0-9A-F]{2})\t.?\t(.*)$/) {
+			# handle Kosta Kosti's trans CEDF format
+			next if (hex($1) < 32 || (hex($1) > 0x7e && hex($1) < 0xa0));
+			$ucs = $invname{$2};
+			die "unknown ISO 10646 name '$2' in '$setfile' line $..\n" if ! $ucs;
+			$used{$ucs} .= "[${setname}{$1}]" unless $remove;
+			delete $used{$ucs} if $remove;
+		    } else {
+			die("Syntax error in line $. in CEDF file '$setfile':\n'$_'\n");
+		    }
+		}
+		next;
+	    }
+	    if (/^\s*(0X|U\+|U-)?([0-9A-F]{2})\s+\#\s*UNDEFINED\s*$/) {
+		# ignore ftp.unicode.org mapping file lines with #UNDEFINED
+		next;
+	    }
+	    s/^([^\#]*)\#.*$/$1/;  # remove comments
+	    next if (/^\032?\s*$/);     # skip empty lines
+	    if (/^\s*(0X)?([0-9A-F-]{2})\s+(0X|U\+|U-)?([0-9A-F]{4,8})\s*$/) {
+		# handle entry from a ftp.unicode.org mapping file
+		$used{hex($4)} .= "[${setname}{$2}]" unless $remove;
+		delete $used{hex($4)} if $remove;
+	    } elsif (/^\s*(0X|U\+|U-)?([0-9A-F]{4,8})(\s*-\s*|\s*\.\.\s*|\s+)(0X|U\+|U-)?([0-9A-F]{4,8})\s*$/) {
+		# handle interval specification
+		$first = hex($2);
+		$last = hex($5);
+		for ($i = $first; $i <= $last; $i++) {
+		    $used{$i} .= "[${setname}]" unless $remove;
+		    delete $used{$i} if $remove;
+		}
+	    } elsif (/^\s*([0-9A-F]{2,6})(\s+[0-9A-F]{2},?|\s+[0-9A-F]{2}-[0-9A-F]{2},?)+/) {
+		# handle lines from P10 MES draft
+		$row = $1;
+		$cols = $_;
+		$cols =~ s/^\s*([0-9A-F]{2,6})\s*(.*)\s*$/$2/;
+		$cols =~ tr/,//d;
+		@cols = split(/\s+/, $cols);
+		for (@cols) {
+		    if (/^(..)$/) {
+			$first = hex("$row$1");
+			$last  = $first;
+		    } elsif (/^(..)-(..)$/) {
+			$first = hex("$row$1");
+			$last  = hex("$row$2");
+		    } else {
+			die ("this should never happen '$_'");
+		    }
+		    for ($i = $first; $i <= $last; $i++) {
+			$used{$i} .= "[${setname}]" unless $remove;
+			delete $used{$i} if $remove;
+		    }
+		}
+	    } elsif (/^\s*(0X|U\+|U-)?([0-9A-F]{4,8})\s*/) {
+		# handle single character
+		$used{hex($2)} .= "[${setname}]" unless $remove;
+		delete $used{hex($2)} if $remove;
+	    } else {
+		die("Syntax error in line $. in file '$setfile':\n'$_'\n") unless /^\s*(\#.*)?$/;
+	    }
+	}
+	close SET;
+    } elsif (/^loadimages$/ || /^loadbigimages$/) {
+	if (/^loadimages$/) {
+	    $prefix = "Small.Glyphs";
+	} else {
+	    $prefix = "Glyphs";
+	}
+	$total = 0;
+	for $i (keys(%used)) {
+	    next if ($name{$i} eq "<control>");
+	    $total++;
+	}
+	$count = 0;
+	$| = 1;
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    $count++;
+	    $j = sprintf("%04X", $i);
+	    $j =~ /(..)(..)/;
+	    $gif = "http://charts.unicode.org/Unicode.charts/$prefix/$1/U$j.gif";
+	    print("\r$count/$total: $gif");
+	    system("mkdir -p $prefix/$1; cd $prefix/$1; webcopy -u -s $gif &");
+	    select(undef, undef, undef, 0.2);
+	}
+	print("\n");
+	exit 0;
+    } elsif (/^giftable/) {
+	# form a table of glyphs (requires pbmtools installed)
+	$count = 0;
+	for $i (keys(%used)) {
+	    $count++ unless $name{$i} eq "<control>";
+	}
+	$width = int(sqrt($count/sqrt(2)) + 0.5);
+	$width = $1 if /^giftable([0-9]+)$/;
+	system("rm -f tmp-*.pnm table.pnm~ table.pnm");
+	$col = 0;
+	$row = 0;
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    $j = sprintf("%04X", $i);
+	    $j =~ /(..)(..)/;
+	    $gif = "Small.Glyphs/$1/U$j.gif";
+	    $pnm = sprintf("tmp-%02x.pnm", $col);
+	    $fallback = "Small.Glyphs/FF/UFFFD.gif";
+	    system("giftopnm $gif >$pnm || { rm $pnm ; giftopnm $fallback >$pnm ; }");
+	    if (++$col == $width) {
+		system("pnmcat -lr tmp-*.pnm | cat >tmp-row.pnm");
+		if ($row == 0) {
+		    system("mv tmp-row.pnm table.pnm");
+		} else {
+		    system("mv table.pnm table.pnm~; pnmcat -tb table.pnm~ tmp-row.pnm >table.pnm");
+		}
+		$row++;
+		$col = 0;
+		system("rm -f tmp-*.pnm table.pnm~");
+	    }
+	}
+	if ($col > 0) {
+	    system("pnmcat -lr tmp-*.pnm | cat >tmp-row.pnm");
+	    if ($row == 0) {
+		system("mv tmp-row.pnm table.pnm");
+	    } else {
+		system("mv table.pnm table.pnm~; pnmcat -tb -jleft -black table.pnm~ tmp-row.pnm >table.pnm");
+	    }
+	}
+	system("rm -f table.gif ; ppmtogif table.pnm > table.gif");
+	system("rm -f tmp-*.pnm table.pnm~ table.pnm");
+    } elsif (/^table$/) {
+	# go through all used names to print full table
+	print "<TABLE border=2>\n" if $html;
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    if ($html) {
+		$sources = $used{$i};
+		$sources =~ s/\]\[/, /g;
+		$sources =~ s/^\[//g;
+		$sources =~ s/\]$//g;
+		$sources =~ s/\{(..)\}/<SUB>$1<\/SUB>/g;
+		$j = sprintf("%04X", $i);
+		$j =~ /(..)(..)/;
+		$gif = "Small.Glyphs/$1/U$j.gif";
+		print "<TR>";
+		print "<TD><img width=32 height=32 src=\"$gif\">" if $image;
+		printf("<TD>&#%d;", $i) if $adducs;
+		print "<TD><SAMP>$j</SAMP><TD><SAMP>" . name($i);
+		print " ($comment{$i})" if $comment{$i};
+		print "</SAMP><TD><SMALL>$sources</SMALL>\n";
+	    } else {
+		printf("%04X \# ", $i);
+		print pack("U", $i) . " " if $adducs;
+		print name($i) ."\n";
+	    }
+	}
+	print "</TABLE>\n" if $html;
+    } elsif (/^imgblock$/) {
+	$width = 16;
+	$width = $1 if /giftable([0-9]+)/;
+	$col = 0;
+	$subline = "";
+	print "\n<P><TABLE cellspacing=0 cellpadding=0>";
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    print "<TR>" if $col == 0;
+	    $j = sprintf("%04X", $i);
+	    $j =~ /(..)(..)/;
+	    $gif = "Small.Glyphs/$1/U$j.gif";
+	    $alt = name($i);
+	    print "<TD><img width=32 height=32 src=\"$gif\" alt=\"$alt\">";
+	    $subline .= "<TD><SMALL><SAMP>$j</SAMP></SMALL>";
+	    if (++$col == $width) {
+		print "<TR align=center>$subline";
+		$col = 0;
+		$subline = "";
+	    }
+	}
+	print "<TR align=center>$subline" if ($col > 0);
+	print "</TABLE>\n";
+    } elsif (/^sources$/) {
+	# count how many characters are attributed to the various source set combinations
+	print "<P>Number of occurences of source character set combinations:\n<TABLE border=2>" if $html;
+	for $i (keys(%used)) {
+	    next if ($name{$i} eq "<control>");
+	    $sources = $used{$i};
+	    $sources =~ s/\]\[/, /g;
+	    $sources =~ s/^\[//g;
+	    $sources =~ s/\]$//g;
+	    $sources =~ s/\{(..)\}//g;
+	    $contribs{$sources} += 1;
+	}
+	for $j (keys(%contribs)) {
+	    print "<TR><TD>$contribs{$j}<TD>$j\n" if $html;
+	}
+	print "</TABLE>\n" if $html;
+    } elsif (/^compact$/) {
+	# print compact table in P10 MES format
+	print "<P>Compact representation of this character set:\n<TABLE border=2>" if $html;
+	print "<TR><TD><B>Rows</B><TD><B>Positions (Cells)</B>" if $html;
+	print "\n# Plane 00\n# Rows\tPositions (Cells)\n" unless $html;
+	$current_row = '';
+	$start_col = '';
+	$last_col = '';
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    $row = sprintf("%02X", $i >> 8);
+	    $col = sprintf("%02X", $i & 0xff);
+	    if ($row ne $current_row) {
+		if (($last_col ne '') and ($last_col ne $start_col)) {
+		    print "-$last_col";
+		    print "</SAMP>" if $html;
+		}
+		print "<TR><TD><SAMP>$row</SAMP><TD><SAMP>" if $html;
+		print "\n  $row\t" unless $html;
+		$len = 0;
+		$current_row = $row;
+		$start_col = '';
+	    }
+	    if ($start_col eq '') {
+		print "$col";
+		$len += 2;
+		$start_col = $col;
+		$last_col = $col;
+	    } elsif (hex($col) == hex($last_col) + 1) {
+		$last_col = $col;
+	    } else {
+		if ($last_col ne $start_col) {
+		    print "-$last_col";
+		    $len += 3;
+		}
+		if ($len > 60 && !$html) {
+		    print "\n  $row\t";
+		    $len = 0;
+		};
+		print " " if $len;
+		print "$col";
+		$len += 2 + !! $len;
+		$start_col = $col;
+		$last_col = $col;
+	    }
+	}
+	if (($last_col ne '') and ($last_col ne $start_col)) {
+	    print "-$last_col";
+	    print "</SAMP>" if $html;
+	}
+	print "\n" if ($current_row ne '');
+	print "</TABLE>\n" if $html;
+	print "\n";
+    } elsif (/^c$/) {
+	# print table as C interval array
+	print "{";
+	$last_i = '';
+	$columns = 3;
+	$col = $columns;
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    if ($last_i eq '') {
+		if (++$col > $columns) { $col = 1; print "\n "; }
+		printf(" { 0x%04X, ", $i);
+		$last_i = $i;
+	    } elsif ($i == $last_i + 1) {
+		$last_i = $i;
+	    } else {
+		printf("0x%04X },", $last_i);
+		if (++$col > $columns) { $col = 1; print "\n "; }
+		printf(" { 0x%04X, ", $i);
+		$last_i = $i;
+	    }
+	}
+	if ($last_i ne '') {
+	    printf("0x%04X }", $last_i);
+	}
+	print "\n};\n";
+    } elsif (/^utf8-list$/) {
+	$col = 0;
+	$block = 0;
+	$last = -1;
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    while ($blockend[$block] < $i && $block < $blocks - 1) {
+		$block++;
+	    }
+	    if ($last <= $blockend[$block-1] &&
+		$i < $blockstart[$block]) {
+		print "\n" if ($col);
+		printf "\nFree block (U+%04X-U+%04X):\n\n",
+		    $blockend[$block-1] + 1, $blockstart[$block] - 1;
+		$col = 0;
+	    }
+	    if ($last < $blockstart[$block] && $i >= $blockstart[$block]) {
+		print "\n" if ($col);
+		printf "\n$blockname[$block] (U+%04X-U+%04X):\n\n",
+		$blockstart[$block], $blockend[$block];
+		$col = 0;
+	    }
+	    if ($category{$i} eq 'Mn') {
+		# prefix non-spacing character with U+25CC DOTTED CIRCLE
+		print "\x{25CC}";
+	    } elsif ($category{$i} eq 'Me') {
+		# prefix enclosing non-spacing character with space
+		print " ";
+	    }
+	    print pack("U", $i);
+	    $col += 1 + iswide($i);
+	    if ($col >= 64) {
+		print "\n";
+		$col = 0;
+	    }
+	    $last = $i;
+	}
+	print "\n" if ($col);
+    } elsif (/^collections$/) {
+	$block = 0;
+	$last = -1;
+	for $i (sort({$a <=> $b} keys(%used))) {
+	    next if ($name{$i} eq "<control>");
+	    while ($blockend[$block] < $i && $block < $blocks - 1) {
+		$block++;
+	    }
+	    if ($last < $blockstart[$block] && $i >= $blockstart[$block]) {
+		print $blockname[$block],
+		  " " x (40 - length($blockname[$block]));
+		printf "%04X-%04X\n",
+		  $blockstart[$block], $blockend[$block];
+	    }
+	    $last = $i;
+	}
+    } elsif (/^nr$/) {
+	print "<P>" if $html;
+	print "# " unless $html;
+	print "Number of characters in above table: ";
+	$count = 0;
+	for $i (keys(%used)) {
+	    $count++ unless $name{$i} eq "<control>";
+	}
+	print $count;
+	print "\n";
+    } elsif (/^clean$/) {
+	# remove characters from set that are not in $unicodedata
+	for $i (keys(%used)) {
+	    delete $used{$i} unless is_unicode($i);
+	}
+    } elsif (/^unknown$/) {
+	# remove characters from set that are in $unicodedata
+	for $i (keys(%used)) {
+	    delete $used{$i} if is_unicode($i);
+	}
+    } else {
+	die("Unknown command line command '$_'");
+    };
+}
-- 
2.13.2

-------------- next part --------------
From 00c7da38274b433f952a87732e58f2e22fc5229e Mon Sep 17 00:00:00 2001
From: mintty <mintty@users.noreply.github.com>
Date: Mon, 14 Aug 2017 22:00:44 +0200
Subject: [PATCH 2/4] generated width data, included in repository because of
 long creation time

---
 newlib/libc/string/WIDTH-A     | 569 +++++++++++++++++++++++++++++++++++++++++
 newlib/libc/string/ambiguous.t |  61 +++++
 newlib/libc/string/combining.t | 107 ++++++++
 newlib/libc/string/wide.t      |  33 +++
 4 files changed, 770 insertions(+)
 create mode 100644 newlib/libc/string/WIDTH-A
 create mode 100644 newlib/libc/string/ambiguous.t
 create mode 100644 newlib/libc/string/combining.t
 create mode 100644 newlib/libc/string/wide.t

diff --git a/newlib/libc/string/WIDTH-A b/newlib/libc/string/WIDTH-A
new file mode 100644
index 0000000..51e8f23
--- /dev/null
+++ b/newlib/libc/string/WIDTH-A
@@ -0,0 +1,569 @@
+# UAX #11: East Asian Ambiguous
+
+# Plane 00
+# Rows	Positions (Cells)
+
+  00	A1 A4 A7-A8 AA AD-AE B0-B4 B6-BA BC-BF C6 D0 D7-D8 DE-E1 E6 E8-EA
+  00	EC-ED F0 F2-F3 F7-FA FC FE
+  01	01 11 13 1B 26-27 2B 31-33 38 3F-42 44 48-4B 4D 52-53 66-67 6B
+  01	CE D0 D2 D4 D6 D8 DA DC
+  02	51 61 C4 C7 C9-CB CD D0 D8-DB DD DF
+  03	00-6F 91-A1 A3-A9 B1-C1 C3-C9
+  04	01 10-4F 51
+  20	10 13-16 18-19 1C-1D 20-22 24-27 30 32-33 35 3B 3E 74 7F 81-84
+  20	AC
+  21	03 05 09 13 16 21-22 26 2B 53-54 5B-5E 60-6B 70-79 89 90-99 B8-B9
+  21	D2 D4 E7
+  22	00 02-03 07-08 0B 0F 11 15 1A 1D-20 23 25 27-2C 2E 34-37 3C-3D
+  22	48 4C 52 60-61 64-67 6A-6B 6E-6F 82-83 86-87 95 99 A5 BF
+  23	12
+  24	60-E9 EB-FF
+  25	00-4B 50-73 80-8F 92-95 A0-A1 A3-A9 B2-B3 B6-B7 BC-BD C0-C1 C6-C8
+  25	CB CE-D1 E2-E5 EF
+  26	05-06 09 0E-0F 1C 1E 40 42 60-61 63-65 67-6A 6C-6D 6F 9E-9F BF
+  26	C6-CD CF-D3 D5-E1 E3 E8-E9 EB-F1 F4 F6-F9 FB-FC FE-FF
+  27	3D 76-7F
+  2B	56-59
+  32	48-4F
+  E0	00-FF
+  E1	00-FF
+  E2	00-FF
+  E3	00-FF
+  E4	00-FF
+  E5	00-FF
+  E6	00-FF
+  E7	00-FF
+  E8	00-FF
+  E9	00-FF
+  EA	00-FF
+  EB	00-FF
+  EC	00-FF
+  ED	00-FF
+  EE	00-FF
+  EF	00-FF
+  F0	00-FF
+  F1	00-FF
+  F2	00-FF
+  F3	00-FF
+  F4	00-FF
+  F5	00-FF
+  F6	00-FF
+  F7	00-FF
+  F8	00-FF
+  FE	00-0F
+  FF	FD
+  1F1	00-0A 10-2D 30-69 70-8D 8F-90 9B-AC
+  E01	00-EF
+  F00	00-FF
+  F01	00-FF
+  F02	00-FF
+  F03	00-FF
+  F04	00-FF
+  F05	00-FF
+  F06	00-FF
+  F07	00-FF
+  F08	00-FF
+  F09	00-FF
+  F0A	00-FF
+  F0B	00-FF
+  F0C	00-FF
+  F0D	00-FF
+  F0E	00-FF
+  F0F	00-FF
+  F10	00-FF
+  F11	00-FF
+  F12	00-FF
+  F13	00-FF
+  F14	00-FF
+  F15	00-FF
+  F16	00-FF
+  F17	00-FF
+  F18	00-FF
+  F19	00-FF
+  F1A	00-FF
+  F1B	00-FF
+  F1C	00-FF
+  F1D	00-FF
+  F1E	00-FF
+  F1F	00-FF
+  F20	00-FF
+  F21	00-FF
+  F22	00-FF
+  F23	00-FF
+  F24	00-FF
+  F25	00-FF
+  F26	00-FF
+  F27	00-FF
+  F28	00-FF
+  F29	00-FF
+  F2A	00-FF
+  F2B	00-FF
+  F2C	00-FF
+  F2D	00-FF
+  F2E	00-FF
+  F2F	00-FF
+  F30	00-FF
+  F31	00-FF
+  F32	00-FF
+  F33	00-FF
+  F34	00-FF
+  F35	00-FF
+  F36	00-FF
+  F37	00-FF
+  F38	00-FF
+  F39	00-FF
+  F3A	00-FF
+  F3B	00-FF
+  F3C	00-FF
+  F3D	00-FF
+  F3E	00-FF
+  F3F	00-FF
+  F40	00-FF
+  F41	00-FF
+  F42	00-FF
+  F43	00-FF
+  F44	00-FF
+  F45	00-FF
+  F46	00-FF
+  F47	00-FF
+  F48	00-FF
+  F49	00-FF
+  F4A	00-FF
+  F4B	00-FF
+  F4C	00-FF
+  F4D	00-FF
+  F4E	00-FF
+  F4F	00-FF
+  F50	00-FF
+  F51	00-FF
+  F52	00-FF
+  F53	00-FF
+  F54	00-FF
+  F55	00-FF
+  F56	00-FF
+  F57	00-FF
+  F58	00-FF
+  F59	00-FF
+  F5A	00-FF
+  F5B	00-FF
+  F5C	00-FF
+  F5D	00-FF
+  F5E	00-FF
+  F5F	00-FF
+  F60	00-FF
+  F61	00-FF
+  F62	00-FF
+  F63	00-FF
+  F64	00-FF
+  F65	00-FF
+  F66	00-FF
+  F67	00-FF
+  F68	00-FF
+  F69	00-FF
+  F6A	00-FF
+  F6B	00-FF
+  F6C	00-FF
+  F6D	00-FF
+  F6E	00-FF
+  F6F	00-FF
+  F70	00-FF
+  F71	00-FF
+  F72	00-FF
+  F73	00-FF
+  F74	00-FF
+  F75	00-FF
+  F76	00-FF
+  F77	00-FF
+  F78	00-FF
+  F79	00-FF
+  F7A	00-FF
+  F7B	00-FF
+  F7C	00-FF
+  F7D	00-FF
+  F7E	00-FF
+  F7F	00-FF
+  F80	00-FF
+  F81	00-FF
+  F82	00-FF
+  F83	00-FF
+  F84	00-FF
+  F85	00-FF
+  F86	00-FF
+  F87	00-FF
+  F88	00-FF
+  F89	00-FF
+  F8A	00-FF
+  F8B	00-FF
+  F8C	00-FF
+  F8D	00-FF
+  F8E	00-FF
+  F8F	00-FF
+  F90	00-FF
+  F91	00-FF
+  F92	00-FF
+  F93	00-FF
+  F94	00-FF
+  F95	00-FF
+  F96	00-FF
+  F97	00-FF
+  F98	00-FF
+  F99	00-FF
+  F9A	00-FF
+  F9B	00-FF
+  F9C	00-FF
+  F9D	00-FF
+  F9E	00-FF
+  F9F	00-FF
+  FA0	00-FF
+  FA1	00-FF
+  FA2	00-FF
+  FA3	00-FF
+  FA4	00-FF
+  FA5	00-FF
+  FA6	00-FF
+  FA7	00-FF
+  FA8	00-FF
+  FA9	00-FF
+  FAA	00-FF
+  FAB	00-FF
+  FAC	00-FF
+  FAD	00-FF
+  FAE	00-FF
+  FAF	00-FF
+  FB0	00-FF
+  FB1	00-FF
+  FB2	00-FF
+  FB3	00-FF
+  FB4	00-FF
+  FB5	00-FF
+  FB6	00-FF
+  FB7	00-FF
+  FB8	00-FF
+  FB9	00-FF
+  FBA	00-FF
+  FBB	00-FF
+  FBC	00-FF
+  FBD	00-FF
+  FBE	00-FF
+  FBF	00-FF
+  FC0	00-FF
+  FC1	00-FF
+  FC2	00-FF
+  FC3	00-FF
+  FC4	00-FF
+  FC5	00-FF
+  FC6	00-FF
+  FC7	00-FF
+  FC8	00-FF
+  FC9	00-FF
+  FCA	00-FF
+  FCB	00-FF
+  FCC	00-FF
+  FCD	00-FF
+  FCE	00-FF
+  FCF	00-FF
+  FD0	00-FF
+  FD1	00-FF
+  FD2	00-FF
+  FD3	00-FF
+  FD4	00-FF
+  FD5	00-FF
+  FD6	00-FF
+  FD7	00-FF
+  FD8	00-FF
+  FD9	00-FF
+  FDA	00-FF
+  FDB	00-FF
+  FDC	00-FF
+  FDD	00-FF
+  FDE	00-FF
+  FDF	00-FF
+  FE0	00-FF
+  FE1	00-FF
+  FE2	00-FF
+  FE3	00-FF
+  FE4	00-FF
+  FE5	00-FF
+  FE6	00-FF
+  FE7	00-FF
+  FE8	00-FF
+  FE9	00-FF
+  FEA	00-FF
+  FEB	00-FF
+  FEC	00-FF
+  FED	00-FF
+  FEE	00-FF
+  FEF	00-FF
+  FF0	00-FF
+  FF1	00-FF
+  FF2	00-FF
+  FF3	00-FF
+  FF4	00-FF
+  FF5	00-FF
+  FF6	00-FF
+  FF7	00-FF
+  FF8	00-FF
+  FF9	00-FF
+  FFA	00-FF
+  FFB	00-FF
+  FFC	00-FF
+  FFD	00-FF
+  FFE	00-FF
+  FFF	00-FD
+  1000	00-FF
+  1001	00-FF
+  1002	00-FF
+  1003	00-FF
+  1004	00-FF
+  1005	00-FF
+  1006	00-FF
+  1007	00-FF
+  1008	00-FF
+  1009	00-FF
+  100A	00-FF
+  100B	00-FF
+  100C	00-FF
+  100D	00-FF
+  100E	00-FF
+  100F	00-FF
+  1010	00-FF
+  1011	00-FF
+  1012	00-FF
+  1013	00-FF
+  1014	00-FF
+  1015	00-FF
+  1016	00-FF
+  1017	00-FF
+  1018	00-FF
+  1019	00-FF
+  101A	00-FF
+  101B	00-FF
+  101C	00-FF
+  101D	00-FF
+  101E	00-FF
+  101F	00-FF
+  1020	00-FF
+  1021	00-FF
+  1022	00-FF
+  1023	00-FF
+  1024	00-FF
+  1025	00-FF
+  1026	00-FF
+  1027	00-FF
+  1028	00-FF
+  1029	00-FF
+  102A	00-FF
+  102B	00-FF
+  102C	00-FF
+  102D	00-FF
+  102E	00-FF
+  102F	00-FF
+  1030	00-FF
+  1031	00-FF
+  1032	00-FF
+  1033	00-FF
+  1034	00-FF
+  1035	00-FF
+  1036	00-FF
+  1037	00-FF
+  1038	00-FF
+  1039	00-FF
+  103A	00-FF
+  103B	00-FF
+  103C	00-FF
+  103D	00-FF
+  103E	00-FF
+  103F	00-FF
+  1040	00-FF
+  1041	00-FF
+  1042	00-FF
+  1043	00-FF
+  1044	00-FF
+  1045	00-FF
+  1046	00-FF
+  1047	00-FF
+  1048	00-FF
+  1049	00-FF
+  104A	00-FF
+  104B	00-FF
+  104C	00-FF
+  104D	00-FF
+  104E	00-FF
+  104F	00-FF
+  1050	00-FF
+  1051	00-FF
+  1052	00-FF
+  1053	00-FF
+  1054	00-FF
+  1055	00-FF
+  1056	00-FF
+  1057	00-FF
+  1058	00-FF
+  1059	00-FF
+  105A	00-FF
+  105B	00-FF
+  105C	00-FF
+  105D	00-FF
+  105E	00-FF
+  105F	00-FF
+  1060	00-FF
+  1061	00-FF
+  1062	00-FF
+  1063	00-FF
+  1064	00-FF
+  1065	00-FF
+  1066	00-FF
+  1067	00-FF
+  1068	00-FF
+  1069	00-FF
+  106A	00-FF
+  106B	00-FF
+  106C	00-FF
+  106D	00-FF
+  106E	00-FF
+  106F	00-FF
+  1070	00-FF
+  1071	00-FF
+  1072	00-FF
+  1073	00-FF
+  1074	00-FF
+  1075	00-FF
+  1076	00-FF
+  1077	00-FF
+  1078	00-FF
+  1079	00-FF
+  107A	00-FF
+  107B	00-FF
+  107C	00-FF
+  107D	00-FF
+  107E	00-FF
+  107F	00-FF
+  1080	00-FF
+  1081	00-FF
+  1082	00-FF
+  1083	00-FF
+  1084	00-FF
+  1085	00-FF
+  1086	00-FF
+  1087	00-FF
+  1088	00-FF
+  1089	00-FF
+  108A	00-FF
+  108B	00-FF
+  108C	00-FF
+  108D	00-FF
+  108E	00-FF
+  108F	00-FF
+  1090	00-FF
+  1091	00-FF
+  1092	00-FF
+  1093	00-FF
+  1094	00-FF
+  1095	00-FF
+  1096	00-FF
+  1097	00-FF
+  1098	00-FF
+  1099	00-FF
+  109A	00-FF
+  109B	00-FF
+  109C	00-FF
+  109D	00-FF
+  109E	00-FF
+  109F	00-FF
+  10A0	00-FF
+  10A1	00-FF
+  10A2	00-FF
+  10A3	00-FF
+  10A4	00-FF
+  10A5	00-FF
+  10A6	00-FF
+  10A7	00-FF
+  10A8	00-FF
+  10A9	00-FF
+  10AA	00-FF
+  10AB	00-FF
+  10AC	00-FF
+  10AD	00-FF
+  10AE	00-FF
+  10AF	00-FF
+  10B0	00-FF
+  10B1	00-FF
+  10B2	00-FF
+  10B3	00-FF
+  10B4	00-FF
+  10B5	00-FF
+  10B6	00-FF
+  10B7	00-FF
+  10B8	00-FF
+  10B9	00-FF
+  10BA	00-FF
+  10BB	00-FF
+  10BC	00-FF
+  10BD	00-FF
+  10BE	00-FF
+  10BF	00-FF
+  10C0	00-FF
+  10C1	00-FF
+  10C2	00-FF
+  10C3	00-FF
+  10C4	00-FF
+  10C5	00-FF
+  10C6	00-FF
+  10C7	00-FF
+  10C8	00-FF
+  10C9	00-FF
+  10CA	00-FF
+  10CB	00-FF
+  10CC	00-FF
+  10CD	00-FF
+  10CE	00-FF
+  10CF	00-FF
+  10D0	00-FF
+  10D1	00-FF
+  10D2	00-FF
+  10D3	00-FF
+  10D4	00-FF
+  10D5	00-FF
+  10D6	00-FF
+  10D7	00-FF
+  10D8	00-FF
+  10D9	00-FF
+  10DA	00-FF
+  10DB	00-FF
+  10DC	00-FF
+  10DD	00-FF
+  10DE	00-FF
+  10DF	00-FF
+  10E0	00-FF
+  10E1	00-FF
+  10E2	00-FF
+  10E3	00-FF
+  10E4	00-FF
+  10E5	00-FF
+  10E6	00-FF
+  10E7	00-FF
+  10E8	00-FF
+  10E9	00-FF
+  10EA	00-FF
+  10EB	00-FF
+  10EC	00-FF
+  10ED	00-FF
+  10EE	00-FF
+  10EF	00-FF
+  10F0	00-FF
+  10F1	00-FF
+  10F2	00-FF
+  10F3	00-FF
+  10F4	00-FF
+  10F5	00-FF
+  10F6	00-FF
+  10F7	00-FF
+  10F8	00-FF
+  10F9	00-FF
+  10FA	00-FF
+  10FB	00-FF
+  10FC	00-FF
+  10FD	00-FF
+  10FE	00-FF
+  10FF	00-FD
+
diff --git a/newlib/libc/string/ambiguous.t b/newlib/libc/string/ambiguous.t
new file mode 100644
index 0000000..f8b7842
--- /dev/null
+++ b/newlib/libc/string/ambiguous.t
@@ -0,0 +1,61 @@
+{
+  { 0x00A1, 0x00A1 }, { 0x00A4, 0x00A4 }, { 0x00A7, 0x00A8 },
+  { 0x00AA, 0x00AA }, { 0x00AE, 0x00AE }, { 0x00B0, 0x00B4 },
+  { 0x00B6, 0x00BA }, { 0x00BC, 0x00BF }, { 0x00C6, 0x00C6 },
+  { 0x00D0, 0x00D0 }, { 0x00D7, 0x00D8 }, { 0x00DE, 0x00E1 },
+  { 0x00E6, 0x00E6 }, { 0x00E8, 0x00EA }, { 0x00EC, 0x00ED },
+  { 0x00F0, 0x00F0 }, { 0x00F2, 0x00F3 }, { 0x00F7, 0x00FA },
+  { 0x00FC, 0x00FC }, { 0x00FE, 0x00FE }, { 0x0101, 0x0101 },
+  { 0x0111, 0x0111 }, { 0x0113, 0x0113 }, { 0x011B, 0x011B },
+  { 0x0126, 0x0127 }, { 0x012B, 0x012B }, { 0x0131, 0x0133 },
+  { 0x0138, 0x0138 }, { 0x013F, 0x0142 }, { 0x0144, 0x0144 },
+  { 0x0148, 0x014B }, { 0x014D, 0x014D }, { 0x0152, 0x0153 },
+  { 0x0166, 0x0167 }, { 0x016B, 0x016B }, { 0x01CE, 0x01CE },
+  { 0x01D0, 0x01D0 }, { 0x01D2, 0x01D2 }, { 0x01D4, 0x01D4 },
+  { 0x01D6, 0x01D6 }, { 0x01D8, 0x01D8 }, { 0x01DA, 0x01DA },
+  { 0x01DC, 0x01DC }, { 0x0251, 0x0251 }, { 0x0261, 0x0261 },
+  { 0x02C4, 0x02C4 }, { 0x02C7, 0x02C7 }, { 0x02C9, 0x02CB },
+  { 0x02CD, 0x02CD }, { 0x02D0, 0x02D0 }, { 0x02D8, 0x02DB },
+  { 0x02DD, 0x02DD }, { 0x02DF, 0x02DF }, { 0x0391, 0x03A1 },
+  { 0x03A3, 0x03A9 }, { 0x03B1, 0x03C1 }, { 0x03C3, 0x03C9 },
+  { 0x0401, 0x0401 }, { 0x0410, 0x044F }, { 0x0451, 0x0451 },
+  { 0x2010, 0x2010 }, { 0x2013, 0x2016 }, { 0x2018, 0x2019 },
+  { 0x201C, 0x201D }, { 0x2020, 0x2022 }, { 0x2024, 0x2027 },
+  { 0x2030, 0x2030 }, { 0x2032, 0x2033 }, { 0x2035, 0x2035 },
+  { 0x203B, 0x203B }, { 0x203E, 0x203E }, { 0x2074, 0x2074 },
+  { 0x207F, 0x207F }, { 0x2081, 0x2084 }, { 0x20AC, 0x20AC },
+  { 0x2103, 0x2103 }, { 0x2105, 0x2105 }, { 0x2109, 0x2109 },
+  { 0x2113, 0x2113 }, { 0x2116, 0x2116 }, { 0x2121, 0x2122 },
+  { 0x2126, 0x2126 }, { 0x212B, 0x212B }, { 0x2153, 0x2154 },
+  { 0x215B, 0x215E }, { 0x2160, 0x216B }, { 0x2170, 0x2179 },
+  { 0x2189, 0x2189 }, { 0x2190, 0x2199 }, { 0x21B8, 0x21B9 },
+  { 0x21D2, 0x21D2 }, { 0x21D4, 0x21D4 }, { 0x21E7, 0x21E7 },
+  { 0x2200, 0x2200 }, { 0x2202, 0x2203 }, { 0x2207, 0x2208 },
+  { 0x220B, 0x220B }, { 0x220F, 0x220F }, { 0x2211, 0x2211 },
+  { 0x2215, 0x2215 }, { 0x221A, 0x221A }, { 0x221D, 0x2220 },
+  { 0x2223, 0x2223 }, { 0x2225, 0x2225 }, { 0x2227, 0x222C },
+  { 0x222E, 0x222E }, { 0x2234, 0x2237 }, { 0x223C, 0x223D },
+  { 0x2248, 0x2248 }, { 0x224C, 0x224C }, { 0x2252, 0x2252 },
+  { 0x2260, 0x2261 }, { 0x2264, 0x2267 }, { 0x226A, 0x226B },
+  { 0x226E, 0x226F }, { 0x2282, 0x2283 }, { 0x2286, 0x2287 },
+  { 0x2295, 0x2295 }, { 0x2299, 0x2299 }, { 0x22A5, 0x22A5 },
+  { 0x22BF, 0x22BF }, { 0x2312, 0x2312 }, { 0x2460, 0x24E9 },
+  { 0x24EB, 0x254B }, { 0x2550, 0x2573 }, { 0x2580, 0x258F },
+  { 0x2592, 0x2595 }, { 0x25A0, 0x25A1 }, { 0x25A3, 0x25A9 },
+  { 0x25B2, 0x25B3 }, { 0x25B6, 0x25B7 }, { 0x25BC, 0x25BD },
+  { 0x25C0, 0x25C1 }, { 0x25C6, 0x25C8 }, { 0x25CB, 0x25CB },
+  { 0x25CE, 0x25D1 }, { 0x25E2, 0x25E5 }, { 0x25EF, 0x25EF },
+  { 0x2605, 0x2606 }, { 0x2609, 0x2609 }, { 0x260E, 0x260F },
+  { 0x261C, 0x261C }, { 0x261E, 0x261E }, { 0x2640, 0x2640 },
+  { 0x2642, 0x2642 }, { 0x2660, 0x2661 }, { 0x2663, 0x2665 },
+  { 0x2667, 0x266A }, { 0x266C, 0x266D }, { 0x266F, 0x266F },
+  { 0x269E, 0x269F }, { 0x26BF, 0x26BF }, { 0x26C6, 0x26CD },
+  { 0x26CF, 0x26D3 }, { 0x26D5, 0x26E1 }, { 0x26E3, 0x26E3 },
+  { 0x26E8, 0x26E9 }, { 0x26EB, 0x26F1 }, { 0x26F4, 0x26F4 },
+  { 0x26F6, 0x26F9 }, { 0x26FB, 0x26FC }, { 0x26FE, 0x26FF },
+  { 0x273D, 0x273D }, { 0x2776, 0x277F }, { 0x2B56, 0x2B59 },
+  { 0x3248, 0x324F }, { 0xE000, 0xF8FF }, { 0xFFFD, 0xFFFD },
+  { 0x1F100, 0x1F10A }, { 0x1F110, 0x1F12D }, { 0x1F130, 0x1F169 },
+  { 0x1F170, 0x1F18D }, { 0x1F18F, 0x1F190 }, { 0x1F19B, 0x1F1AC },
+  { 0xF0000, 0xFFFFD }, { 0x100000, 0x10FFFD }
+};
diff --git a/newlib/libc/string/combining.t b/newlib/libc/string/combining.t
new file mode 100644
index 0000000..629d8f8
--- /dev/null
+++ b/newlib/libc/string/combining.t
@@ -0,0 +1,107 @@
+{
+  { 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
+  { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
+  { 0x05C7, 0x05C7 }, { 0x0600, 0x0605 }, { 0x0610, 0x061A },
+  { 0x061C, 0x061C }, { 0x064B, 0x065F }, { 0x0670, 0x0670 },
+  { 0x06D6, 0x06DD }, { 0x06DF, 0x06E4 }, { 0x06E7, 0x06E8 },
+  { 0x06EA, 0x06ED }, { 0x070F, 0x070F }, { 0x0711, 0x0711 },
+  { 0x0730, 0x074A }, { 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 },
+  { 0x0816, 0x0819 }, { 0x081B, 0x0823 }, { 0x0825, 0x0827 },
+  { 0x0829, 0x082D }, { 0x0859, 0x085B }, { 0x08D4, 0x0902 },
+  { 0x093A, 0x093A }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
+  { 0x094D, 0x094D }, { 0x0951, 0x0957 }, { 0x0962, 0x0963 },
+  { 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
+  { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
+  { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
+  { 0x0A4B, 0x0A4D }, { 0x0A51, 0x0A51 }, { 0x0A70, 0x0A71 },
+  { 0x0A75, 0x0A75 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
+  { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
+  { 0x0AE2, 0x0AE3 }, { 0x0AFA, 0x0AFF }, { 0x0B01, 0x0B01 },
+  { 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B44 },
+  { 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B62, 0x0B63 },
+  { 0x0B82, 0x0B82 }, { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD },
+  { 0x0C00, 0x0C00 }, { 0x0C3E, 0x0C40 }, { 0x0C46, 0x0C48 },
+  { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 }, { 0x0C62, 0x0C63 },
+  { 0x0C81, 0x0C81 }, { 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF },
+  { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD }, { 0x0CE2, 0x0CE3 },
+  { 0x0D00, 0x0D01 }, { 0x0D3B, 0x0D3C }, { 0x0D41, 0x0D44 },
+  { 0x0D4D, 0x0D4D }, { 0x0D62, 0x0D63 }, { 0x0DCA, 0x0DCA },
+  { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 }, { 0x0E31, 0x0E31 },
+  { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E }, { 0x0EB1, 0x0EB1 },
+  { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC }, { 0x0EC8, 0x0ECD },
+  { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 }, { 0x0F37, 0x0F37 },
+  { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E }, { 0x0F80, 0x0F84 },
+  { 0x0F86, 0x0F87 }, { 0x0F8D, 0x0F97 }, { 0x0F99, 0x0FBC },
+  { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 }, { 0x1032, 0x1037 },
+  { 0x1039, 0x103A }, { 0x103D, 0x103E }, { 0x1058, 0x1059 },
+  { 0x105E, 0x1060 }, { 0x1071, 0x1074 }, { 0x1082, 0x1082 },
+  { 0x1085, 0x1086 }, { 0x108D, 0x108D }, { 0x109D, 0x109D },
+  { 0x1160, 0x11FF }, { 0x135D, 0x135F }, { 0x1712, 0x1714 },
+  { 0x1732, 0x1734 }, { 0x1752, 0x1753 }, { 0x1772, 0x1773 },
+  { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+  { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD }, { 0x180B, 0x180E },
+  { 0x1885, 0x1886 }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
+  { 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
+  { 0x1A17, 0x1A18 }, { 0x1A1B, 0x1A1B }, { 0x1A56, 0x1A56 },
+  { 0x1A58, 0x1A5E }, { 0x1A60, 0x1A60 }, { 0x1A62, 0x1A62 },
+  { 0x1A65, 0x1A6C }, { 0x1A73, 0x1A7C }, { 0x1A7F, 0x1A7F },
+  { 0x1AB0, 0x1ABE }, { 0x1B00, 0x1B03 }, { 0x1B34, 0x1B34 },
+  { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C }, { 0x1B42, 0x1B42 },
+  { 0x1B6B, 0x1B73 }, { 0x1B80, 0x1B81 }, { 0x1BA2, 0x1BA5 },
+  { 0x1BA8, 0x1BA9 }, { 0x1BAB, 0x1BAD }, { 0x1BE6, 0x1BE6 },
+  { 0x1BE8, 0x1BE9 }, { 0x1BED, 0x1BED }, { 0x1BEF, 0x1BF1 },
+  { 0x1C2C, 0x1C33 }, { 0x1C36, 0x1C37 }, { 0x1CD0, 0x1CD2 },
+  { 0x1CD4, 0x1CE0 }, { 0x1CE2, 0x1CE8 }, { 0x1CED, 0x1CED },
+  { 0x1CF4, 0x1CF4 }, { 0x1CF8, 0x1CF9 }, { 0x1DC0, 0x1DF9 },
+  { 0x1DFB, 0x1DFF }, { 0x200B, 0x200F }, { 0x202A, 0x202E },
+  { 0x2060, 0x2064 }, { 0x2066, 0x206F }, { 0x20D0, 0x20F0 },
+  { 0x2CEF, 0x2CF1 }, { 0x2D7F, 0x2D7F }, { 0x2DE0, 0x2DFF },
+  { 0x302A, 0x302D }, { 0x3099, 0x309A }, { 0xA66F, 0xA672 },
+  { 0xA674, 0xA67D }, { 0xA69E, 0xA69F }, { 0xA6F0, 0xA6F1 },
+  { 0xA802, 0xA802 }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
+  { 0xA825, 0xA826 }, { 0xA8C4, 0xA8C5 }, { 0xA8E0, 0xA8F1 },
+  { 0xA926, 0xA92D }, { 0xA947, 0xA951 }, { 0xA980, 0xA982 },
+  { 0xA9B3, 0xA9B3 }, { 0xA9B6, 0xA9B9 }, { 0xA9BC, 0xA9BC },
+  { 0xA9E5, 0xA9E5 }, { 0xAA29, 0xAA2E }, { 0xAA31, 0xAA32 },
+  { 0xAA35, 0xAA36 }, { 0xAA43, 0xAA43 }, { 0xAA4C, 0xAA4C },
+  { 0xAA7C, 0xAA7C }, { 0xAAB0, 0xAAB0 }, { 0xAAB2, 0xAAB4 },
+  { 0xAAB7, 0xAAB8 }, { 0xAABE, 0xAABF }, { 0xAAC1, 0xAAC1 },
+  { 0xAAEC, 0xAAED }, { 0xAAF6, 0xAAF6 }, { 0xABE5, 0xABE5 },
+  { 0xABE8, 0xABE8 }, { 0xABED, 0xABED }, { 0xD7B0, 0xD7C6 },
+  { 0xD7CB, 0xD7FB }, { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F },
+  { 0xFE20, 0xFE2F }, { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB },
+  { 0x101FD, 0x101FD }, { 0x102E0, 0x102E0 }, { 0x10376, 0x1037A },
+  { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
+  { 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x10AE5, 0x10AE6 },
+  { 0x11001, 0x11001 }, { 0x11038, 0x11046 }, { 0x1107F, 0x11081 },
+  { 0x110B3, 0x110B6 }, { 0x110B9, 0x110BA }, { 0x110BD, 0x110BD },
+  { 0x11100, 0x11102 }, { 0x11127, 0x1112B }, { 0x1112D, 0x11134 },
+  { 0x11173, 0x11173 }, { 0x11180, 0x11181 }, { 0x111B6, 0x111BE },
+  { 0x111CA, 0x111CC }, { 0x1122F, 0x11231 }, { 0x11234, 0x11234 },
+  { 0x11236, 0x11237 }, { 0x1123E, 0x1123E }, { 0x112DF, 0x112DF },
+  { 0x112E3, 0x112EA }, { 0x11300, 0x11301 }, { 0x1133C, 0x1133C },
+  { 0x11340, 0x11340 }, { 0x11366, 0x1136C }, { 0x11370, 0x11374 },
+  { 0x11438, 0x1143F }, { 0x11442, 0x11444 }, { 0x11446, 0x11446 },
+  { 0x114B3, 0x114B8 }, { 0x114BA, 0x114BA }, { 0x114BF, 0x114C0 },
+  { 0x114C2, 0x114C3 }, { 0x115B2, 0x115B5 }, { 0x115BC, 0x115BD },
+  { 0x115BF, 0x115C0 }, { 0x115DC, 0x115DD }, { 0x11633, 0x1163A },
+  { 0x1163D, 0x1163D }, { 0x1163F, 0x11640 }, { 0x116AB, 0x116AB },
+  { 0x116AD, 0x116AD }, { 0x116B0, 0x116B5 }, { 0x116B7, 0x116B7 },
+  { 0x1171D, 0x1171F }, { 0x11722, 0x11725 }, { 0x11727, 0x1172B },
+  { 0x11A01, 0x11A06 }, { 0x11A09, 0x11A0A }, { 0x11A33, 0x11A38 },
+  { 0x11A3B, 0x11A3E }, { 0x11A47, 0x11A47 }, { 0x11A51, 0x11A56 },
+  { 0x11A59, 0x11A5B }, { 0x11A8A, 0x11A96 }, { 0x11A98, 0x11A99 },
+  { 0x11C30, 0x11C36 }, { 0x11C38, 0x11C3D }, { 0x11C3F, 0x11C3F },
+  { 0x11C92, 0x11CA7 }, { 0x11CAA, 0x11CB0 }, { 0x11CB2, 0x11CB3 },
+  { 0x11CB5, 0x11CB6 }, { 0x11D31, 0x11D36 }, { 0x11D3A, 0x11D3A },
+  { 0x11D3C, 0x11D3D }, { 0x11D3F, 0x11D45 }, { 0x11D47, 0x11D47 },
+  { 0x16AF0, 0x16AF4 }, { 0x16B30, 0x16B36 }, { 0x16F8F, 0x16F92 },
+  { 0x1BC9D, 0x1BC9E }, { 0x1BCA0, 0x1BCA3 }, { 0x1D167, 0x1D169 },
+  { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
+  { 0x1D242, 0x1D244 }, { 0x1DA00, 0x1DA36 }, { 0x1DA3B, 0x1DA6C },
+  { 0x1DA75, 0x1DA75 }, { 0x1DA84, 0x1DA84 }, { 0x1DA9B, 0x1DA9F },
+  { 0x1DAA1, 0x1DAAF }, { 0x1E000, 0x1E006 }, { 0x1E008, 0x1E018 },
+  { 0x1E01B, 0x1E021 }, { 0x1E023, 0x1E024 }, { 0x1E026, 0x1E02A },
+  { 0x1E8D0, 0x1E8D6 }, { 0x1E944, 0x1E94A }, { 0xE0001, 0xE0001 },
+  { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
+};
diff --git a/newlib/libc/string/wide.t b/newlib/libc/string/wide.t
new file mode 100644
index 0000000..8d0e243
--- /dev/null
+++ b/newlib/libc/string/wide.t
@@ -0,0 +1,33 @@
+//# EastAsianWidth-10.0.0.txt
+//# Blocks-10.0.0.txt
+{
+  { 0x1100, 0x115F }, { 0x231A, 0x231B }, { 0x2329, 0x232A },
+  { 0x23E9, 0x23EC }, { 0x23F0, 0x23F0 }, { 0x23F3, 0x23F3 },
+  { 0x25FD, 0x25FE }, { 0x2614, 0x2615 }, { 0x2648, 0x2653 },
+  { 0x267F, 0x267F }, { 0x2693, 0x2693 }, { 0x26A1, 0x26A1 },
+  { 0x26AA, 0x26AB }, { 0x26BD, 0x26BE }, { 0x26C4, 0x26C5 },
+  { 0x26CE, 0x26CE }, { 0x26D4, 0x26D4 }, { 0x26EA, 0x26EA },
+  { 0x26F2, 0x26F3 }, { 0x26F5, 0x26F5 }, { 0x26FA, 0x26FA },
+  { 0x26FD, 0x26FD }, { 0x2705, 0x2705 }, { 0x270A, 0x270B },
+  { 0x2728, 0x2728 }, { 0x274C, 0x274C }, { 0x274E, 0x274E },
+  { 0x2753, 0x2755 }, { 0x2757, 0x2757 }, { 0x2795, 0x2797 },
+  { 0x27B0, 0x27B0 }, { 0x27BF, 0x27BF }, { 0x2B1B, 0x2B1C },
+  { 0x2B50, 0x2B50 }, { 0x2B55, 0x2B55 }, { 0x2E80, 0x303E },
+  { 0x3040, 0x321E }, { 0x3220, 0x3247 }, { 0x3250, 0x32FE },
+  { 0x3300, 0x4DBF }, { 0x4E00, 0xA4CF }, { 0xA960, 0xA97F },
+  { 0xAC00, 0xD7AF }, { 0xF900, 0xFAFF }, { 0xFE10, 0xFE1F },
+  { 0xFE30, 0xFE6F }, { 0xFF01, 0xFF60 }, { 0xFFE0, 0xFFE6 },
+  { 0x16FE0, 0x18AFF }, { 0x1B000, 0x1B12F }, { 0x1B170, 0x1B2FF },
+  { 0x1F004, 0x1F004 }, { 0x1F0CF, 0x1F0CF }, { 0x1F18E, 0x1F18E },
+  { 0x1F191, 0x1F19A }, { 0x1F200, 0x1F320 }, { 0x1F32D, 0x1F335 },
+  { 0x1F337, 0x1F37C }, { 0x1F37E, 0x1F393 }, { 0x1F3A0, 0x1F3CA },
+  { 0x1F3CF, 0x1F3D3 }, { 0x1F3E0, 0x1F3F0 }, { 0x1F3F4, 0x1F3F4 },
+  { 0x1F3F8, 0x1F43E }, { 0x1F440, 0x1F440 }, { 0x1F442, 0x1F4FC },
+  { 0x1F4FF, 0x1F53D }, { 0x1F54B, 0x1F54E }, { 0x1F550, 0x1F567 },
+  { 0x1F57A, 0x1F57A }, { 0x1F595, 0x1F596 }, { 0x1F5A4, 0x1F5A4 },
+  { 0x1F5FB, 0x1F64F }, { 0x1F680, 0x1F6C5 }, { 0x1F6CC, 0x1F6CC },
+  { 0x1F6D0, 0x1F6D2 }, { 0x1F6EB, 0x1F6EC }, { 0x1F6F4, 0x1F6F8 },
+  { 0x1F910, 0x1F93E }, { 0x1F940, 0x1F94C }, { 0x1F950, 0x1F96B },
+  { 0x1F980, 0x1F997 }, { 0x1F9C0, 0x1F9C0 }, { 0x1F9D0, 0x1F9E6 },
+  { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
+};
-- 
2.13.2
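
A note on the WIDTH-A layout used in this patch: each data line appears to give a row (the code point without its low byte) followed by cell positions or cell ranges within that row, all in hex. Below is a minimal sketch of how such a line expands into code point ranges, assuming that reading of the "Rows / Positions (Cells)" header. The names cp_range and expand_width_a_line are hypothetical and are not part of the patch or of uniset.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper type for illustration; not part of the patch. */
struct cp_range { unsigned long first, last; };

/* Parse one WIDTH-A data line of the form
     <row> <cell>[-<cell>] <cell>[-<cell>] ...
   (all values in hex) and store up to max ranges in out.
   Comment lines and blank lines yield 0 ranges.
   Returns the number of ranges stored. */
size_t
expand_width_a_line (const char *line, struct cp_range *out, size_t max)
{
  char *end;
  unsigned long row = strtoul (line, &end, 16);   /* e.g. 0x1F1 */
  size_t n = 0;

  if (end == line)              /* no hex row number on this line */
    return 0;

  while (n < max)
    {
      const char *p = end;
      unsigned long lo = strtoul (p, &end, 16);
      unsigned long hi = lo;

      if (end == p)             /* no more cells on this line */
        break;
      if (*end == '-')          /* "lo-hi" cell range */
        {
          p = end + 1;
          hi = strtoul (p, &end, 16);
          if (end == p)         /* malformed range: stop parsing */
            break;
        }
      /* the row supplies the high bits, the cell the low byte */
      out[n].first = (row << 8) | lo;
      out[n].last = (row << 8) | hi;
      n++;
    }
  return n;
}
```

For the line "1F1 00-0A 10-2D ..." the first range comes out as 0x1F100-0x1F10A, which matches the corresponding entry in ambiguous.t. The generated table is not a plain expansion, though: the uniset call quoted in wcwidth.c subtracts categories Me, Mn and Cf, which is presumably why WIDTH-A lists 00AD-00AE while ambiguous.t contains only 0x00AE.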

-------------- next part --------------
From 5d73691295b0013d78c1ce7c7ab0b0be0549d754 Mon Sep 17 00:00:00 2001
From: mintty <mintty@users.noreply.github.com>
Date: Mon, 14 Aug 2017 22:01:01 +0200
Subject: [PATCH 3/4] use generated width data

---
 newlib/libc/string/wcwidth.c | 146 +++++++------------------------------------
 1 file changed, 22 insertions(+), 124 deletions(-)

diff --git a/newlib/libc/string/wcwidth.c b/newlib/libc/string/wcwidth.c
index ac5c47f..73c036a 100644
--- a/newlib/libc/string/wcwidth.c
+++ b/newlib/libc/string/wcwidth.c
@@ -7,18 +7,18 @@ INDEX
 
 ANSI_SYNOPSIS
 	#include <wchar.h>
-	int wcwidth(const wchar_t <[wc]>);
+	int wcwidth(const wint_t <[wc]>);
 
 TRAD_SYNOPSIS
 	#include <wchar.h>
 	int wcwidth(<[wc]>)
-	wchar_t *<[wc]>;
+	wint_t *<[wc]>;
 
 DESCRIPTION
 	The <<wcwidth>> function shall determine the number of column
 	positions required for the wide character <[wc]>. The application
 	shall ensure that the value of <[wc]> is a character representable
-	as a wchar_t, and is a wide-character code corresponding to a
+	as a wint_t, and is a wide-character code corresponding to a
 	valid character in the current locale.
 
 RETURNS
@@ -174,112 +174,18 @@ _DEFUN (__wcwidth, (ucs),
 #ifdef _MB_CAPABLE
   /* sorted list of non-overlapping intervals of East Asian Ambiguous
    * characters, generated by "uniset +WIDTH-A -cat=Me -cat=Mn -cat=Cf c" */
-  static const struct interval ambiguous[] = {
-    { 0x00A1, 0x00A1 }, { 0x00A4, 0x00A4 }, { 0x00A7, 0x00A8 },
-    { 0x00AA, 0x00AA }, { 0x00AE, 0x00AE }, { 0x00B0, 0x00B4 },
-    { 0x00B6, 0x00BA }, { 0x00BC, 0x00BF }, { 0x00C6, 0x00C6 },
-    { 0x00D0, 0x00D0 }, { 0x00D7, 0x00D8 }, { 0x00DE, 0x00E1 },
-    { 0x00E6, 0x00E6 }, { 0x00E8, 0x00EA }, { 0x00EC, 0x00ED },
-    { 0x00F0, 0x00F0 }, { 0x00F2, 0x00F3 }, { 0x00F7, 0x00FA },
-    { 0x00FC, 0x00FC }, { 0x00FE, 0x00FE }, { 0x0101, 0x0101 },
-    { 0x0111, 0x0111 }, { 0x0113, 0x0113 }, { 0x011B, 0x011B },
-    { 0x0126, 0x0127 }, { 0x012B, 0x012B }, { 0x0131, 0x0133 },
-    { 0x0138, 0x0138 }, { 0x013F, 0x0142 }, { 0x0144, 0x0144 },
-    { 0x0148, 0x014B }, { 0x014D, 0x014D }, { 0x0152, 0x0153 },
-    { 0x0166, 0x0167 }, { 0x016B, 0x016B }, { 0x01CE, 0x01CE },
-    { 0x01D0, 0x01D0 }, { 0x01D2, 0x01D2 }, { 0x01D4, 0x01D4 },
-    { 0x01D6, 0x01D6 }, { 0x01D8, 0x01D8 }, { 0x01DA, 0x01DA },
-    { 0x01DC, 0x01DC }, { 0x0251, 0x0251 }, { 0x0261, 0x0261 },
-    { 0x02C4, 0x02C4 }, { 0x02C7, 0x02C7 }, { 0x02C9, 0x02CB },
-    { 0x02CD, 0x02CD }, { 0x02D0, 0x02D0 }, { 0x02D8, 0x02DB },
-    { 0x02DD, 0x02DD }, { 0x02DF, 0x02DF }, { 0x0391, 0x03A1 },
-    { 0x03A3, 0x03A9 }, { 0x03B1, 0x03C1 }, { 0x03C3, 0x03C9 },
-    { 0x0401, 0x0401 }, { 0x0410, 0x044F }, { 0x0451, 0x0451 },
-    { 0x2010, 0x2010 }, { 0x2013, 0x2016 }, { 0x2018, 0x2019 },
-    { 0x201C, 0x201D }, { 0x2020, 0x2022 }, { 0x2024, 0x2027 },
-    { 0x2030, 0x2030 }, { 0x2032, 0x2033 }, { 0x2035, 0x2035 },
-    { 0x203B, 0x203B }, { 0x203E, 0x203E }, { 0x2074, 0x2074 },
-    { 0x207F, 0x207F }, { 0x2081, 0x2084 }, { 0x20AC, 0x20AC },
-    { 0x2103, 0x2103 }, { 0x2105, 0x2105 }, { 0x2109, 0x2109 },
-    { 0x2113, 0x2113 }, { 0x2116, 0x2116 }, { 0x2121, 0x2122 },
-    { 0x2126, 0x2126 }, { 0x212B, 0x212B }, { 0x2153, 0x2154 },
-    { 0x215B, 0x215E }, { 0x2160, 0x216B }, { 0x2170, 0x2179 },
-    { 0x2190, 0x2199 }, { 0x21B8, 0x21B9 }, { 0x21D2, 0x21D2 },
-    { 0x21D4, 0x21D4 }, { 0x21E7, 0x21E7 }, { 0x2200, 0x2200 },
-    { 0x2202, 0x2203 }, { 0x2207, 0x2208 }, { 0x220B, 0x220B },
-    { 0x220F, 0x220F }, { 0x2211, 0x2211 }, { 0x2215, 0x2215 },
-    { 0x221A, 0x221A }, { 0x221D, 0x2220 }, { 0x2223, 0x2223 },
-    { 0x2225, 0x2225 }, { 0x2227, 0x222C }, { 0x222E, 0x222E },
-    { 0x2234, 0x2237 }, { 0x223C, 0x223D }, { 0x2248, 0x2248 },
-    { 0x224C, 0x224C }, { 0x2252, 0x2252 }, { 0x2260, 0x2261 },
-    { 0x2264, 0x2267 }, { 0x226A, 0x226B }, { 0x226E, 0x226F },
-    { 0x2282, 0x2283 }, { 0x2286, 0x2287 }, { 0x2295, 0x2295 },
-    { 0x2299, 0x2299 }, { 0x22A5, 0x22A5 }, { 0x22BF, 0x22BF },
-    { 0x2312, 0x2312 }, { 0x2460, 0x24E9 }, { 0x24EB, 0x254B },
-    { 0x2550, 0x2573 }, { 0x2580, 0x258F }, { 0x2592, 0x2595 },
-    { 0x25A0, 0x25A1 }, { 0x25A3, 0x25A9 }, { 0x25B2, 0x25B3 },
-    { 0x25B6, 0x25B7 }, { 0x25BC, 0x25BD }, { 0x25C0, 0x25C1 },
-    { 0x25C6, 0x25C8 }, { 0x25CB, 0x25CB }, { 0x25CE, 0x25D1 },
-    { 0x25E2, 0x25E5 }, { 0x25EF, 0x25EF }, { 0x2605, 0x2606 },
-    { 0x2609, 0x2609 }, { 0x260E, 0x260F }, { 0x2614, 0x2615 },
-    { 0x261C, 0x261C }, { 0x261E, 0x261E }, { 0x2640, 0x2640 },
-    { 0x2642, 0x2642 }, { 0x2660, 0x2661 }, { 0x2663, 0x2665 },
-    { 0x2667, 0x266A }, { 0x266C, 0x266D }, { 0x266F, 0x266F },
-    { 0x273D, 0x273D }, { 0x2776, 0x277F }, { 0xE000, 0xF8FF },
-    { 0xFFFD, 0xFFFD }, { 0xF0000, 0xFFFFD }, { 0x100000, 0x10FFFD }
-  };
+  static const struct interval ambiguous[] =
+#include "ambiguous.t"
+
   /* sorted list of non-overlapping intervals of non-spacing characters */
-  /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
-  static const struct interval combining[] = {
-    { 0x0300, 0x036F }, { 0x0483, 0x0486 }, { 0x0488, 0x0489 },
-    { 0x0591, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
-    { 0x05C4, 0x05C5 }, { 0x05C7, 0x05C7 }, { 0x0600, 0x0603 },
-    { 0x0610, 0x0615 }, { 0x064B, 0x065E }, { 0x0670, 0x0670 },
-    { 0x06D6, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
-    { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
-    { 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 }, { 0x0901, 0x0902 },
-    { 0x093C, 0x093C }, { 0x0941, 0x0948 }, { 0x094D, 0x094D },
-    { 0x0951, 0x0954 }, { 0x0962, 0x0963 }, { 0x0981, 0x0981 },
-    { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 }, { 0x09CD, 0x09CD },
-    { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 }, { 0x0A3C, 0x0A3C },
-    { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 }, { 0x0A4B, 0x0A4D },
-    { 0x0A70, 0x0A71 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
-    { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
-    { 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 }, { 0x0B3C, 0x0B3C },
-    { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 }, { 0x0B4D, 0x0B4D },
-    { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 }, { 0x0BC0, 0x0BC0 },
-    { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 }, { 0x0C46, 0x0C48 },
-    { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 }, { 0x0CBC, 0x0CBC },
-    { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD },
-    { 0x0CE2, 0x0CE3 }, { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D },
-    { 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
-    { 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
-    { 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
-    { 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
-    { 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
-    { 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 },
-    { 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
-    { 0x1032, 0x1032 }, { 0x1036, 0x1037 }, { 0x1039, 0x1039 },
-    { 0x1058, 0x1059 }, { 0x1160, 0x11FF }, { 0x135F, 0x135F },
-    { 0x1712, 0x1714 }, { 0x1732, 0x1734 }, { 0x1752, 0x1753 },
-    { 0x1772, 0x1773 }, { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD },
-    { 0x17C6, 0x17C6 }, { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD },
-    { 0x180B, 0x180D }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
-    { 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
-    { 0x1A17, 0x1A18 }, { 0x1B00, 0x1B03 }, { 0x1B34, 0x1B34 },
-    { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C }, { 0x1B42, 0x1B42 },
-    { 0x1B6B, 0x1B73 }, { 0x1DC0, 0x1DCA }, { 0x1DFE, 0x1DFF },
-    { 0x200B, 0x200F }, { 0x202A, 0x202E }, { 0x2060, 0x2063 },
-    { 0x206A, 0x206F }, { 0x20D0, 0x20EF }, { 0x302A, 0x302F },
-    { 0x3099, 0x309A }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
-    { 0xA825, 0xA826 }, { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F },
-    { 0xFE20, 0xFE23 }, { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB },
-    { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
-    { 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x1D167, 0x1D169 },
-    { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
-    { 0x1D242, 0x1D244 }, { 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F },
-    { 0xE0100, 0xE01EF }
-  };
+  static const struct interval combining[] =
+#include "combining.t"
+
+  /* sorted list of non-overlapping intervals of wide characters,
+     ranges extended to Blocks where possible
+   */
+  static const struct interval wide[] =
+#include "wide.t"
 
   /* Test for NUL character */
   if (ucs == 0)
@@ -310,20 +216,12 @@ _DEFUN (__wcwidth, (ucs),
 
   /* if we arrive here, ucs is not a combining or C0/C1 control character */
 
-  return 1 + 
-    (ucs >= 0x1100 &&
-     (ucs <= 0x115f ||                    /* Hangul Jamo init. consonants */
-      ucs == 0x2329 || ucs == 0x232a ||
-      (ucs >= 0x2e80 && ucs <= 0xa4cf &&
-       ucs != 0x303f) ||                  /* CJK ... Yi */
-      (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
-      (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
-      (ucs >= 0xfe10 && ucs <= 0xfe19) || /* Vertical forms */
-      (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
-      (ucs >= 0xff00 && ucs <= 0xff60) || /* Fullwidth Forms */
-      (ucs >= 0xffe0 && ucs <= 0xffe6) ||
-      (ucs >= 0x20000 && ucs <= 0x2fffd) ||
-      (ucs >= 0x30000 && ucs <= 0x3fffd)));
+  /* binary search in table of wide character codes */
+  if (bisearch(ucs, wide,
+	       sizeof(wide) / sizeof(struct interval) - 1))
+    return 2;
+  else
+    return 1;
 #else /* !_MB_CAPABLE */
   if (iswprint (ucs))
     return 1;
@@ -333,9 +231,9 @@ _DEFUN (__wcwidth, (ucs),
 #endif /* _MB_CAPABLE */
 }
 
-int     
+int
 _DEFUN (wcwidth, (wc),
-	_CONST wchar_t wc)
+	_CONST wint_t wc)
 { 
   wint_t wi = wc;
 
-- 
2.13.2
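
To illustrate the table-driven lookup that replaces the hard-coded range test, here is a minimal, self-contained sketch. It assumes the usual binary search over sorted, non-overlapping intervals, where max is the index of the last entry (hence the "- 1" in the call above). All sketch_* names are hypothetical, and the tables are short excerpts copied from wide.t and combining.t, not the full generated data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; the patch itself uses
   struct interval and bisearch() in wcwidth.c. */
struct sketch_interval { uint32_t first, last; };

/* Excerpts from wide.t and combining.t (not the full tables). */
static const struct sketch_interval sketch_wide[] = {
  { 0x1100, 0x115F }, { 0x231A, 0x231B }, { 0x4E00, 0xA4CF },
  { 0xAC00, 0xD7AF }, { 0x1F5FB, 0x1F64F }, { 0x20000, 0x2FFFD }
};
static const struct sketch_interval sketch_combining[] = {
  { 0x0300, 0x036F }, { 0x200B, 0x200F }, { 0xFE00, 0xFE0F }
};

/* Index of the last element of a table, as expected by the search. */
#define SKETCH_LAST(t) ((int) (sizeof (t) / sizeof ((t)[0])) - 1)

/* Return 1 if ucs lies within one of the intervals table[0..max],
   which must be sorted and non-overlapping; otherwise return 0. */
int
sketch_bisearch (uint32_t ucs, const struct sketch_interval *table, int max)
{
  int min = 0;

  if (ucs < table[0].first || ucs > table[max].last)
    return 0;
  while (max >= min)
    {
      int mid = min + (max - min) / 2;
      if (ucs > table[mid].last)
        min = mid + 1;
      else if (ucs < table[mid].first)
        max = mid - 1;
      else
        return 1;
    }
  return 0;
}

/* Simplified width: 0 for NUL and combining characters, 2 for wide
   characters, 1 otherwise.  Control characters and the ambiguous
   table are deliberately left out of this sketch. */
int
sketch_width (uint32_t ucs)
{
  if (ucs == 0)
    return 0;
  if (sketch_bisearch (ucs, sketch_combining, SKETCH_LAST (sketch_combining)))
    return 0;
  if (sketch_bisearch (ucs, sketch_wide, SKETCH_LAST (sketch_wide)))
    return 2;
  return 1;
}
```

With these excerpts, sketch_width(0x4E2D) gives 2, sketch_width(0x0301) gives 0 and sketch_width('A') gives 1. Keeping the ranges in a generated table rather than in an if-expression means a Unicode update only has to regenerate the .t files.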


