24314 – charmaps: Some of UTF-8 characters have invalid width

Status:	RESOLVED INVALID

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2019-03-08 11:24 UTC by Łukasz Stelmach
Modified:	2019-03-08 20:03 UTC
CC List:	3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Description Łukasz Stelmach 2019-03-08 11:24:40 UTC

Some characters are assigned an invalid width in the localedata/charmaps/UTF-8 file (lines below 47072). For example, \u2693 (ANCHOR) is described as a double-width character.

There is a procedure for deriving the data from the standard files, which says that double-width characters come from the EastAsianWidth.txt file.

  grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693

returns no results which means the line 47261 (as of commit c5f65462a2) which says

  <U2693> 2

is wrong. At least 28 other characters seem to be improperly classified as double-width too. Use the following command to find them.

  perl -ne 'next if (1..47080 or /\.\.\./);  print if (/2$/);' localedata/charmaps/UTF-8

None of these characters can be found in the output of 

  grep '^[^;]*;[WF]' EastAsianWidth.txt

Apparently the localedata/unicode-gen/utf8_gen.py script has failed to filter out these characters.
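
The perl one-liner above depends on a hard-coded line number (47080), which shifts whenever the charmap is regenerated. As a rough sketch only, the same extraction can key on the section markers instead. It assumes that the width entries sit between "WIDTH" and "END WIDTH" lines and look like "<U2693> 2" or "<UXXXX>...<UYYYY> 2". The function name and the sample lines are hypothetical and are not taken from glibc.

```python
import re
from typing import Iterable, Iterator

# Matches "<U2693> 2" and "<U1100>...<U115F> 2" (assumed WIDTH entry shapes).
_WIDTH_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]+)>(?:\.\.\.<U([0-9A-Fa-f]+)>)?\s+(\d+)\s*$"
)


def double_width_code_points(lines: Iterable[str]) -> Iterator[int]:
    """Yield every code point given width 2 between WIDTH and END WIDTH."""
    in_width = False
    for raw in lines:
        line = raw.strip()
        if line == "WIDTH":
            in_width = True
        elif line == "END WIDTH":
            in_width = False
        elif in_width:
            match = _WIDTH_ENTRY.match(line)
            if match and match.group(3) == "2":
                first = int(match.group(1), 16)
                # A single entry has no second code point; reuse the first.
                last = int(match.group(2) or match.group(1), 16)
                yield from range(first, last + 1)


# Hypothetical excerpt; only "<U2693> 2" is quoted from this report.
sample = [
    "CHARMAP",
    "<U2693>     /xe2/x9a/x93         ANCHOR",
    "END CHARMAP",
    "WIDTH",
    "<U0300>...<U036F> 0",
    "<U1100>...<U115F> 2",
    "<U2693> 2",
    "END WIDTH",
]
wide = list(double_width_code_points(sample))
print(len(wide), hex(wide[0]), hex(wide[-1]))  # 97 0x1100 0x2693
```

Unlike the one-liner, this sketch expands range entries rather than skipping them, so the result can be compared code point by code point against whatever EastAsianWidth.txt is being used as the reference.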

Comment 1 Egmont Koblinger 2019-03-08 13:56:15 UTC

(In reply to Łukasz Stelmach from comment #0)

>   grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693
> 
> returns no results which means the line 47261 (as of commit c5f65462a2)

This command _does_ print "2693;W" for me, as of the aforementioned commit, assuming the input file is glibc's localedata/unicode-gen/EastAsianWidth.txt (line 1210).

Note that the width of many codepoints, including this one, changed from narrow to wide with Unicode 9.0. Compare these two files:

ftp://ftp.unicode.org/Public/8.0.0/ucd/EastAsianWidth.txt ("2670..269D;N")
ftp://ftp.unicode.org/Public/9.0.0/ucd/EastAsianWidth.txt ("2693;W")

Any chance you worked from a Unicode 8 (or older) EastAsianWidth.txt, rather than the one in glibc's source?

(Also note that your grep command can easily miss matches, since the file defines ranges. It's not the case with U+2693 though.)
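
To illustrate the point about ranges: a textual grep for "2693" cannot match a line such as "2670..269D;N", even though U+2693 lies inside that range. A minimal range-aware lookup might look like the sketch below. It assumes the "first..last;class" line format with optional "#" comments, and the helper names are made up for illustration. The two sample lines are the ones quoted above from the Unicode 8.0 and 9.0 files.

```python
from typing import Iterable, List, Optional, Tuple

Entry = Tuple[int, int, str]  # (first code point, last code point, width class)


def parse_east_asian_width(lines: Iterable[str]) -> List[Entry]:
    """Parse EastAsianWidth.txt-style lines into (first, last, class) tuples."""
    table: List[Entry] = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue  # blank or comment-only line
        cp_field, cls = (field.strip() for field in data.split(";")[:2])
        first, _, last = cp_field.partition("..")
        # A single code point has no "..last" part; reuse the first value.
        table.append((int(first, 16), int(last or first, 16), cls))
    return table


def width_class(cp: int, table: List[Entry]) -> Optional[str]:
    """Return the width class covering cp, or None if no entry covers it."""
    for first, last, cls in table:
        if first <= cp <= last:
            return cls
    return None


unicode8 = parse_east_asian_width(["2670..269D;N"])
unicode9 = parse_east_asian_width(["2693;W"])
print(width_class(0x2693, unicode8))  # N
print(width_class(0x2693, unicode9))  # W
```

For a real check, the same parser can be fed an open file object for each EastAsianWidth.txt version, since it only needs an iterable of lines.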

Comment 2 Łukasz Stelmach 2019-03-08 19:03:49 UTC

TL;DR Indeed, I was working with an old data file.

As an excuse, I can only say that several fonts provide this character as normal rather than wide, which matched my observation of the outdated data file.

I guess this bug can be closed then. Thank you.

Comment 3 Florian Weimer 2019-03-08 20:03:48 UTC

Thanks, closing as requested.