Bug 14010 - Serious omissions in alphabetic character class
Status: RESOLVED DUPLICATE of bug 14094
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2012-04-23 01:37 UTC by Rich Felker
Modified: 2014-12-04 10:32 UTC
fweimer: security-


Description Rich Felker 2012-04-23 01:37:32 UTC
The localedata generation code defines is_alpha from the Unicode general categories L*, plus Nl, Nd, and a moderate number of special cases, most of which exist to fix Thai language support (is_alpha would otherwise return false for letters in category Mn). However, Thai is not the only language affected; any language that uses non-spacing letters is broken by glibc's deficient is_alpha definition. As a particular example, all of the Tibetan subjoined letters are considered non-alphabetic (and thus punctuation) by glibc.

Unicode addresses this issue by defining the Other_Alphabetic property in PropList.txt and the Alphabetic derived property in DerivedCoreProperties.txt, the latter of which consists of Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic. This subsumes all special-case hacks for Thai in glibc's gen-unicode-ctype.c and fixes the issue (at least approximately) for all other languages/scripts at the same time.

glibc's localedata should adopt the definition of Alphabetic from Unicode's 
DerivedCoreProperties.txt (and still add Nd and the special cases from So).
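
To make the difference concrete, here is a minimal Python sketch of the two definitions discussed above. It is not glibc's gen-unicode-ctype.c logic: the function names are hypothetical, the category-based variant omits glibc's special cases, and OTHER_ALPHABETIC_EXCERPT is a tiny hand-copied subset of Other_Alphabetic (the full list lives in PropList.txt). The So special cases are left out as well.

```python
import unicodedata

# ASSUMPTION: tiny hand-picked excerpt of Other_Alphabetic, for illustration
# only. The authoritative list is PropList.txt in the Unicode Character Database.
OTHER_ALPHABETIC_EXCERPT = [
    (0x0E31, 0x0E31),  # Thai non-spacing vowel sign (category Mn)
    (0x0F90, 0x0F97),  # Tibetan subjoined letters (category Mn)
    (0x0F99, 0x0FBC),  # Tibetan subjoined letters (category Mn)
]


def in_other_alphabetic(ch):
    """True if ch falls in the illustrative Other_Alphabetic excerpt."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in OTHER_ALPHABETIC_EXCERPT)


def category_based_is_alpha(ch):
    """Simplified category-only rule: L* plus Nl and Nd, no special cases."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Nl", "Nd")


def unicode_alphabetic(ch):
    """Unicode Alphabetic: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic (excerpt)."""
    cat = unicodedata.category(ch)
    return cat in ("Lu", "Ll", "Lt", "Lm", "Lo", "Nl") or in_other_alphabetic(ch)


def proposed_is_alpha(ch):
    """Proposal from this report: Unicode Alphabetic, still adding Nd."""
    return unicode_alphabetic(ch) or unicodedata.category(ch) == "Nd"
```

With these definitions, U+0F90 (a Tibetan subjoined letter, category Mn) is rejected by category_based_is_alpha but accepted by proposed_is_alpha, while ASCII letters and decimal digits are accepted by both.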
Comment 1 Rich Felker 2012-09-21 23:07:28 UTC
Ping. Has anybody looked at this?
Comment 2 joseph@codesourcery.com 2012-09-23 19:34:48 UTC
We know that there are over 500 open bugs and bugs are filed faster than 
they are fixed.  Constructive responses on libc-alpha to 
<http://sourceware.org/ml/libc-alpha/2012-08/msg00611.html> regarding how 
to get more people actively fixing more bugs would be more useful, towards 
the goal of getting down to maybe 100 bugs that are genuinely hard, than 
pinging individual bugs (unless the ping is for something like reminding 
someone to submit a patch or test whether a commit has fixed the bug for 
them - where there is clear in-progress work that may have been forgotten 
about).

There's plenty of room for an interested person to become glibc's 
character set expert and address this bug, bug 14094 and bug 14095 (only 
14095 is particularly likely to be hard) and probably other bugs as well.
Comment 3 Rich Felker 2013-10-25 12:49:05 UTC
Joseph, thanks for acknowledging this bug. Issue 14094 looks related (as in, both could be resolved at the same time, if desired), but 14095 is a completely separate matter and I don't think it's helpful to tie them together.
Comment 4 joseph@codesourcery.com 2013-10-25 15:19:28 UTC
On Fri, 25 Oct 2013, bugdal at aerifal dot cx wrote:

> Joseph, thanks for acknowledging this bug. Issue 14094 looks related (as in,
> both could be resolved at the same time, if desired), but 14095 is a completely
> separate matter and I don't think it's helpful to tie them together.

The connection is that they all (and bug 16061) need someone to act as 
glibc's character set / Unicode expert and do a proper analysis of the 
issues involved and the current state of this data in glibc.
Comment 5 Rich Felker 2013-10-25 15:37:25 UTC
On Fri, Oct 25, 2013 at 03:19:28PM +0000, joseph at codesourcery dot com wrote:
> The connection is that they all (and bug 16061) need someone to act as 
> glibc's character set / Unicode expert and do a proper analysis of the 
> issues involved and the current state of this data in glibc.

My view is that I don't think it requires a collation expert to handle
the fixing of the alphabetic class and/or updating the character class
data to latest Unicode. Collation is a much more specialized expertise
requirement.
Comment 6 Mike FABIAN 2014-12-04 10:32:59 UTC
(In reply to Rich Felker from comment #5)
> On Fri, Oct 25, 2013 at 03:19:28PM +0000, joseph at codesourcery dot com
> wrote:
> > The connection is that they all (and bug 16061) need someone to act as 
> > glibc's character set / Unicode expert and do a proper analysis of the 
> > issues involved and the current state of this data in glibc.
> 
> My view is that I don't think it requires a collation expert to handle
> the fixing of the alphabetic class and/or updating the character class
> data to latest Unicode. Collation is a much more specialized expertise
> requirement.

https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c33 and the following
comments address the problem with the alphabetic class and
updating the character classes to the latest Unicode.

So I think we can mark this bug here as a duplicate of bug#14094.

*** This bug has been marked as a duplicate of bug 14094 ***