This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/16061] Review / update transliteration data


https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #5 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Marko Myllynen from comment #4)
> (In reply to Mike FABIAN from comment #2)
> > (In reply to Marko Myllynen from comment #0)
> > 
> > C-translit.h.in seems to be manually edited and not generated from
> > Unicode data.
> 
> Based on earlier changelog comments it seems that C-translit.h.in was
> updated manually for Unicode 3.2.0, should it now be updated for Unicode
> 7.0.0 by some means?

Probably, but how?

> > is apparently manually edited and not generated.
> > 
> >     locales/translit_cjk_variants
> > 
> > is not generated from Unicode data either but from a UniVariants.Z
> > file which can still be found here:
> > 
> > http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
> > 
> > It is from 2002-08-15 and I have no idea how it has been created.
> > So I did not touch /translit_cjk_variants.
> 
> Perhaps we could add a note about its origins to the file.

There is already a note in the comment section of that file.

> Also, shouldn't à and à be handled in the same way?

What do you mean by “handled in the same way”?

> Looking at translit_neutral in more detail, I think it's actually the wrong
> place for letters, it should contain non-letters only and if specific rules
> are needed for letters like à or Ã, those should be added directly in locale
> files (so the patch discussed in bug 15593 should not have been applied to
> translit_neutral after all). This would also mean that the special rules in
> the generator for cases like EM DASH and EN DASH should probably end up in
> translit_neutral, not translit_combining.

My guess is that the purpose of translit_neutral is to contain
transliterations which are locale “neutral”, i.e. are the same for
all locales. So I see no reason not to include letters.

> > > but some characters (like U+00D6, Ö) have decomposition defined in
> > > Unicode but not in glibc.
> > 
> > glibc had this already in translit_combining:
> > 
> > (was already there, not added by my patch, it is generated from
> > UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> > combining character U+0308).
> 
> Yes, I think what I meant to say was that the decomposition to U+004F U+0308
> was missing but as you point out it is defined in some locales where it
> would be needed. Btw, I wonder should U+00D6 actually decompose to U+004F
> U+00A8 after U+004F U+0308 in those locales?

Ö -> O¨

Why? Is that a reasonable transliteration? It throws away less
information but I think it is common practice to transliterate Ö
just as O in English for example.
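
To make the two candidates concrete, here is a minimal sketch of the
decompose-then-strip step next to the alternative that keeps a spacing
diaeresis. It assumes Python's standard unicodedata module and is not the
actual glibc generator; the function names and the SPACING_FORM table are
made up for illustration.

```python
import unicodedata

# Hypothetical, hand-picked table: combining mark -> spacing counterpart.
SPACING_FORM = {"\u0308": "\u00a8"}  # COMBINING DIAERESIS -> DIAERESIS


def decompose(ch):
    """Return the canonical decomposition (NFD) of a character."""
    return unicodedata.normalize("NFD", ch)


def strip_combining(ch):
    """Decompose, then drop every combining character."""
    return "".join(c for c in decompose(ch) if unicodedata.combining(c) == 0)


def keep_spacing_mark(ch):
    """Decompose, then swap known combining marks for their spacing forms."""
    return "".join(SPACING_FORM.get(c, c) for c in decompose(ch))


def codepoints(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


if __name__ == "__main__":
    ch = "\u00d6"  # LATIN CAPITAL LETTER O WITH DIAERESIS
    print(codepoints(decompose(ch)))          # U+004F U+0308
    print(codepoints(strip_combining(ch)))    # U+004F
    print(codepoints(keep_spacing_mark(ch)))  # U+004F U+00A8
```

The first result is the plain decomposition, the second is the stripped
form already present in translit_combining, and the third is the
U+004F U+00A8 variant asked about above.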

-- 
You are receiving this mail because:
You are on the CC list for the bug.
