[PATCH] Use Unicode code points for country_isbn

Wed Jul 22 20:04:00 GMT 2015

On Wed, 22 Jul 2015, Keld Simonsen wrote:

> > On the build system on which glibc is built, we can always assume that the 
> > glibc sources are the exact sequences of octets provided by the glibc 
> > project, not converted into another character set and without any 
> > conversions of line endings.  Furthermore, on any system using glibc and 
> > executing tools such as localedef with the installed locale source files, 
> > it can be assumed that those source files are the files shipped with 
> > glibc, not those files after conversion into another character set.  Use 
> > of glibc source files after conversion into another character set is 
> > outside the scope of the glibc project - glibc is not expected to build 
> > with such converted source files.
> 
> Sounds strange. glibc is the library for the GNU C language. Standard 

No, it's not.  It's the C library for the GNU system.  glibc has a range of 
requirements, including ELF, TLS, an MMU, two's complement integers, 
32-bit int, 32-bit or 64-bit long, 32-bit UTF-32 wchar_t, IEEE binary32 
float, IEEE binary64 double, various GNU tools present on the build system 
as documented in install.texi, ....

> ISO C is coded character set independent, as is also POSIX. Why would 
> the glibc project not follow ISO C and POSIX design goals? Why would 

Because glibc makes particular implementation choices in areas that are 
implementation-defined.  It's an implementation, not a meta-implementation 
that tries to cover the range of permitted implementation choices.  
Meta-implementations (at least of the language part of ISO C) exist, but 
they exist in the field of formal systems used to reason about C programs.

> glibc exclude itself from Apple and Microsoft (utf16) and non-utf8 Linux 
> and UNIX systems?

It's about 15-20 years since glibc was usable as a replacement C library 
for systems with an existing native non-free C library.  Those systems are 
not relevant to glibc nowadays (Apple and Microsoft systems fail the basic 
requirement of using ELF, which is assumed all over glibc).  UTF-16 is 
supported in iconv (only), just like EBCDIC.  Non-UTF-8 locales are 
supported, but deprecated (new non-UTF-8 locales should not be added, and 
any existing non-UTF-8 locales should have a UTF-8 counterpart), and to be 
usable in a POSIX-compliant way must have a character set that includes 
ASCII.
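
To make "a character set that includes ASCII" concrete, here is a rough 
sketch of the check.  It assumes Python codec names as stand-ins for 
charmaps; the includes_ascii helper is hypothetical and not part of 
glibc or localedef.

```python
def includes_ascii(codec: str) -> bool:
    """Return True if every ASCII character encodes to its own single byte."""
    for code in range(128):
        try:
            if chr(code).encode(codec) != bytes([code]):
                return False
        except UnicodeError:
            # The codec cannot represent this character at all.
            return False
    return True


# Python codec names here are illustrative stand-ins, not glibc charmap names.
for name in ("utf-8", "iso8859-1", "cp037", "utf-16"):
    print(name, includes_ascii(name))
```

Under this test UTF-8 and ISO 8859-1 pass, while an EBCDIC code page 
such as cp037 and UTF-16 do not, which matches those two being handled 
in iconv only.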

Given sufficiently many GNU tools built on a non-GNU build system, it 
should be possible to cross-compile glibc there - but localedef itself is 
only ever linked against glibc and run on a system using glibc (the 
cross-localedef functionality checked in to glibc is limited to allowing 
one glibc system to generate locales for another system with the same 
glibc version but a different endianness).
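
As a small illustration of why endianness matters for generated locale 
data, the sketch below serializes one 32-bit value in both byte orders.  
It is not the actual layout of compiled locale files, only an assumed 
example of a multi-byte field.

```python
import struct

# One 32-bit value, as might appear in some binary table (illustrative only).
value = 0x0001F600

little = struct.pack("<I", value)  # byte order of a little-endian target
big = struct.pack(">I", value)     # byte order of a big-endian target

# Reading the little-endian bytes as big-endian gives a different number.
misread = struct.unpack(">I", little)[0]
print(little.hex(), big.hex(), hex(misread))
```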

> > Now, it's true that the installed localedef utility should be usable in 
> > locale A to generate locale B, for any pair (A, B) of installed locales - 
> > rather than only being able to generate locales as part of the glibc build 
> > / install process.  If localedef interprets locale sources in the 
> > character set of the locale in which it runs, that may mean the installed 
> > locale sources do need to be in ASCII.  How does localedef determine the 
> > character set in which to interpret the textual locale source files?
> 
> Yes, that is why we use UCS symbolic code points. I would then rather be

"Yes" does not answer my question about how localedef determines the 
character set of its input.

> fully consistent and use UCS symbolic code points all the way through a locale 
> source; it is a bit more cumbersome, but I would rather be consistent. 

I'd rather have some extension to allow a locale source file to declare 
that it is in UTF-8, and then use UTF-8 throughout except for control 
characters or combining characters used in isolation.
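
To compare the two styles, here is a minimal sketch that renders a 
string either entirely as <UXXXX> symbols or as literal text with only 
control characters and isolated combining characters written 
symbolically.  The function names are hypothetical, the test for "in 
isolation" is a crude heuristic, and I am assuming the four-digit form 
for BMP code points and the eight-digit form above it.

```python
import unicodedata


def ucs_symbol(ch: str) -> str:
    """Render one character as a <UXXXX> symbolic code point."""
    cp = ord(ch)
    # Assumption: 4 hex digits within the BMP, 8 hex digits above it.
    return f"<U{cp:04X}>" if cp <= 0xFFFF else f"<U{cp:08X}>"


def all_symbolic(text: str) -> str:
    """Fully consistent style: every character as a symbolic code point."""
    return "".join(ucs_symbol(ch) for ch in text)


def mostly_literal(text: str) -> str:
    """Literal text, except controls and isolated combining marks."""
    out = []
    prev = None
    for ch in text:
        cat = unicodedata.category(ch)
        is_control = cat == "Cc"
        # Heuristic: a combining mark is "isolated" if nothing precedes it
        # or the preceding character is a control or separator.
        is_isolated_mark = cat.startswith("M") and (
            prev is None or unicodedata.category(prev)[0] in "CZ"
        )
        out.append(ucs_symbol(ch) if is_control or is_isolated_mark else ch)
        prev = ch
    return "".join(out)
```

For example, all_symbolic("ISBN") gives "<U0049><U0053><U0042><U004E>", 
whereas mostly_literal("ISBN") leaves the text unchanged.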

-- 
Joseph S. Myers
joseph@codesourcery.com