locale encodings

Keld Simonsen keld@keldix.com
Thu Nov 14 11:33:00 GMT 2013


I am aware of the problem, and will look into it.
It may take some time, tho.

Best regards
keld

On Thu, Nov 14, 2013 at 09:50:05AM +0200, Troy Korjuslommi wrote:
> By the way, I ran some tests on the fi_FI locale for glibc-2.18 and it
> seems to contain out-of-date information with regard to collation. The
> correct collation order/data are specified in Finnish standard SFS-EN
> 13710 published in 2011 (Finnish standard based on EN 13710 ~aka ISO/IEC
> 14651) and CLDR, and implemented in ICU. A quick look at the fi_FI file
> tells me that at least the dates are off, which would imply the data
> being off. The collation errors seem to be diacritic related, so I would
> have to go through the actual data to determine whether the error is in
> strcoll's dealing with UTF-8 or the collation data. The collation data
> seems to be the most likely suspect. Keld, your name is listed as the
> contact, so it is probably best that you check this out, in case only the
> comments are off. Also, the charset is wrong. It is listed as iso-8859-1
> for fi_FI and iso-8859-15 for fi_FI@euro. The correct charset for
> Finnish is UTF-8. Only UTF-8 includes all the characters included in the
> current standards.
> 
> Since EN 13710 specifies a European collation order, it should also be
> used in other European locales as the default sorting order.
> 
> I've tried to push for more cooperation with CLDR in the past too, and
> here is a good case in point why it would actually be a good idea to
> keep an eye on CLDR. There is no need to automate the process
> (difficulty of which seems to be the reason for resisting CLDR), just
> get the relevant data. Running comparison tests between cldr and libc
> would also be a good idea. ICU is pretty up-to-date in terms of CLDR and
> other Unicode.org data, so that would be an easy way to implement the
> tests.
> 
> Troy
> 
> 
> On Tue, 2013-11-12 at 10:37 -0500, Steven Abner wrote:
> > On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:
> > 
> > > all data that is important, save one, is in POSIX's 7-bit ASCII
> > 
> >  I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
> > head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
> >  As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
> > who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
> > JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
> > That same designer could use UTF8, and default character information from glibc locales to
> > create an embedded version of a code set for microwaves in China.
> >   Not saying this is standard, but my point, I guess, is that the default character set for the
> > locale could or should go into the ASCII section of "LC" data. Comments in any encoding get
> > gobbled; quoted strings are either in the default character set or UTF8.
> >   I am no expert, just food for thought.
> > Steve
> 


