Sources Bugzilla – Bug 10580
New file for hr_HR localedata
Last modified: 2013-05-01 17:23:29 UTC
---------------- HISTORY: ----------------

The hr_HR locale started out as a copy of the sl_SI locale in glibc-2.0 and was maintained by Borka Jerman-Blažič (from Slovenia). Shortly afterwards Tomislav Vujec (then at CARNet, now at Red Hat) adapted it to suit hr-specific needs. After around 1998 the locale was only updated by glibc maintainer Ulrich Drepper, who added or changed portions of it as part of mass updates to many locales.

I contacted the current maintainer, Tomislav Vujec, last week and he is willing to support the changes. Also, since it has been more than a decade since he last changed this locale, he noted that he would be willing to pass maintainership to someone else. BTW, he is also the maintainer of bs_BA; I hope the Bosnian translation team will take over maintenance of that locale...

---------------- RATIONALE: ----------------

The point is: the hr_HR locale is now in a state of flux. It kind of works, and fails in fairly subtle ways when sorting digraphs. I have made numerous changes, which I describe below...

Croatia doesn't have a language law or any real specification of the rules for writing dates, monetary data and so on. Most language decisions in real life are made using commonly established conventions. I justify my decisions in the change descriptions below, using URLs where needed...

I really wanted to get this right, so I have read the whole archive of the libc-locales mailing list (2004-now), and also ISO/IEC TR 14652 (albeit the 2002 edition, which I found for free on the Internet). I have looked at the history of changes to the hr_HR locale through "git blame". I have also studied the sr_RS locale, which is somewhat related to hr_HR since Croatian, Bosnian and Serbian have (or had) a lot of common conventions.

Initially I only wanted to change LC_COLLATE, but it made sense to update the locale as a whole. That required far more time than I anticipated, but the changes are worth it. I have (heavily) commented the locale, so it should be easy to maintain from now on.
(UTF-8 characters are used only in comments.)

I have also contacted all hr translation team leaders and pointed them to this bug report so they can give their opinion on these changes, since the changes will be system-wide once accepted and the translators are, by definition, at the forefront of i18n and l10n efforts.

There are some general locale system errors which are not specific to the Croatian locale, so if Ulrich Drepper (if he has some time) or someone else versed in glibc internals can look at the change descriptions for LC_COLLATE, LC_ADDRESS and LC_TELEPHONE, and help me a bit with the system errors found there while using localedef, I'd be really thankful :o)

---------------- CHANGES: ----------------

% <initial comments>
- Mostly cleaned up the comments and moved repeated information into LC_IDENTIFICATION.
- Noted that the charset used in Croatia should primarily be UTF-8. Previously we used ISO-8859-2 (which should be phased out since it doesn't support the digraph characters [dž, lj and nj]).
- Added my email to the authors list, just so I can be notified in the future when the locale changes.

LC_IDENTIFICATION:
- I bumped the revision to 2.0 (from 1.0) since this is a major rewrite of this locale.
- I have left "CARNet" and their address, although I'm not really sure why CARNet (Croatian Academic and Research Network) would have jurisdiction over the hr_HR locale. Not even the Ministry of Education of Croatia has jurisdiction over it, as they don't supply rules for writing dates or monetary strings, for example.
- The category statements were updated to reflect the new changes. The standard requires the first parameter to name the standard this category complies with, but all other locales just use the locale name and a year here, so I did the same.
- BTW, most locales don't list all the categories which are included in their file. For example, they usually don't include LC_MEASUREMENT. I did...

LC_CTYPE:
- Although ISO/IEC TR 14652 has the controversial LC_XLITERATE category, glibc uses "translit_start" inside LC_CTYPE.
- Hence I've added transliteration info (how to transliterate the digraphs to ISO-8859-2 and ASCII). I'm not really sure how to test this; I hope I got it right.
- There is some weird behaviour in the included "i18n"... For example, it has the same character in both the "upper" AND "lower" classes, so both iswupper() and iswlower() return TRUE for <U01C5> {Dž}. I guess this is OK.
- Another quirk is that towupper() maps <U01C6> {dž} -> <U01C4> {DŽ}, which can be wrong in cases where <U01C5> {Dž} is needed. This is not OK, but it is not fixable in the current implementation anyway, so let's add it to the curiosities for now :o)

LC_COLLATE:
- Major revision.
- I have included "iso14651_t1" like most locales do, to reap the benefits of "iso14651_t1" updates as well as to significantly reduce the hr_HR locale size and increase readability.
- collating-elements are created and linked to the right digraphs [dž, lj and nj]. BTW, "collating-element" shouldn't be used after "copy", but many locales use it since there is no other way, except putting them in "iso14651_t1".
- The Croatian alphabet considers č, ć, dž, đ, lj, nj, š and ž distinct letters, and that was implemented with reorder-after statements.
- localedef says I have a SYNTAX ERROR in LC_COLLATE, probably not liking the "<d><z>" digraph literal. Is this really a SYNTAX ERROR? It works, though...

LC_TIME:
- Names of days and months are now written with the right digraphs, and not a combination of ASCII letters. Digraphs can nowadays be seen in CLI apps as well, for example `cal 2009`.
- d_t_fmt was changed to a format like: "Ponedjeljak, 31. Kolovoz 2009. 16:35:05 CEST" (the best we can do in the current implementation; Croatian uses declension in month names, like most Slavic languages) [ This format can be seen on Croatian government pages http://vlada.hr/ ]
- date_fmt was changed to a format like: "Pon, 31.08.2009. 16:49:36 CEST" [ Croatia in general doesn't use short versions of month or day names.
For the month we usually use a number, as seen on the pages of the Croatian president: http://www.predsjednik.hr/ ]
- d_fmt is changed to a format like: "01.09.2009." for the same reasons as the date_fmt change. Croatians have read and written the dd.mm.yyyy format for decades. If someone objects that it confuses people who use the mm.dd.yyyy (US) format, I agree, but this is the hr_HR locale and this form is widely used in Croatia. System software should use the YYYY-MM-DD format anyway, regardless of locale.
- t_fmt is changed to a format like: "HH:MM:SS"
- I've added week, first_weekday and first_workday. first_weekday and first_workday are set to Monday.

LC_NUMERIC:
- I've set thousands_sep to '.', so numbers are formatted as "12.345.678,90" or "-12.345.678,90".

LC_MONETARY:
- I've lowercased currency_symbol to "kn", since that is the form the majority of citizens and shops use nowadays. See online shops: http://www.links.hr/ , http://www.profil.hr/ , and many others. You can see there is no rule for this on Wikipedia: http://hr.wikipedia.org/wiki/Hrvatska_kuna , where they note that the symbol is "Kn" but use "kn" a lot on the same page.
- I've set the thousands separator (mon_thousands_sep) to '.', as in LC_NUMERIC.
- I've changed the monetary string format to "14.986,42 kn" and "-14.986,42 kn", and the international one to "HRK 14.986,42" and "-HRK 14.986,42", as was agreed upon in 2003 by Tomislav Vujec on libc-alpha [ http://sourceware.org/ml/libc-alpha/2003-04/msg00254.html ]. I'm not really sure that HRK should come before the value in the international version (as said at the top, there is no law on how to write monetary values in Croatia, just conventions). I'd leave them the same as the local versions and just use HRK instead of kn, but I've complied with the libc-alpha agreement of 2003 for now.

LC_MESSAGES:
- I've removed the trailing .* in yesexpr and noexpr, as it was discussed on the libc-locales mailing list [ http://sources.redhat.com/bugzilla/show_bug.cgi?id=71 ] that it's not really necessary.
- I didn't include 1 in yesexpr and 0 in noexpr, although this was discussed on the libc-locales mailing list too. Not many locales use it, so I've skipped it for now.
- I've added yesstr and nostr.

LC_NAME:
- Changed name_fmt to "salutation name other_name surnames".
- I've added name_mr, name_mrs and name_miss. Croatian has no gender-neutral salutation, nor a neutral female (name_ms) version of the salutation.

LC_ADDRESS:
- postal_fmt is changed, so that an address now looks like:
    Company name
    Department name
    Person's name
    C/O Person or Organization
    Street name and house number
    ZIP Code and City name
    Country
- localedef complains that postal_fmt has an invalid escape sequence, I don't know why!?!
- I've added definitions for many missing attributes: country_post, country_car, country_isbn, lang_name, lang_ab, lang_term and lang_lib.

LC_TELEPHONE:
- I've changed tel_int_fmt to look like: "+<country code> <area code without leading 0> <local number> <possible ext>"
- I've changed tel_dom_fmt to look like: "<possible area code with leading 0> <local number> <possible ext>"
- localedef complains that tel_int_fmt and tel_dom_fmt have an invalid escape sequence, I don't know why!?!

LC_PAPER:
- A4 is used in Croatia.

LC_MEASUREMENT:
- Croatia uses metric measurements.

---------------- TESTING: ----------------

To see the file without Uxxx literals, I made this ugly one-liner which makes an HTML version of it. Just change the file name at the start and you can use it with other locale files as well.

( FILE=hr_HR; sed -e 's/<U\([0-9A-F][0-9A-F][0-9A-F][0-9A-F]\)>/\&lt;\&#x\1;\&gt;/g' < $FILE > $FILE.tmp; sed -e 's/</\&lt;/g' < $FILE.tmp > $FILE.html; sed -e 's/>/\&gt;/g' < $FILE.html > $FILE.tmp; echo "<pre>" > $FILE.html; cat $FILE.tmp >> $FILE.html; rm $FILE.tmp )

Also, to test collation in the hr_HR locale I made a small dictionary which has the Croatian digraphs in all forms, as well as the letters which are considered distinct. To test collation with it I do the following: randomize it with `sort -R`, and re-sort it.
The resulting file should have the same MD5 as the starting one... Testing the other locale categories is a bit harder, but small C programs work well, and most of the code templates can be found in the glibc source under localedata anyway.
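The LC_CTYPE curiosity described above (the digraph dž exists as three code points: upper, title and lower) can be looked at outside glibc as well. The following is only a minimal sketch that uses Python's built-in Unicode tables rather than glibc's "i18n" file, so it shows what Unicode itself says about <U01C4>, <U01C5> and <U01C6>, not what iswupper() or towupper() return on a given system.

```python
import unicodedata

DZ_UPPER = "\u01C4"  # DŽ
DZ_TITLE = "\u01C5"  # Dž
DZ_LOWER = "\u01C6"  # dž


def describe(ch):
    """Collect the Unicode general category and case mappings of one character."""
    return {
        "category": unicodedata.category(ch),
        "upper": f"U+{ord(ch.upper()):04X}",
        "lower": f"U+{ord(ch.lower()):04X}",
        "title": f"U+{ord(ch.title()):04X}",
    }


for ch in (DZ_UPPER, DZ_TITLE, DZ_LOWER):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Unicode gives <U01C5> its own "titlecase letter" category and a separate titlecase mapping, which is the piece a plain upper/lower pair such as towupper()/towlower() cannot express.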
Created attachment 4158
new hr_HR locale file

This is not a patch file, since it's a completely new file and a patch would be huge for no good reason. This file is just 1/7th the size of the file it replaces!
Created attachment 4159
Small Croatian dictionary for testing

A small dictionary of already sorted Croatian words which contain the digraphs, their variations, and the letters which are considered distinct in the Croatian alphabet and therefore affect sorting. Use `sort -R` to randomize it, and `sort` to check that you get the same version back.
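The shuffle-and-re-sort check can also be sketched without an installed hr_HR locale. This is an illustration only, not glibc's collation algorithm: it is a single-level comparison that treats every "dž", "lj" and "nj" sequence as one letter (a few Croatian words spell these letters separately, which the sketch ignores). The placement of q, w, x and y, the word list, and the names `collation_key` and `round_trip_ok` are assumptions made for the example.

```python
import random

# Croatian alphabet order; q, w, x, y are placed at their Latin positions (assumption).
ALPHABET = [
    "a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h", "i", "j",
    "k", "l", "lj", "m", "n", "nj", "o", "p", "q", "r", "s", "š", "t", "u",
    "v", "w", "x", "y", "z", "ž",
]
RANK = {letter: index for index, letter in enumerate(ALPHABET)}
DIGRAPHS = {"dž", "lj", "nj"}
# Single code point forms of the digraphs, after lowercasing.
SINGLE_CODE_POINTS = {"\u01C6": "dž", "\u01C9": "lj", "\u01CC": "nj"}


def collation_key(word):
    """Turn a word into a tuple of alphabet ranks, one rank per Croatian letter."""
    text = word.lower()
    for code_point, pair in SINGLE_CODE_POINTS.items():
        text = text.replace(code_point, pair)
    key = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:
            key.append(RANK[text[i:i + 2]])
            i += 2
        else:
            # Characters outside the alphabet sort after everything else.
            key.append(RANK.get(text[i], len(ALPHABET)))
            i += 1
    return tuple(key)


def round_trip_ok(sorted_words, seed=0):
    """Shuffle an already sorted list and check that re-sorting restores it."""
    shuffled = list(sorted_words)
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled, key=collation_key) == list(sorted_words)


WORDS = ["cesta", "čaj", "ćup", "dan", "džep", "đak", "lopta", "ljeto",
         "more", "noć", "njiva", "zima", "žaba"]
print(round_trip_ok(WORDS))
```

The round trip is only meaningful when no two words share the same key (for example, case variants of one word), because a stable sort cannot restore their original relative order.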
Just a note that I fully support the changes. Furthermore, since I moved out of Croatia 10 years ago, I am unable to stay in sync with language- and locale-relevant policies and rules. Therefore, I would like to ask that a new maintainer be selected. I don't know if there is an official process for this nowadays, but since Dragan did all this work, I would like to support him if he wants to take over that role.
Thank you, Tomislav, for your support. The KDE l10n team leader has contacted me; I'm still waiting for the GNOME l10n team to have their say. As for the new maintainership, I'm willing to accept it for the hr_HR locale.
The GNOME translation team still hasn't responded to my query. Instead I have contacted the Croatian Ubuntu team, which also does translation work, as well as the Croatian Linux newsgroup. So let's wait a few more days...
Created attachment 4175
A new version of the hr_HR locale (with lowercased day and month names)
Apart from lowercasing day and month names, nobody had any objections to this new version of the locale. I think this can be committed to libc-locales.

Thank you all for your time,
N::
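For readers who want to see what the date formats described earlier look like with lowercased names, here is a small sketch. It does not use the locale file or glibc at all: the day and month names are hard-coded, the Python format strings are my own stand-ins for d_fmt, t_fmt and d_t_fmt, the time zone is passed in as plain text, and the unpadded day number in the long format is an assumption.

```python
from datetime import datetime

DAYS = ["ponedjeljak", "utorak", "srijeda", "četvrtak", "petak", "subota", "nedjelja"]
MONTHS = ["siječanj", "veljača", "ožujak", "travanj", "svibanj", "lipanj",
          "srpanj", "kolovoz", "rujan", "listopad", "studeni", "prosinac"]


def d_fmt(moment):
    """Short date, dd.mm.yyyy. with the trailing dot."""
    return moment.strftime("%d.%m.%Y.")


def t_fmt(moment):
    """24-hour time, HH:MM:SS."""
    return moment.strftime("%H:%M:%S")


def d_t_fmt(moment, zone):
    """Long date and time with full (nominative) day and month names."""
    day_name = DAYS[moment.weekday()]  # weekday() is 0 for Monday
    month_name = MONTHS[moment.month - 1]
    return f"{day_name}, {moment.day}. {month_name} {moment.year}. {t_fmt(moment)} {zone}"


print(d_fmt(datetime(2009, 9, 1)))
print(d_t_fmt(datetime(2009, 8, 31, 16, 35, 5), "CEST"))
```

The month name stays in the nominative case here, which matches the limitation noted in the LC_TIME description: the format cannot apply the declension a Croatian writer would use.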
The locale doesn't compile correctly:

/home/drepper/gnu/libc/localedata/locales/hr_HR:143: LC_COLLATE: syntax error
LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'
LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'
no output file produced because warnings were issued
(In reply to comment #8)
> The locale doesn't compile correctly:
>
> /home/drepper/gnu/libc/localedata/locales/hr_HR:143: LC_COLLATE: syntax error
> LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'
> LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
> LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'
> no output file produced because warnings were issued

Hi Ulrich, thanks for your time and reply...

I'm aware of these errors, but they are system-wide errors rather than hr_HR ones. I wrote about them in the long explanation... I'm quite sure you didn't have time to read it all, so let me repeat the last paragraph of RATIONALE, which is of importance here:

------------------------------------------------------------------------------
There are some general locale system errors which are not specific to the Croatian locale, so if Ulrich Drepper (if he has some time) or someone else versed in glibc internals can look at the change descriptions for LC_COLLATE, LC_ADDRESS and LC_TELEPHONE, and help me a bit with the system errors found there while using localedef, I'd be really thankful :o)
------------------------------------------------------------------------------

Allow me to elaborate just a bit to make it easier for you:

"hr_HR:143: LC_COLLATE: syntax error"
:: I've used quotes to mark the digraph <d><z>. I used that designation since the same designation is used in the "iso14651_t1_common" file... Look with: `grep '<d><z>' iso14651_t1_common`

"LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'"
:: %n is a valid escape sequence per ISO/IEC TR 14652. It states: "%n -- Person's name, possibly constructed with the LC_NAME "name_fmt" keyword"

"LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'" and "LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'"
:: Again, as per ISO/IEC TR 14652, they contain no invalid escape sequence... %c %a %A %l %e and %t are all mentioned in the standard.

Thank you once more...
N::
Dragan, thank you for your work. It is true that the locales in glibc are not fully ISO/IEC 14652 compliant, in particular some fields that should be used in fact are not. I'm not personally sure why this is the case, probably it's for purely historical reasons. However, I believe the greatest value lies in consistency, and if no current locales use %n in postal_fmt and %e and %t in tel_*_fmt, neither should hr_HR as the programs using these locales probably do not expect to find these field descriptors there. So let's not conflate the issue of unsupported field descriptors with the new hr_HR locale; could you please submit an hr_HR locale version that does not use these field descriptors? Since you got a buy-in from other Croatians active in this area, I think we can commit the new locale speedily afterwards. (Regarding the issue of unsupported field descriptors, if you are interested in pursuing that further. A simple technical fix is to simply patch locale/programs/ld-{telephone,address}.c to allow these. However, we should do this with consideration to locale consistency and current usage of these categories in programs. This needs to be researched and I think the next reasonable step is to document the currently supported field descriptors in "glibc style locales". We can then think of how to proceed further while our users will already have a valuable reference. This process can be done gradually, category by category. Does that sound sensible?)
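As a rough illustration of the field descriptor mismatch discussed here, the sketch below lists which `%x` descriptors in a telephone format fall outside an accepted set, and expands a format with sample values. The accepted set is inferred only from this thread (%c, %a, %A and %l are used by other locales; %e and %t are rejected) and is not read from localedef's source; the format strings, the phone number and the function names are hypothetical.

```python
import re

# Assumption: inferred from this bug thread only, not from ld-telephone.c.
ACCEPTED_TEL_DESCRIPTORS = frozenset("caAl")


def rejected_descriptors(fmt, accepted=ACCEPTED_TEL_DESCRIPTORS):
    """Return the descriptor letters in fmt that are not in the accepted set."""
    return [letter for letter in re.findall(r"%(.)", fmt) if letter not in accepted]


def expand(fmt, values):
    """Replace each %x in fmt with values['x']; missing letters become ''."""
    return re.sub(r"%(.)", lambda match: values.get(match.group(1), ""), fmt)


# Hypothetical format strings and phone number, for illustration only.
standard_style_fmt = "+%c %a %l%t%e"
reduced_fmt = "+%c %a %l"
number = {"c": "385", "a": "1", "l": "2345678"}

print(rejected_descriptors(standard_style_fmt))
print(rejected_descriptors(reduced_fmt))
print(expand(reduced_fmt, number))
```

A table like `ACCEPTED_TEL_DESCRIPTORS`, kept per category, is one possible shape for the documentation of "glibc style" descriptors suggested above.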
On Sat, Feb 16, 2013 at 12:39:45AM +0000, pasky at ucw dot cz wrote:
> (Regarding the issue of unsupported field descriptors, if you are interested in
> pursuing that further. A simple technical fix is to simply patch
> locale/programs/ld-{telephone,address}.c to allow these. However, we should do
> this with consideration to locale consistency and current usage of these
> categories in programs. This needs to be researched and I think the next
> reasonable step is to document the currently supported field descriptors in
> "glibc style locales". We can then think of how to proceed further while our
> users will already have a valuable reference. This process can be done
> gradually, category by category. Does that sound sensible?)

I would rather take another approach, and that would be to further implement ISO TR 14652 or the new version thereof, ISO TR 30112. ISO TR 30112 is closer to glibc, as some things that glibc implements are now specified in 30112, including LC_PAPER.

Best regards
Keld
Thank you for your comments! So if I understand correctly, I just need to trim LC_ADDRESS and LC_TELEPHONE to comply with the current support in glibc, and you'll accept the whole patch? That would be great, since it resolves a lot of issues, shortens the file, makes it more manageable for future changes, and so on... Back then I read the whole of ISO/IEC TR 14652, tried to mimic the format of other locales as much as possible, and I think I made a good patch. In the end I thought I'd need to learn flex & bison to improve glibc's parsing of that data, but that was beyond me.
Keld, of course using the newer standard makes sense; however, I'm not sure what you mean by "further implement" and how that differs from what I wrote. If you are interested in discussing this further, I propose we move the discussion to the mailing list where more people could follow it. (Note that I myself don't have the time to pursue the issue itself, so it makes sense to talk more about it only if someone intends to do anything about it.)

Dragan, I'm sorry, I missed the LC_COLLATE syntax error. Is there any reason why we cannot use the Unicode entity there instead?

Also, I'm wondering how this locale was tested if it doesn't even compile with glibc's localedef now. And which of the people who expressed support for the new locale actually tested it, rather than just embracing the idea?
On Sun, Feb 17, 2013 at 12:13:28AM +0000, pasky at ucw dot cz wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=10580
>
> --- Comment #13 from Petr Baudis <pasky at ucw dot cz> 2013-02-17 00:13:28 UTC ---
> Keld, of course using the newer standard makes sense; however, I'm not sure
> what do you mean by "further implement" and how that differs from what I wrote.
> If you are interested in discussing this further, I propose we move the
> discussion to the mailing list where more people could follow it. (Note that I
> myself don't have the time to pursue the issue itself, so it makes sense to
> talk more about it only if someone intends to do anything about it.)

So where should we do the discussion? I did think that this list was relevant.

Anyway, the differences are not big... It is mostly to align with the current glibc implementation, and then introduce 2 novelties.

Best regards
Keld
Hi, let me be frank. This was made in 2009. I spent at least a week reading ISO documents, comparing with other locales similar to hr_HR, contacting the Croatian Linux User Group and writing tests. Every question regarding the compile errors was answered in the huge description of the patch and repeated in comment #9, since obviously Drepper didn't read it in the first place when he dissed the patch.

If you don't want LC_ADDRESS or LC_TELEPHONE, copy them from the C locale. If you don't want to implement "<d><z>", comment it out... Also, there is no standard test suite for these locale categories. I find it hard to believe that I (or any locale writer) should have to write custom test suites from scratch again, nor do I have the time.

I repeat, this patch was a big improvement in 2009. I don't have time to write test suites from scratch again, let alone to reread the ISO documents and patch libc itself. It's your choice whether you ever apply this patch.
I have read everything you have written in this bug report; I might have missed something, but I asked my questions because I believe they weren't answered in the previous comments. My question was not aimed at test suites, though I appreciate your effort to test the collation rules. I was just wondering whether and how this locale (considering that it cannot be compiled by localedef as it is now) was tried out with actual commonly used software, and whether that was done just by you or by the other people supporting it too. If you could adjust the locale into a compilable form, we can easily ask others to test it so that we can incorporate any bugfixes before the next release; this (besides a few simple sanity checks I'll do) does not need to block committing the new locale.
Heyyah Petr, thanks for reading and for the reply. Give me a few days, and I'll try to test and fix this patch so that it compiles with 2.17. I cannot vouch for the testing done by others who saw and approved this patch. I did it myself as I was displeased with the state of the hr_HR locale back then. I was mainly interested in collation, but did a lot more research than intended, and in turn patched all categories of the locale. While doing that, I cleaned, commented and trimmed the locale file considerably.

bye for now,
N::
Hi! Yes, I fully appreciate your efforts - I just want to confirm the status of the new locale regarding how it has been tested. Glad you decided to update your version of the locale, we will be looking forward to the new version. I can't think of specific updates that would be required for 2.17 (there were no changes in stock hr_HR since 2009), so mainly making it compile would be great.
Created attachment 6876
An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections

This is the promised update to the hr_HR locale. Changes are:
- bumped the revision to 2.1 and the date to the current date
- removed duplicate character transliterations from LC_CTYPE which are already found in i18n
- fixed the LC_COLLATE error, and tested with the small Croatian dictionary provided in 2009 using "sort -R dict_file > scrambled_file; sort scrambled_file > sorted_file"; the md5 sums of the original <dict_file> and <sorted_file> are the same
- updated some comments and some spacing
- changed thousands_sep and mon_thousands_sep to " " instead of the "." char, to comply with the suggestions in language books published since 2009
- updated LC_ADDRESS to remove the %n (person's name) field, since it's not yet available in the code. Other locales fall back to %a (care of person or organization) and that's OK for now.
- cleaned LC_TELEPHONE by removing the %t (space or null string) and %e (extension) fields, which are currently unsupported in the code. Fell back to "+%c %a %l" and "%A %l" as seen in other locales.

The locale now compiles cleanly using localedef...
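To make the effect of the thousands_sep change concrete, here is a small sketch that formats the example amounts from this report with either separator. It is plain string formatting, not a call into the locale: the three-digit grouping and the sign and symbol placement are copied from the examples earlier in the report, and the function names are made up for illustration.

```python
from decimal import Decimal


def group_number(value, thousands_sep, decimal_point=","):
    """Format value with two decimals, groups of three and the given separators."""
    text = f"{value:,.2f}"  # e.g. "-12,345,678.90"
    integer_part, fraction = text.split(".")
    return integer_part.replace(",", thousands_sep) + decimal_point + fraction


def local_money(value, thousands_sep, symbol="kn"):
    """Local form: the number followed by the currency symbol."""
    return f"{group_number(value, thousands_sep)} {symbol}"


def international_money(value, thousands_sep, code="HRK"):
    """International form: sign, currency code, then the unsigned number."""
    sign = "-" if value < 0 else ""
    return f"{sign}{code} {group_number(abs(value), thousands_sep)}"


amount = Decimal("12345678.90")
print(group_number(amount, "."))   # separator before this update
print(group_number(amount, " "))   # separator after this update
print(local_money(Decimal("-14986.42"), " "))
print(international_money(Decimal("-14986.42"), " "))
```

Passing the separator as a parameter keeps both conventions visible side by side, which is handy when comparing output from the old and new locale files.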
Created attachment 6877
An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections

Fixed small typos in comments... Reset the bug status to "NEW" to signify that it's ready for review by the maintainers of the library...

Thanks for your time,
N::
Will you accept this patch? It also fixes #15264
Created attachment 7010
Updated version of hr_HR

- Removed CARNet as the source of the locale, along with their address, since I don't have any official relation to them and the locale is completely changed.
- Small fixes in the comments of the locale.
- Bumped the version to 2.2 and the date to 2013-05-01.