Description
Dragan Stanojevic - Nevidljivi
2009-08-31 22:54:19 UTC
Created attachment 4158 [details]
new hr_HR locale file
This is not a patch file, since it's a completely new file and a patch would be huge
without a valid reason for it. This file is just 1/7th the size of the file it replaces!
Created attachment 4159 [details]
Small Croatian dictionary for testing
Small dictionary of already sorted Croatian words containing digraphs, their
variations, and letters which are considered distinct in the Croatian alphabet,
all of which affect sorting.
Use `sort -R` to randomize it, and `sort` to check you get the same version
back
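The round-trip check described above can be sketched without a compiled locale. This is only an illustration, not the glibc collation rules: it assumes the standard 30-letter Croatian alphabet, treats every "dž", "lj" and "nj" sequence as a digraph, and uses a small hypothetical word list in place of the attached dictionary. `collation_key` and `digest` are names made up for this sketch.

```python
import hashlib
import random

# The 30 letters of the Croatian alphabet in order; "dž", "lj" and "nj"
# are digraphs that count as single letters.
ALPHABET = ["a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h",
            "i", "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s",
            "š", "t", "u", "v", "z", "ž"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def collation_key(word):
    """Map a word to a list of alphabet ranks, consuming digraphs greedily."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:   # digraph such as "lj"
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(text[i], len(ALPHABET) + ord(text[i])))
            i += 1
    return key


def digest(words):
    """MD5 of the newline-joined list, mirroring an md5sum comparison."""
    return hashlib.md5("\n".join(words).encode("utf-8")).hexdigest()


# Hypothetical sample, already in Croatian alphabetical order.
words = ["cesta", "čaj", "ćup", "dan", "džep", "đak", "luk", "ljeto",
         "more", "noga", "njiva", "zima", "žaba"]

scrambled = words[:]
random.Random(0).shuffle(scrambled)                # stands in for `sort -R`
resorted = sorted(scrambled, key=collation_key)    # stands in for `sort`
print(digest(resorted) == digest(words))
```

A plain code point sort puts "ljeto" before "luk", which is the kind of difference the dictionary is meant to expose.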
Just a note that I fully support the changes. Furthermore, since I moved out of Croatia 10 years ago, I am unable to stay in sync with language and locale relevant policies and rules. Therefore, I would like to ask that a new maintainer be selected. I don't know if there is an official process for this nowadays, but since Dragan did all this work, I would like to support him if he wants to take over that role.

Thank you Tomislav for your support. The KDE l10n team leader contacted me; I'm still waiting for the GNOME l10n team to have their say. As for the new maintainership, I'm willing to accept it for the hr_HR locale.

The GNOME translation team still hasn't responded to my query. Instead I have contacted the Croatian Ubuntu team, which also does translation work, as well as a Croatian Linux news group. So let's wait for a few more days...

Created attachment 4175 [details]
A new version of the hr_HR locale (with lowercased day and month names)
Apart from lowercasing day and month names, nobody had any objections to this new version of the locale. I think this can be committed to libc-locales.
Thank you all for your time,
N::

The locale doesn't compile correctly:

/home/drepper/gnu/libc/localedata/locales/hr_HR:143: LC_COLLATE: syntax error
LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'
LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'
no output file produced because warnings were issued

(In reply to comment #8)
> The locale doesn't compile correctly:
>
> /home/drepper/gnu/libc/localedata/locales/hr_HR:143: LC_COLLATE: syntax error
> LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'
> LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
> LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'
> no output file produced because warnings were issued
Hi Ulrich, thanks for your time and reply...
I'm aware of these errors, but they are more system-wide errors than hr_HR ones. I wrote about them in the long explanation... I'm quite sure you didn't have time to read it all, but let me repeat the last paragraph of RATIONALE, which is of importance here:
------------------------------------------------------------------------------
There are some general locale system errors which are not specific to Croatian locale, so if Ulrich Drepper (if he has some time) or someone else versed in glibc internals can look at change descriptions to LC_COLLATE, LC_ADDRESS and LC_TELEPHONE, and help me a bit with system errors found there while using localedef I'd be really thankful :o)
------------------------------------------------------------------------------
Allow me to elaborate just a bit to make it easier for you:

"hr_HR:143: LC_COLLATE: syntax error"
:: I've used quotes to mark the digraph <d><z>. I used that designation since the same designation is used in the "iso14651_t1_common" file...
Look with: `grep '<d><z>' iso14651_t1_common`

"LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'"
:: %n is a valid escape sequence per "ISO/IEC TR 14652". It states: "%n -- Person's name, possibly constructed with the LC_NAME "name_fmt" keyword"

"LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'" and "LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'"
:: Again, as per ISO/IEC TR 14652, it contains no invalid escape sequence... %c %a %A %l %e and %t are mentioned in the standard.

Thank you once more...
N::

Dragan, thank you for your work. It is true that the locales in glibc are not fully ISO/IEC 14652 compliant, in particular some fields that should be used in fact are not. I'm not personally sure why this is the case, probably it's for purely historical reasons.

However, I believe the greatest value lies in consistency, and if no current locales use %n in postal_fmt and %e and %t in tel_*_fmt, neither should hr_HR as the programs using these locales probably do not expect to find these field descriptors there.

So let's not conflate the issue of unsupported field descriptors with the new hr_HR locale; could you please submit an hr_HR locale version that does not use these field descriptors? Since you got a buy-in from other Croatians active in this area, I think we can commit the new locale speedily afterwards.

(Regarding the issue of unsupported field descriptors, if you are interested in pursuing that further. A simple technical fix is to simply patch locale/programs/ld-{telephone,address}.c to allow these. However, we should do this with consideration to locale consistency and current usage of these categories in programs. This needs to be researched and I think the next reasonable step is to document the currently supported field descriptors in "glibc style locales". We can then think of how to proceed further while our users will already have a valuable reference. This process can be done gradually, category by category. Does that sound sensible?)

On Sat, Feb 16, 2013 at 12:39:45AM +0000, pasky at ucw dot cz wrote:
> (Regarding the issue of unsupported field descriptors, if you are interested in
> pursuing that further. A simple technical fix is to simply patch
> locale/programs/ld-{telephone,address}.c to allow these. However, we should do
> this with consideration to locale consistency and current usage of these
> categories in programs. This needs to be researched and I think the next
> reasonable step is to document the currently supported field descriptors in
> "glibc style locales". We can then think of how to proceed further while our
> users will already have a valuable reference. This process can be done
> gradually, category by category. Does that sound sensible?)
I would rather take another approach, and that would be to further implement
ISO TR 14652 or the new version thereof, ISO TR 30112. ISO TR 30112 is closer
to glibc, as some things that glibc implements are now specified in
30112, including LC_PAPER.
Best regards
Keld
Thank you for your comments! So if I understand correctly, I just need to trim LC_ADDRESS and LC_TELEPHONE to comply with current support in glibc, and you'll accept the whole patch? That would be great, since it resolves a lot of issues, shortens the file, makes it more manageable for future changes, and so on... Back then I read the whole ISO/IEC TR 14652, tried to mimic the format of other locales as much as possible, and I think I made a good patch. In the end I thought I'd need to learn flex & bison to improve glibc's parsing of that data, but that was beyond me.

Keld, of course using the newer standard makes sense; however, I'm not sure what do you mean by "further implement" and how that differs from what I wrote. If you are interested in discussing this further, I propose we move the discussion to the mailing list where more people could follow it. (Note that I myself don't have the time to pursue the issue itself, so it makes sense to talk more about it only if someone intends to do anything about it.)

Dragan, I'm sorry, I missed the LC_COLLATE syntax error. Any reason why we cannot use the Unicode entity there instead? Also, I'm wondering, how was testing of this locale done if it doesn't even compile with glibc's localedef now? And which of the people that provided support for the new locale actually tested it rather than just embraced the idea?

On Sun, Feb 17, 2013 at 12:13:28AM +0000, pasky at ucw dot cz wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=10580
>
> --- Comment #13 from Petr Baudis <pasky at ucw dot cz> 2013-02-17 00:13:28 UTC ---
> Keld, of course using the newer standard makes sense; however, I'm not sure
> what do you mean by "further implement" and how that differs from what I wrote.
> If you are interested in discussing this further, I propose we move the
> discussion to the mailing list where more people could follow it. (Note that I
> myself don't have the time to pursue the issue itself, so it makes sense to
> talk more about it only if someone intends to do anything about it.)
So where should we do the discussion? I did think that
this list was relevant. Anyway, the differences are not big.
It is mostly to align with the current glibc implementation, and then
introduce 2 novelties.
Best regards
Keld
Hi, let me be frank. This was made in 2009. I spent at least a week reading ISO documents, comparing with other locales similar to hr_HR, contacting the Croatian Linux User Group and writing tests. Every question concerning compile errors was answered in the huge description of the patch, and repeated in comment #9, since obviously Drepper didn't read it in the first place when he dissed the patch. If you don't want LC_ADDRESS or LC_TELEPHONE, copy them from the C locale. If you don't want to implement "<d><z>", comment it out... Also, there is no standard test suite for these locale categories. I find it hard to believe that I (or any locale writer) have to write custom test suites from scratch again, nor do I have the time. I repeat, this patch was a big improvement in 2009. I don't have time to write test suites from scratch again, let alone to reread ISO documents and patch libc itself. It's your choice whether you ever apply this patch.

I have read everything you have written in this bug report; I might have missed something, but I asked my questions because I believe they weren't answered in the previous comments. My question was not geared at test suites, though I appreciate your effort to test the collation rules. I was just wondering whether and how this locale (considering that it cannot be compiled by localedef as it is now) was tried out with actual commonly used software, and whether that was done just by you or by the other people supporting it too. If you could adjust the locale into a compilable form, we can easily ask others to test it so that we can incorporate any bugfixes before the next release; this (besides a few simple sanity checks I'll do) does not need to block committing the new locale.

Heyyah Petr, thanks for reading and for the reply. Give me a few days, and I'll try to test and fix this patch to compile using 2.17. I cannot vouch for the testing done by others who saw and approved this patch. I did it myself as I was displeased with the state of the hr_HR locale back then. I was mainly interested in collation, but did a lot more research than intended, and in turn patched all categories of the locale. During that, I cleaned, commented and trimmed the locale file considerably.
bye for now,
N::

Hi! Yes, I fully appreciate your efforts - I just want to confirm the status of the new locale regarding how it has been tested. Glad you decided to update your version of the locale, we will be looking forward to the new version. I can't think of specific updates that would be required for 2.17 (there were no changes in stock hr_HR since 2009), so mainly making it compile would be great.

Created attachment 6876 [details]
An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections
This is the promised update to the hr_HR locale.
Changes are:
- bumped revision to 2.1 and the date to the current date
- removed duplicate character transliterations from LC_CTYPE which are already found in i18n
- fixed the LC_COLLATE error, and tested with the small Croatian dictionary provided in 2009 using "sort -R dict_file > scrambled_file; sort scrambled_file > sorted_file"; the md5 sums of the original <dict_file> and <sorted_file> are the same
- updated some comments and some spacing
- changed thousands_sep and mon_thousands_sep to " " instead of the "." char to comply with the suggestions in language books published since 2009.
- updated LC_ADDRESS to remove the %n (person's name) field since it's not yet available in the code. Other locales fall back to %a (care of person or organization) and that's ok for now.
- cleaned LC_TELEPHONE by removing the %t (space or null string) and %e (extension) fields, which are currently unsupported in the code. Fell back to "+%c %a %l" and "%A %l" as seen in other locales.
Locale now compiles cleanly using localedef...
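To make the two LC_TELEPHONE fallback strings concrete, here is a small sketch of how such a format could be expanded. It is not how glibc or any particular application consumes these fields: `expand_tel_fmt` is a hypothetical helper, the phone number is made up, and the descriptor meanings (%c country code, %a area code without the national prefix, %A area code with it, %l local number) follow my reading of ISO/IEC TR 14652.

```python
def expand_tel_fmt(fmt, fields):
    """Expand %-descriptors in a telephone format string.

    Descriptors missing from `fields` raise ValueError, loosely mirroring
    how localedef rejects escape sequences it does not support.
    """
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code not in fields:
                raise ValueError(f"unsupported descriptor %{code}")
            out.append(fields[code])
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)


# Illustrative number only: country code 385, area code 1 (01 with prefix).
number = {"c": "385", "a": "1", "A": "01", "l": "2345678"}

print(expand_tel_fmt("+%c %a %l", number))   # +385 1 2345678
print(expand_tel_fmt("%A %l", number))       # 01 2345678
```

A format that still contained %e or %t would raise here, which is roughly the situation the earlier version of the locale ran into with localedef.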
Created attachment 6877 [details]
An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections
Fixed small typos in comments...
Reset the bug status to "NEW", to signify it's ready for review by maintainers of the library...
Thanks for your time,
N::
Will you accept this patch? It also fixes #15264

Created attachment 7010 [details]
Updated version of hr_HR
Removed CARNet as source of the locale, and their address since I don't have any official relation to them, and the locale is completely changed.
Small fixes in comments of the locale
Bumped version to 2.2 and date to 2013-05-01
Where did it hang for so long?

First weekday still wrong in Fedora after 1 year: https://sourceware.org/bugzilla/show_bug.cgi?id=14892

*** Bug 14892 has been marked as a duplicate of this bug. ***

Week settings should be fixed by: https://sourceware.org/ml/libc-alpha/2016-04/msg00419.html

Created attachment 9196 [details]
Added week and first_weekday to the locale
As requested, locale now contains missing "week" and "first_weekday" fields...
Created attachment 9197 [details]
Small patch removing duplicated fields
Small fix removing the duplicated week and first_weekday fields...
SemiRocket and Mike, thank you for your interest in getting this moving again. If you find any mistakes, please let me know so we can finally ship this with glibc-2.24 and finally have a clean and, more importantly, correct locale.
Will that affect the sorting order of the sort command on the GNU/Linux command line? If yes, I've been waiting for that status to change to FIXED since 2014. :) I'm sorry for not being able to contribute a constructive comment, but I'm hoping to keep this alive since the last comment was made a year ago. Thanks

In the #1 post from 2009, look under TESTING... there you have a sample using the sort command...

(In reply to Dragan Stanojevic - Nevidljivi from comment #29)
> In the #1 post from 2009, look under TESTING... there you have a sample
> using sort command...
Had no idea it could work that way. This will save me a lot of trouble I'm going through right now when sorting Croatian text. I'll try to contact you via e-mail because I have some more questions about localization files in general, and I'm thinking about changing one so I need some help. I don't want to spam this report as it serves a different purpose. I just hope to see hr_HR.utf8 in Debian soon. Many thanks for your help and effort.

*** Bug 22518 has been marked as a duplicate of this bug. ***

Created attachment 10651 [details]
0001-hr_HR-locale-various-updates-BZ-10580.patch
Created attachment 10652 [details]
0002-Add-test-case-for-collation-in-hr_HR-locale.patch
Created attachment 10653 [details]
0003-Fix-test-case-for-hr_HR-monetary-formatting.patch
Created attachment 10654 [details]
0004-hr_HR-locale-fix-collation-and-expand-collation-test.patch
The patches attached to comment#32, comment#33, comment#34, and comment#35:

0001-hr_HR-locale-various-updates-BZ-10580.patch
0002-Add-test-case-for-collation-in-hr_HR-locale.patch
0003-Fix-test-case-for-hr_HR-monetary-formatting.patch
0004-hr_HR-locale-fix-collation-and-expand-collation-test.patch

update Dragan Stanojevic’s patch to current glibc master.

This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
       via  5e56e937c9144e70a16793d2c5aa22d1bd0b2c18 (commit)
       via  cf4341ca90164398c05e74f72ff19dc52136731c (commit)
       via  9ca6b343783236fda88e9712f29b46ec875d4156 (commit)
       via  37075ae18d10802b9d62db3fbc910b30e01398d4 (commit)
      from  f33632ccd1dec3217583fcfdd965afb62954203c (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5e56e937c9144e70a16793d2c5aa22d1bd0b2c18

commit 5e56e937c9144e70a16793d2c5aa22d1bd0b2c18
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Nov 30 12:13:02 2017 +0100

    hr_HR locale: fix collation and expand collation test file

    * localedata/locales/hr_HR (LC_COLLATE): Fix collation to make test case pass.
    * localedata/hr_HR.UTF-8.in: Add more test strings.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cf4341ca90164398c05e74f72ff19dc52136731c

commit cf4341ca90164398c05e74f72ff19dc52136731c
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Nov 30 10:50:44 2017 +0100

    Fix test case for hr_HR monetary formatting

    * stdlib/tst-strfmon_l.c: Fix testcase.
    Needed because of [BZ #10580]

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9ca6b343783236fda88e9712f29b46ec875d4156

commit 9ca6b343783236fda88e9712f29b46ec875d4156
Author: Dragan Stanojević - Nevidljivi <invisible@hidden-city.net>
Date:   Thu Nov 30 10:02:55 2017 +0100

    Add test case for collation in hr_HR locale

    * localedata/Makefile: Add hr_HR.UTF-8 to test-input and to the list of locales to built for testing.
    * localedata/hr_HR.UTF-8.in: New file.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=37075ae18d10802b9d62db3fbc910b30e01398d4

commit 37075ae18d10802b9d62db3fbc910b30e01398d4
Author: Dragan Stanojević - Nevidljivi <invisible@hidden-city.net>
Date:   Thu Nov 30 09:14:51 2017 +0100

    hr_HR locale: various updates [BZ #10580]

    [BZ #10580]
    * localedata/locales/hr_HR (LC_COLLATE): Base collation rules on iso14651_t1.
    * localedata/locales/hr_HR (LC_TIME): Sync month and day names with CLDR (except use ligatures for the digraphs, CLDR does not use the ligatures), add first_workday, some fixes in the date and time formats.
    * localedata/locales/hr_HR (LC_CTYPE): Add transliteration rules for Đ and đ.
    * localedata/locales/hr_HR (LC_MONETARY): Change currency_symbol to lower case. p_cs_precedes and n_cs_precedes should be 0 instead of 1. Add int_p_cs_precedes and int_n_cs_precedes.
    * localedata/locales/hr_HR (LC_NUMERIC): Change thousands_sep to "<U202F>" (NARROW NO-BREAK SPACE) and grouping to 3;3 (Agrees with LC_MONETARY now).
    * localedata/locales/hr_HR (LC_TELEPHONE): Add tel_dom_fmt.
    * localedata/locales/hr_HR (LC_NAME): Add name_mr, name_mrs, and name_miss.
    * localedata/locales/hr_HR (LC_ADDRESS): Add country_post, country_isbn, and lang_lib. Change postal_fmt.
    change

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                 |   39 +
 localedata/Makefile       |    4 +-
 localedata/hr_HR.UTF-8.in |   70 ++
 localedata/locales/hr_HR  | 2324 ++++-----------------------------------------
 stdlib/tst-strfmon_l.c    |    8 +-
 5 files changed, 303 insertions(+), 2142 deletions(-)
 create mode 100644 localedata/hr_HR.UTF-8.in

Fixed in glibc master.

This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
       via  96b06a19e602557bfa668ad9c1a9f29044d3e774 (commit)
       via  1f6d91f328b7699610210d7d56d2cc49d60e1c27 (commit)
      from  2e49fed84c9ada0ad54445d197060dc28ee94103 (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96b06a19e602557bfa668ad9c1a9f29044d3e774

commit 96b06a19e602557bfa668ad9c1a9f29044d3e774
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Mon Dec 4 17:46:28 2017 +0100

    tr_TR locale: Base collation on iso14651_t1 [BZ #22527]

    [BZ #22527]
    * localedata/locales/tr_TR (LC_COLLATE): Base collation rules on iso14651_t1.
    A test file localedata/tr_TR.UTF-8.in is already available, this rewrite of the collation rules does reproduce the test file in the same order.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1f6d91f328b7699610210d7d56d2cc49d60e1c27

commit 1f6d91f328b7699610210d7d56d2cc49d60e1c27
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Mon Dec 4 13:10:29 2017 +0100

    hr_HR locale: Don’t use single code points for the digraphs in LC_TIME

    [BZ #10580]
    * localedata/locales/hr_HR (LC_TIME): Use two letters for the digraphs in the month and day names. Using single code points for digraphs is deprecated. While there are dedicated Unicode codepoints, for the digraphs, these are included for backwards compatibility and modern texts use a sequence of Basic Latin characters. See: https://www.unicode.org/faq/ligature_digraph.html
    This makes the month and day names agree exactly with CLDR now, CLDR does not use the single code points for the digraphs either.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                |   20 +
 localedata/locales/hr_HR |   18 +-
 localedata/locales/tr_TR | 2112 ++--------------------------------------------
 3 files changed, 82 insertions(+), 2068 deletions(-)

Big thanks to Mike FABIAN for working on resolving this, and for being thorough with the final solution: brainstorming on digraph usage, making the locale more in line with CLDR, and making it more practical by avoiding digraphs in LC_TIME...

(In reply to cvs-commit@gcc.gnu.org from comment #39)
> [...]
> commit 1f6d91f328b7699610210d7d56d2cc49d60e1c27
> Author: Mike FABIAN <mfabian@redhat.com>
> Date: Mon Dec 4 13:10:29 2017 +0100
>
> hr_HR locale: Don’t use single code points for the digraphs in LC_TIME
>
> [BZ #10580]
> * localedata/locales/hr_HR (LC_TIME): Use two letters for the
> digraphs in the month and day names. Using single code points for
> digraphs is deprecated. While there are dedicated Unicode
> codepoints, for the digraphs, these are included for backwards
> compatibility and modern texts use a sequence of Basic Latin
> characters. See: https://www.unicode.org/faq/ligature_digraph.html
> This makes the month and day names agree exactly with CLDR now,
> CLDR does not use the single code points for the digraphs either.
> [...]
Before this change all abmon items (abbreviated month names) were 3 letters long. Now all are 3 letters long except the second item (February, Feb) which is "velj", 4 letters long. Previously it was "veǉ" therefore 3 letters. Dragan, wouldn't you prefer it to be "vel", consequently 3 letters long?
The page https://vlada.gov.hr/ uses "Vel". CLDR uses "velj", so if you'd like this change I suggest creating a new ticket in CLDR first: http://unicode.org/cldr/trac/newticket

> Before this change all abmon items (abbreviated month names) were 3 letters
> long. Now all are 3 letters long except the second item (February, Feb)
> which is "velj", 4 letters long. Previously it was "veǉ" therefore 3
> letters. Dragan, wouldn't you prefer it to be "vel", consequently 3 letters
> long?
True, before this change all were 3 letters, but through discussion with Mike several arguments were made against using digraphs in LC_TIME:
- Unicode has since moved away from promoting them
- They have a lot of problems with digraphs and even tried to solve it with: "U+034F COMBINING GRAPHEME JOINER" fix, so that digraphs would be glued with it, but still written as two separate letters.
- Digraphs often look ugly in fonts, or are not contained in them so they're substituted from another font, terminals in general don't have Unicode fonts, and in TUI apps, it is better not to force digraphs, example would be `cal` or TUI mail clients, shell prompt, tmux, ...
- Abbreviations in many glibc locales aren't 3 letters. There is no rule that they need to be, they just need to be shorter.
- I wrongly assumed that all abbreviations needed to be of the same length; they don't. If that were the case I'd be more stubborn on digraphs; this way I'm more in favor of "Velj".
- Many applications and many programmers decided to avoid the glibc locale since it was ugly. They either decided to make their own (LibreOffice for example), or they do something like taking the first 3 letters of a month or day name, giving them wrong "Vel" values. "lj" is a digraph and a distinct phoneme, sounding different from simple "l". IMO "Vel" is more wrong than "Velj".
- Glibc and CLDR were once very stern in what they'd accept. Now they've become more pragmatic. One result is this issue with digraphs, but I hope that it is clear that it was done with end users in mind. There are not many Unicode digraphs used. And people will continue to type two letters for them since entering digraphs is still awkward.
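The "first 3 letters" pitfall from the list above can be shown in a few lines. This is a minimal sketch, assuming that every "dž", "lj" or "nj" sequence is a digraph (not true for every Croatian word, so a real implementation would need a dictionary or the locale data); `split_letters` and `abbreviate` are hypothetical helpers, not part of glibc or CLDR.

```python
DIGRAPHS = ("dž", "lj", "nj")


def split_letters(word):
    """Split a word into Croatian letters, keeping digraphs together."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


def abbreviate(word, n=3):
    """Take the first n letters, never cutting a digraph in half."""
    return "".join(split_letters(word)[:n])


print("veljača"[:3])          # naive slice: vel
print(abbreviate("veljača"))  # digraph-aware: velj
print(abbreviate("ožujak"))   # no digraph involved: ožu
```

Counting letters instead of characters is why "velj" is still a three-letter abbreviation in Croatian terms, even though it is four characters wide.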
In the end, this patch was done more than 8 years ago. It was a complete rewrite of the old locale, and the intention was to make it correct and easy to read/maintain. During those 8+ years several bugs were filed against hr_HR and all of them were dups of this one, since I had solved all the issues back then. Yet so many maintainers avoided this patch for one reason or another. During the discussion with Mike, I really wasn't into forcing digraphs except in LC_COLLATE, since that would be awkward for end users, and most other locales avoid digraphs anyway. Even the Unicode FAQ notes that they're troublesome in so many practical ways.
In the end, I'm open to the thoughts and arguments of others, especially end users, but this patch, in any conceivable way compared to the previous state, is a huge push towards a maintainable and clear hr_HR locale.
(In reply to Rafal Luzynski from comment #41)
> Before this change all abmon items (abbreviated month names) were 3 letters
> long. Now all are 3 letters long except the second item (February, Feb)
> which is "velj", 4 letters long. Previously it was "veǉ" therefore 3
> letters. Dragan, wouldn't you prefer it to be "vel", consequently 3 letters
> long?
No, I don’t think this makes sense because lj belongs together, one should not cut this digraph in the middle. Several other locales also have abbreviations for the month and day names longer than 3 characters. I think that is OK if it makes no sense to cut off after 3 characters.

That's OK, if "lj" is a digraph which should not be split and "vel" is not correct and "velj" is the correct abbreviation then let's leave it as is.