Bug 10580

Summary:	hr_HR: updated locale
Product:	glibc	Reporter:	Dragan Stanojevic - Nevidljivi <invisible>
Component:	localedata	Assignee:	Mike FABIAN <maiku.fabian>
Status:	RESOLVED FIXED
Severity:	enhancement	CC:	digitalfreak, glibc-bugs, kruno.se, maiku.fabian, pasky, semiRocket, tvujec
Priority:	P2	Flags:	fweimer: security-
Version:	unspecified
Target Milestone:	2.27
Host:		Target:
Build:		Last reconfirmed:
Attachments:	new hr_HR locale file
	Small Croatian dictionary for testing
	A new version of a hr_HR locale (with lowecased day and month names)
	An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections
	An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections
	Updated version of hr_HR
	Added week and first_weekday to the locale
	Small patch removing duplicated fields
	0001-hr_HR-locale-various-updates-BZ-10580.patch
	0002-Add-test-case-for-collation-in-hr_HR-locale.patch
	0003-Fix-test-case-for-hr_HR-monetary-formatting.patch
	0004-hr_HR-locale-fix-collation-and-expand-collation-test.patch

Description Dragan Stanojevic - Nevidljivi 2009-08-31 22:54:19 UTC

----------------
HISTORY:
----------------

The hr_HR locale started out as a copy of the sl_SI locale in glibc-2.0 and was
maintained by Borka Jerman-Blažič (from Slovenia). Shortly afterwards Tomislav
Vujec (then at CARNet, now at Red Hat) changed it to suit hr-specific needs.
After around 1998, the locale was only updated by glibc maintainer Ulrich
Drepper, who added or changed portions of it as part of mass updates to many
locales.

I contacted the current maintainer, Tomislav Vujec, last week and he is willing
to support the changes. Also, since it's been more than a decade since he made
changes to this locale, he noted that he'd be willing to pass maintainership to
someone else. BTW, he is also the maintainer of bs_BA, and I hope the Bosnian
translation team will take over maintenance of that locale...


----------------
RATIONALE:
----------------

The point is: hr_HR locale is now in a state of flux. It kind of works and fails
in fairly subtle ways when sorting digraphs. I have made numerous changes which
I'll describe below...

Croatia doesn't have a language law or any real specification of the language
rules for writing dates, monetary data and so on. Most of the language decisions
in real life are made using commonly established conventions. I'll justify my
decisions in my change descriptions below, using URLs where needed...

I really wanted to get this right, so I've read the whole archive of the
libc-locales mailing list (2004-now), and also ISO/IEC TR 14652 (albeit the 2002
edition, which I found for free on the Internet). I've looked at the history of
changes to the hr_HR locale through "git blame". I've also studied the sr_RS
locale, which is somewhat related to hr_HR since Croatian, Bosnian and Serbian
have (or had) a lot of common conventions. Initially I only wanted to change
LC_COLLATE, but it made sense to update the locale as a whole. That took far
more time than I anticipated, but the changes are worth it. I've (heavily)
commented the locale, so it should be easy to maintain from now on. (UTF-8
characters are used only in comments.) I've also contacted the team leaders of
all hr translation teams and pointed them to this bug report to give their
opinion on these changes, since they will be system-wide when accepted, and the
translators are, by definition, at the forefront of i18n and l10n efforts.

There are some general locale system errors which are not specific to the
Croatian locale, so if Ulrich Drepper (if he has some time) or someone else
versed in glibc internals can look at the change descriptions for LC_COLLATE,
LC_ADDRESS and LC_TELEPHONE, and help me a bit with the system errors found
there while using localedef, I'd be really thankful :o)


----------------
CHANGES:
----------------

% <initial comments>
	Mostly cleaned up comments and moved repeated information into
LC_IDENTIFICATION. Added a note that the charset used in Croatia should
primarily be UTF-8. Previously we used ISO-8859-2 (which should be phased out
since it doesn't support the digraph characters [ǆ, ǉ and ǌ]).
	Added my email to the authors list, just so I can be notified in the future
when the locale changes.

LC_IDENTIFICATION:
	I bumped revision to 2.0 (from 1.0) since this is a major rewrite of this locale.
	I have left "CARNet" and their address although I'm not really sure why CARNet
(Croatian Academic and Research Network) would have jurisdiction over hr_HR
locale. Not even Ministry of Education of Croatia has jurisdiction over it, as
they don't supply rules for writing dates, or monetary strings for example.
	category statements were updated to reflect the new changes. The standard
requires the first parameter to name the standard this category complies with,
but all other locales just use the locale name and a year here, so I did the
same.
	BTW most locales don't list all the categories which are included in their
file. For example, they usually don't include LC_MEASUREMENT. I did...

LC_CTYPE:
	Although ISO/IEC TR 14652 has controversial LC_XLITERATE category, glibc uses
"translit_start" inside LC_CTYPE. Hence I've added transliteration info (how to
transliterate digraphs to ISO-8859-2 and ASCII). I'm not really sure how to test
this, I hope I got it right.
	There is some weird behaviour in included "i18n"... For example it has same
character in "upper" AND "lower" class, so both iswupper() and iswlower() give
TRUE for <U01C5> {ǅ}. I guess this is ok.
	Another behaviour is that towupper() will map <U01C6> {ǆ} -> <U01C4> {Ǆ},
which can be wrong in some cases where <U01C5> {ǅ} is needed. This is not ok,
but it's not fixable in the current implementation anyway, so let's add it to
the curiosities for now :o)

LC_COLLATE:
	Major revision. I have included "iso14651_t1" like most locales to reap
benefits of "iso14651_t1" updates, as well as to significantly reduce hr_HR
locale size and increase readability
	collating-elements are created and linked to the right digraphs [ǆ, ǉ and ǌ]
	BTW "collating-element" shouldn't be used after "copy", but many locales use it
since there is no other way, except putting them in "iso14651_t1"
	Croatian alphabet considers č, ć, ǆ, đ, ǉ, ǌ, š and ž distinct letters, and
that was implemented with reorder-after statements
	localedef says I have SYNTAX ERROR in LC_COLLATE, probably not liking "<d><z>"
digraph literal. Is this really SYNTAX ERROR? It works though...

LC_TIME:
	Names of days and months are now written with the right digraphs, and not a
combination of ASCII letters. (Digraphs can nowadays be seen in CLI apps as
well, for example `cal 2009`.)
	d_t_fmt was changed to a format like: "Ponedjeǉak, 31. Kolovoz 2009. 16:35:05
CEST" (the best we can do in the current implementation; Croatian uses
declension in month names, like most Slavic languages) [ This format can be seen
on Croatian government pages http://vlada.hr/ ]
	date_fmt was changed to a format like: "Pon, 31.08.2009.  16:49:36 CEST" [
Croatia in general doesn't use short versions of month or day names. For the
month we usually use a number, as seen on the pages of the Croatian president
http://www.predsjednik.hr/ ]
	d_fmt was changed to a format like: "01.09.2009." for the same reasons as in
the date_fmt explanation. Croatians have read and written the dd.mm.yyyy format
for decades. If someone objects that it confuses people who use the mm.dd.yyyy
(US) format, I agree, but this is the hr_HR locale and this form is widely used
in Croatia. System software should use the YYYY-MM-DD format anyway, regardless
of locale.
	t_fmt is changed to format like: "HH:MM:SS"
	I've added week, first_weekday, first_workday. first_weekday and first_workday
are set to Monday

LC_NUMERIC
	I've set thousands_sep to '.', so numbers are formatted as "12.345.678,90" or
"-12.345.678,90"

LC_MONETARY
	I've lowercased currency_symbol to "kn" since that form is what the majority
of citizens/shops use nowadays. See online shops: http://www.links.hr/ ,
http://www.profil.hr/ , and many others. You can see there is no rule for this
at Wikipedia: http://hr.wikipedia.org/wiki/Hrvatska_kuna , where they note that
the symbol is "Kn" but use "kn" a lot on the same page
	I've set mon_thousands_sep to '.', matching thousands_sep in LC_NUMERIC
	I've changed the monetary string format to: "14.986,42 kn", "-14.986,42 kn"
and for international to "HRK 14.986,42" and "-HRK 14.986,42" as was agreed upon
in 2003 by Tomislav Vujec on libc-alpha [
http://sourceware.org/ml/libc-alpha/2003-04/msg00254.html ]. I'm not really sure
that in the international version HRK should come before the value (as said at
the top, there is no law on how to write monetary values in Croatia, just
conventions). I'd leave them the same as the local versions and just use HRK
instead of kn, but I've complied with the libc-alpha agreement of 2003 for now.
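
As an illustration of the grouping described for LC_NUMERIC and LC_MONETARY,
here is a small shell sketch that inserts '.' between groups of three digits by
hand. It does not use glibc or the locale at all: the function name
group_thousands is made up for this example, and it assumes the input already
uses ',' as the decimal point.

```shell
# Hypothetical helper: group the integer part of a number in threes with '.'.
# Pure text processing; the installed locale is not consulted.
group_thousands() {
    printf '%s\n' "$1" |
        sed -e ':a' \
            -e 's/^\(-\{0,1\}[0-9]\{1,\}\)\([0-9]\{3\}\)/\1.\2/' \
            -e 'ta'
}

group_thousands 12345678,90                       # prints: 12.345.678,90
printf '%s kn\n' "$(group_thousands 14986,42)"    # prints: 14.986,42 kn
```

The loop peels off one group of three digits per pass, starting from the right
end of the integer part, and stops when fewer than four leading digits remain,
so digits after the ',' are never touched.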

LC_MESSAGES:
	I've removed trailing .* in yesexpr and noexpr as it was discussed in
libc-locales mailing list [
http://sources.redhat.com/bugzilla/show_bug.cgi?id=71 ] that it's not really
necessary.
	I didn't include 1 in yesexpr and 0 in noexpr although this was discussed in
libc-locales mailing list too. But not many locales use it, so I've skipped it
for now
	I've added yesstr, and nostr

LC_NAME:
	Changed name_fmt to "salutation name other_name surnames"
	I've added name_mr, name_mrs, and name_miss. Croatia doesn't have gender
neutral salutation, nor neutral female (name_ms) version of salutation

LC_ADDRESS:
	postal_fmt is changed, so that address now looks like:
		Company name
		Department name
		Person's name
		C/O Person or Organization
		Street name and house number
		ZIP Code and City name
		Country
	localedef complains that postal_fmt has an invalid escape sequence, I don't
know why!?!
	I've added definitions for many missing attributes: country_post, country_car,
country_isbn, lang_name, lang_ab, lang_term and lang_lib

LC_TELEPHONE
	I've changed tel_int_fmt to look like: "+<country code> <area code without
leading 0> <local number> < possible ext>"
	I've changed tel_dom_fmt to look like: "<possible area code with leading 0>
<local number> <possible ext>"
	localedef complains that tel_int_fmt and tel_dom_fmt have invalid escape
sequence, I don't know why!?!
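
To make the intended output concrete, the sketch below expands a telephone
format string by plain text substitution. It only handles the ISO/IEC TR 14652
descriptors %c (country code), %a (area code without the leading 0), %A (area
code with the leading 0) and %l (local number); %e and %t are left out. The
function name expand_tel_fmt and the sample number are invented for
illustration, and this is not how glibc itself processes tel_int_fmt or
tel_dom_fmt.

```shell
# Hypothetical helper: expand a tel_*_fmt-style string by text substitution.
# $1 = format, $2 = country code, $3 = area code without leading 0,
# $4 = local number. Arguments are assumed to contain digits only.
expand_tel_fmt() {
    printf '%s\n' "$1" |
        sed -e "s/%c/$2/g" -e "s/%a/$3/g" -e "s/%A/0$3/g" -e "s/%l/$4/g"
}

expand_tel_fmt '+%c %a %l' 385 1 2345678   # prints: +385 1 2345678
expand_tel_fmt '%A %l' 385 1 2345678       # prints: 01 2345678
```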

LC_PAPER
	A4 is used in Croatia

LC_MEASUREMENT
	Croatia uses metric measurements


----------------
TESTING:
----------------

To see the file without <Uxxxx> literals, I made this ugly one-liner which makes
an HTML version of it. Just change the file name at the start and you can use it
with other locale files as well.

( FILE=hr_HR
  sed -e 's/<U\([0-9A-F][0-9A-F][0-9A-F][0-9A-F]\)>/\&lt;\&#x\1;\&gt;/g' \
      < "$FILE" > "$FILE.tmp"
  sed -e 's/</\&lt;/g' < "$FILE.tmp" > "$FILE.html"
  sed -e 's/>/\&gt;/g' < "$FILE.html" > "$FILE.tmp"
  echo "<pre>" > "$FILE.html"
  cat "$FILE.tmp" >> "$FILE.html"
  rm "$FILE.tmp" )

Also, to test collation in the hr_HR locale I made a small dictionary which has
Croatian digraphs in all forms, as well as the letters which are considered
distinct. To test collation with it I do the following: randomize it with
`sort -R`, and re-sort it. The resulting file should have the same MD5 as the
starting one...
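
A minimal sketch of that round-trip check as a shell function. The function name
collation_roundtrip and the default locale name hr_HR.UTF-8 are assumptions made
for this example; it compares the files with cmp instead of MD5 sums, which
answers the same question (are they byte-identical?).

```shell
# Hypothetical helper: shuffle a pre-sorted word list, sort it again under the
# given locale, and succeed only if the result is identical to the original.
# $1 = sorted dictionary file, $2 = locale name (assumed default: hr_HR.UTF-8)
collation_roundtrip() {
    dict=$1
    loc=${2:-hr_HR.UTF-8}
    tmp=$(mktemp) || return 2
    # sort -R (GNU coreutils) randomizes the order; the second sort restores it
    # according to the collation rules of the chosen locale.
    sort -R "$dict" | LC_ALL=$loc sort > "$tmp"
    if cmp -s "$dict" "$tmp"; then
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

Usage would be something like `collation_roundtrip dict_file && echo OK`. Note
that if the requested locale is not installed, sort will most likely fall back
to the C locale, so a failure may mean a missing locale rather than wrong
collation rules.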

Testing the other locale categories is a bit harder, but small C programs work
well, and most of the code templates you need are in the glibc source under
localedata anyway.

Comment 1 Dragan Stanojevic - Nevidljivi 2009-08-31 22:57:36 UTC

Created attachment 4158 [details]
new hr_HR locale file

This is not a patch file since it's completely new file and patch would be huge
without valid reason for it. This file is just 1/7th of the file it replaces!

Comment 2 Dragan Stanojevic - Nevidljivi 2009-08-31 23:00:33 UTC

Created attachment 4159 [details]
Small Croatian dictionary for testing

A small dictionary of already sorted Croatian words which contain digraphs, their
variations, and the letters which are considered distinct in the Croatian
alphabet, all of which affects sorting.

Use `sort -R` to randomize it, and `sort` to check you get the same version
back

Comment 3 Tomislav Vujec 2009-09-01 02:46:31 UTC

Just a note that I fully support the changes.

Furthermore, since I moved out of Croatia 10 years ago, I am unable to stay in
sync with language and locale relevant policies and rules. Therefore, I would
like to ask that a new maintainer be selected. I don't know if there is an
official process for this nowadays, but since Dragan did all this work, I would
like to support him if he wants to take over that role.

Comment 4 Dragan Stanojevic - Nevidljivi 2009-09-01 12:15:14 UTC

Thank you Tomislav for your support,

KDE l10n team leader contacted me, I'm still waiting for GNOME l10n team to have
their say.

As for new maintainership, I'm willing to accept it for hr_HR locale

Comment 5 Dragan Stanojevic - Nevidljivi 2009-09-03 01:07:41 UTC

GNOME translation team still hasn't responded to my query.

Instead I have contacted Croatian Ubuntu team which also does translation work,
as well as Croatian Linux news group.

So let's wait a few more days...

Comment 6 Dragan Stanojevic - Nevidljivi 2009-09-06 22:11:22 UTC

Created attachment 4175 [details]
A new version of a hr_HR locale (with lowecased day and month names)

Comment 7 Dragan Stanojevic - Nevidljivi 2009-09-06 22:13:06 UTC

Apart from lowercasing day and month names, nobody had any objections to this new
version of the locale.

I think this can be committed to libc-locales

Thank you all for your time,
N::

Comment 8 Ulrich Drepper 2009-10-29 23:31:23 UTC

The locale doesn't compile correctly:

/home/drepper/gnu/libc/localedata/locales/hr_HR:143: LC_COLLATE: syntax error
LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'
LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'
no output file produced because warnings were issued

Comment 9 Dragan Stanojevic - Nevidljivi 2009-10-30 01:35:16 UTC

(In reply to comment #8)
> The locale doesn't compile correctly:
> 
> /home/drepper/gnu/libc/localedata/locales/hr_HR:143: LC_COLLATE: syntax error
> LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'
> LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
> LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt'
> no output file produced because warnings were issued

Hi Ulrich, thanks for your time and reply...

I'm aware of these errors, but they are more system-wide errors than hr_HR ones.
I wrote about them in the long explanation... I'm quite sure you didn't have
time to read it all, but let me repeat the last paragraph of RATIONALE, which is
of importance here:

------------------------------------------------------------------------------
There are some general locale system errors which are not specific to the
Croatian locale, so if Ulrich Drepper (if he has some time) or someone else
versed in glibc internals can look at the change descriptions for LC_COLLATE,
LC_ADDRESS and LC_TELEPHONE, and help me a bit with the system errors found
there while using localedef, I'd be really thankful :o)
------------------------------------------------------------------------------

Allow me to elaborate just a bit to make it easier for you:

"hr_HR:143: LC_COLLATE: syntax error" ::
I've used quotes to mark digraph <d><z>. I used that designation since the same
designation is used in "iso14651_t1_common" file...
Look with: `grep '<d><z>' iso14651_t1_common`

"LC_ADDRESS: invalid escape `%n' sequence in field `postal_fmt'" ::
%n is a valid escape sequence per ISO/IEC TR 14652. It states:
"%n -- Person's name, possibly constructed with the LC_NAME "name_fmt" keyword"

LC_TELEPHONE: invalid escape sequence in field `tel_int_fmt'
and
LC_TELEPHONE: invalid escape sequence in field `tel_dom_fmt' ::
Again, as per ISO/IEC TR 14652, it contains no invalid escape sequence... %c %a
%A %l %e and %t are mentioned in the standard.

Thank you once more...
N::

Comment 10 Petr Baudis 2013-02-16 00:39:45 UTC

Dragan, thank you for your work.

It is true that the locales in glibc are not fully ISO/IEC 14652 compliant, in particular some fields that should be used in fact are not. I'm not personally sure why this is the case, probably it's for purely historical reasons. However, I believe the greatest value lies in consistency, and if no current locales use %n in postal_fmt and %e and %t in tel_*_fmt, neither should hr_HR as the programs using these locales probably do not expect to find these field descriptors there.

So let's not conflate the issue of unsupported field descriptors with the new hr_HR locale; could you please submit an hr_HR locale version that does not use these field descriptors? Since you got a buy-in from other Croatians active in this area, I think we can commit the new locale speedily afterwards.

(Regarding the issue of unsupported field descriptors, if you are interested in pursuing that further. A simple technical fix is to simply patch locale/programs/ld-{telephone,address}.c to allow these. However, we should do this with consideration to locale consistency and current usage of these categories in programs. This needs to be researched and I think the next reasonable step is to document the currently supported field descriptors in "glibc style locales". We can then think of how to proceed further while our users will already have a valuable reference. This process can be done gradually, category by category. Does that sound sensible?)

Comment 11 keld@keldix.com 2013-02-16 18:49:08 UTC

On Sat, Feb 16, 2013 at 12:39:45AM +0000, pasky at ucw dot cz wrote:
> (Regarding the issue of unsupported field descriptors, if you are interested in
> pursuing that further. A simple technical fix is to simply patch
> locale/programs/ld-{telephone,address}.c to allow these. However, we should do
> this with consideration to locale consistency and current usage of these
> categories in programs. This needs to be researched and I think the next
> reasonable step is to document the currently supported field descriptors in
> "glibc style locales". We can then think of how to proceed further while our
> users will already have a valuable reference. This process can be done
> gradually, category by category. Does that sound sensible?)

I would rather take another approach, and that would be to further implement
ISO TR 14652 or the new version thereof, ISO TR 30112. ISO TR 30112 is closer
to glibc, as some things that glibc implements are now specified in
30112, including LC_PAPER.

Best regards
Keld

Comment 12 Dragan Stanojevic - Nevidljivi 2013-02-16 19:10:32 UTC

Thank you for your comments!

So if I understand correctly, I just need to trim LC_ADDRESS and LC_TELEPHONE to comply with current support in glibc, and you'll accept the whole patch?

That would be great, since it resolves a lot of issues, shortens the file, makes it more manageable for future changes, and so on...

Back then I read the whole of ISO/IEC TR 14652, tried to mimic the format of other locales as much as possible, and I think I made a good patch. In the end I thought I'd need to learn flex & bison to improve glibc's parsing of that data, but that was beyond me.

Comment 13 Petr Baudis 2013-02-17 00:13:28 UTC

Keld, of course using the newer standard makes sense; however, I'm not sure what you mean by "further implement" and how that differs from what I wrote. If you are interested in discussing this further, I propose we move the discussion to the mailing list where more people could follow it. (Note that I myself don't have the time to pursue the issue itself, so it makes sense to talk more about it only if someone intends to do anything about it.)

Dragan, I'm sorry, I missed the LC_COLLATE syntax error. Any reason why we cannot use the unicode entity there instead?

Also, I'm wondering, how was testing of this locale done if it doesn't even compile with glibc's localedef now? And which of the people that provided support for the new locale actually tested it rather than just embraced the idea?

Comment 14 keld@keldix.com 2013-02-17 00:41:53 UTC

On Sun, Feb 17, 2013 at 12:13:28AM +0000, pasky at ucw dot cz wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=10580
> 
> --- Comment #13 from Petr Baudis <pasky at ucw dot cz> 2013-02-17 00:13:28 UTC ---
> Keld, of course using the newer standard makes sense; however, I'm not sure
> what do you mean by "further implement" and how that differs from what I wrote.
> If you are interested in discussing this further, I propose we move the
> discussion to the mailing list where more people could follow it. (Note that I
> myself don't have the time to pursue the issue itself, so it makes sense to
> talk more about it only if someone intends to do anything about it.)

So where should we have the discussion? I did think that
this list was relevant. Anyway, the differences are not big.
It is mostly to align with the current glibc implementation, and then
introduce 2 novelties.

Best regards
Keld

Comment 15 Dragan Stanojevic - Nevidljivi 2013-02-17 13:56:29 UTC

Hi,

let me be frank. This was made in 2009. I spent at least a week reading ISO documents, comparing with other locales similar to hr_HR, contacting the Croatian Linux User Group and writing tests.

Every question concerning compile errors was answered in the huge description of the patch, and repeated in comment #9, since obviously Drepper didn't read it in the first place when he dissed the patch.

If you don't want LC_ADDRESS or LC_TELEPHONE, copy them from C locale. If you don't want to implement "<d><z>", comment it out...

Also there is no standard test suite for these locale categories. I find it hard to believe that I (or any locale writer) should have to write custom test suites from scratch again, nor do I have the time.

I repeat, this patch was a big improvement in 2009. I don't have time to write test suites from scratch again, let alone to reread ISO documents and patch libc itself.

It's your choice whether you ever apply this patch.

Comment 16 Petr Baudis 2013-02-17 15:05:21 UTC

I have read everything you have written in this bugreport; I might have missed something, but I asked my questions because I believe they weren't answered in the previous comments.

My question was not geared at test suites, though I appreciate your effort to test the collation rules. I was just wondering whether and how this locale (considering that it cannot be compiled by localedef as it is now) was tried out with actual commonly used software, and whether that was done just by you or by the other people supporting it too.

If you could adjust the locale into a compilable form, we can easily ask others to test it so that we can incorporate any bugfixes before the next release; this (besides a few simple sanity checks I'll do) does not need to block committing the new locale.

Comment 17 Dragan Stanojevic - Nevidljivi 2013-02-17 18:16:42 UTC

Heyyah Petr,

thanks for reading and for the reply. Give me a few days, and I'll try to test and fix this patch so it compiles with 2.17.

I cannot vouch for the testing done by others who saw and approved this patch. I did it myself as I was displeased with the state of the hr_HR locale back then.

I was mainly interested in collation, but did a lot more research than intended, and in turn patched all categories of the locale. During that, I cleaned, commented and trimmed the locale file considerably.

bye for now,
N::

Comment 18 Petr Baudis 2013-02-17 18:23:29 UTC

Hi! Yes, I fully appreciate your efforts - I just want to confirm the status of the new locale regarding how it has been tested.

Glad you decided to update your version of the locale, we will be looking forward to the new version. I can't think of specific updates that would be required for 2.17 (there were no changes in stock hr_HR since 2009), so mainly making it compile would be great.

Comment 19 Dragan Stanojevic - Nevidljivi 2013-02-19 06:32:41 UTC

Created attachment 6876 [details]
An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections

This is the promised update to the hr_HR locale.

Changes are:
- bumped revision to 2.1 and the date to the current date
- removed duplicate character transliterations from LC_CTYPE which are found in i18n
- fixed the LC_COLLATE error, and tested with the small Croatian dictionary provided in 2009 using "sort -R dict_file > scrambled_file; sort scrambled_file > sorted_file"; the md5 sums of the original <dict_file> and <sorted_file> are the same
- updated some comments and some spacing
- changed thousands_sep and mon_thousands_sep to " " instead of "." to comply with the suggestions in language books published since 2009.
- updated LC_ADDRESS to remove the %n (person's name) field since it's not yet available in the code. Other locales fall back to %a (care of person or organization) and that's ok for now.
- cleaned LC_TELEPHONE by removing the %t (space or null string) and %e (extension) fields, which are currently unsupported in the code. Fell back to "+%c %a %l" and "%A %l" as seen in other locales.

Locale now compiles cleanly using localedef...
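
For anyone who wants to repeat the compile check without touching the installed
locales, a rough sketch follows. The wrapper name compile_locale, the directory
layout and the default locale name are assumptions for this example; the
localedef options used (-i for the source file, -f for the charmap, and an
output path containing a slash) are the standard ones.

```shell
# Hypothetical wrapper: compile a locale source file into a private directory.
# $1 = locale source file (use ./hr_HR rather than hr_HR so it is taken as a
#      path and not looked up in the system locale source directory)
# $2 = output directory, $3 = locale name (assumed default: hr_HR.UTF-8)
compile_locale() {
    src=$1
    outdir=$2
    name=${3:-hr_HR.UTF-8}
    mkdir -p "$outdir" || return 1
    # The output path contains a slash, so localedef writes the compiled
    # locale into that directory instead of the system locale archive.
    localedef -i "$src" -f UTF-8 "$outdir/$name"
}
```

Something like `compile_locale ./hr_HR ./locales` followed by
`LOCPATH=$PWD/locales LC_ALL=hr_HR.UTF-8 date` should then exercise the freshly
built locale. By default localedef refuses to write output when it has issued
warnings; its -c (--force) option overrides that, which is handy for inspecting
a locale that does not yet compile cleanly.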

Comment 20 Dragan Stanojevic - Nevidljivi 2013-02-19 06:50:42 UTC

Created attachment 6877 [details]
An updated version of hr_HR which solves problems with LC_COLLATE, LC_ADDRESS and LC_TELEPHONE sections

Fixed small typos in comments...

Reset the bug status to "NEW", to signify it's ready for review by the maintainers of the library...

Thanks for your time,
N::

Comment 21 Dragan Stanojevic - Nevidljivi 2013-04-07 03:22:14 UTC

Will you accept this patch? It also fixes #15264

Comment 22 Dragan Stanojevic - Nevidljivi 2013-05-01 17:23:29 UTC

Created attachment 7010 [details]
Updated version of hr_HR

Removed CARNet as source of the locale, and their address since I don't have any official relation to them, and the locale is completely changed.

Small fixes in comments of the locale

Bumped version to 2.2 and date to 2013-05-01

Comment 23 semiRocket 2014-06-15 08:03:25 UTC

Where did it hang for so long?

First weekday still wrong in Fedora after 1 year https://sourceware.org/bugzilla/show_bug.cgi?id=14892

Comment 24 Mike Frysinger 2016-04-15 17:58:49 UTC

*** Bug 14892 has been marked as a duplicate of this bug. ***

Comment 25 Mike Frysinger 2016-04-16 07:38:45 UTC

week settings should be fixed by:
  https://sourceware.org/ml/libc-alpha/2016-04/msg00419.html

Comment 26 Dragan Stanojevic - Nevidljivi 2016-04-16 15:02:20 UTC

Created attachment 9196 [details]
Added week and first_weekday to the locale

As requested, locale now contains missing "week" and "first_weekday" fields...

Comment 27 Dragan Stanojevic - Nevidljivi 2016-04-16 15:11:58 UTC

Created attachment 9197 [details]
Small patch removing duplicated fields

Small fix of removing multiple week and first_weekday...

SemiRocket and Mike, thank you for your interest in getting this moving again. If you find any mistakes, please let me know so we can finally ship this with glibc-2.24 and finally have a clean and, more importantly, correct locale.

Comment 28 Krunose 2017-04-06 20:08:55 UTC

Will that affect the sorting order of the sort command on the GNU/Linux command line? If yes, I've been waiting for that status to change to FIXED since 2014. :)

I'm sorry for not being able to contribute a constructive comment, but I'm hoping to keep this alive since the last comment was made a year ago.

Thanks

Comment 29 Dragan Stanojevic - Nevidljivi 2017-04-07 10:31:23 UTC

In the #1 post from 2009, look under TESTING... there you have a sample using sort command...

Comment 30 Krunose 2017-04-07 11:38:44 UTC

(In reply to Dragan Stanojevic - Nevidljivi from comment #29)
> In the #1 post from 2009, look under TESTING... there you have a sample
> using sort command...

Had no idea it could work that way. This will save me a lot of the trouble I'm going through right now when sorting Croatian text.

I'll try to contact you via e-mail because I have some more questions about localization files in general, and I'm thinking about changing one so I need some help. I don't want to spam this report as it serves a different purpose.

I just hope to see hr_HR.utf8 in Debian soon.

Many thanks for help and effort.

Comment 31 Mike FABIAN 2017-11-30 08:16:04 UTC

*** Bug 22518 has been marked as a duplicate of this bug. ***

Comment 32 Mike FABIAN 2017-11-30 11:47:55 UTC

Created attachment 10651 [details]
0001-hr_HR-locale-various-updates-BZ-10580.patch

Comment 33 Mike FABIAN 2017-11-30 11:48:22 UTC

Created attachment 10652 [details]
0002-Add-test-case-for-collation-in-hr_HR-locale.patch

Comment 34 Mike FABIAN 2017-11-30 11:49:01 UTC

Created attachment 10653 [details]
0003-Fix-test-case-for-hr_HR-monetary-formatting.patch

Comment 35 Mike FABIAN 2017-11-30 11:49:30 UTC

Created attachment 10654 [details]
0004-hr_HR-locale-fix-collation-and-expand-collation-test.patch

Comment 36 Mike FABIAN 2017-11-30 11:51:15 UTC

The patches attached to comment#32, comment#33, comment#34, and comment#35 :

0001-hr_HR-locale-various-updates-BZ-10580.patch
0002-Add-test-case-for-collation-in-hr_HR-locale.patch
0003-Fix-test-case-for-hr_HR-monetary-formatting.patch
0004-hr_HR-locale-fix-collation-and-expand-collation-test.patch

update Dragan Stanojevic’s patch to current glibc master.

Comment 37 Sourceware Commits 2017-11-30 14:23:50 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  5e56e937c9144e70a16793d2c5aa22d1bd0b2c18 (commit)
       via  cf4341ca90164398c05e74f72ff19dc52136731c (commit)
       via  9ca6b343783236fda88e9712f29b46ec875d4156 (commit)
       via  37075ae18d10802b9d62db3fbc910b30e01398d4 (commit)
      from  f33632ccd1dec3217583fcfdd965afb62954203c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5e56e937c9144e70a16793d2c5aa22d1bd0b2c18

commit 5e56e937c9144e70a16793d2c5aa22d1bd0b2c18
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Nov 30 12:13:02 2017 +0100

    hr_HR locale: fix collation and expand collation test file
    
    	* localedata/locales/hr_HR (LC_COLLATE): Fix collation
    	to make test case pass.
    	* localedata/hr_HR.UTF-8.in: Add more test strings.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=cf4341ca90164398c05e74f72ff19dc52136731c

commit cf4341ca90164398c05e74f72ff19dc52136731c
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Nov 30 10:50:44 2017 +0100

    Fix test case for hr_HR monetary formatting
    
    	* stdlib/tst-strfmon_l.c: Fix testcase. Needed because of [BZ #10580]

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9ca6b343783236fda88e9712f29b46ec875d4156

commit 9ca6b343783236fda88e9712f29b46ec875d4156
Author: Dragan Stanojević - Nevidljivi <invisible@hidden-city.net>
Date:   Thu Nov 30 10:02:55 2017 +0100

    Add test case for collation in hr_HR locale
    
    	* localedata/Makefile: Add hr_HR.UTF-8 to test-input and to
    	the list of locales to built for testing.
    	* localedata/hr_HR.UTF-8.in: New file.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=37075ae18d10802b9d62db3fbc910b30e01398d4

commit 37075ae18d10802b9d62db3fbc910b30e01398d4
Author: Dragan Stanojević - Nevidljivi <invisible@hidden-city.net>
Date:   Thu Nov 30 09:14:51 2017 +0100

    hr_HR locale: various updates [BZ #10580]
    
    	[BZ #10580]
            * localedata/locales/hr_HR (LC_COLLATE): Base collation rules on
            iso14651_t1.
            * localedata/locales/hr_HR (LC_TIME): Sync month and day names with
            CLDR (except use ligatures for the digraphs, CLDR does not use
            the ligatures), add first_workday, some fixes in the date and time
            formats.
            * localedata/locales/hr_HR (LC_CTYPE): Add transliteration rules
            for Đ and đ.
            * localedata/locales/hr_HR (LC_MONETARY): Change currency_symbol to
            lower case. p_cs_precedes and n_cs_precedes should be 0 instead of 1.
            Add int_p_cs_precedes and int_n_cs_precedes.
            * localedata/locales/hr_HR (LC_NUMERIC): Change thousands_sep to
            "<U202F>" (NARROW NO-BREAK SPACE) and grouping to 3;3 (Agrees with
            LC_MONETARY now).
            * localedata/locales/hr_HR (LC_TELEPHONE): Add tel_dom_fmt.
    	* localedata/locales/hr_HR (LC_NAME): Add name_mr, name_mrs, and
            name_miss.
    	* localedata/locales/hr_HR (LC_ADDRESS): Add country_post, country_isbn,
            and lang_lib. Change postal_fmt.
    
    change

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                 |   39 +
 localedata/Makefile       |    4 +-
 localedata/hr_HR.UTF-8.in |   70 ++
 localedata/locales/hr_HR  | 2324 ++++-----------------------------------------
 stdlib/tst-strfmon_l.c    |    8 +-
 5 files changed, 303 insertions(+), 2142 deletions(-)
 create mode 100644 localedata/hr_HR.UTF-8.in
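
To make the LC_NUMERIC entry in the commit log above concrete, here is a minimal Python sketch of what thousands_sep "<U202F>" combined with grouping 3;3 means for the digits of an integer. The helper name group_digits is hypothetical, and this is not how glibc implements grouping; it only reproduces the effect by hand under the assumption that every group has three digits.

```python
NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE, the new thousands_sep


def group_digits(value: int, sep: str = NNBSP, size: int = 3) -> str:
    """Group decimal digits from the right in blocks of `size`.

    Hypothetical helper for illustration only; it mimics grouping "3;3",
    i.e. groups of three digits repeated for the whole number.
    """
    sign = "-" if value < 0 else ""
    digits = str(abs(value))
    groups = []
    while digits:
        groups.append(digits[-size:])   # take the rightmost block
        digits = digits[:-size]         # and drop it from the remainder
    return sign + sep.join(reversed(groups))
```

With these settings 1234567 is rendered as three groups ("1", "234", "567") joined by U+202F rather than by a dot or a regular space.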

Comment 38 Mike FABIAN 2017-11-30 14:24:50 UTC

Fixed in glibc master.

Comment 39 Sourceware Commits 2017-12-04 17:36:49 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  96b06a19e602557bfa668ad9c1a9f29044d3e774 (commit)
       via  1f6d91f328b7699610210d7d56d2cc49d60e1c27 (commit)
      from  2e49fed84c9ada0ad54445d197060dc28ee94103 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=96b06a19e602557bfa668ad9c1a9f29044d3e774

commit 96b06a19e602557bfa668ad9c1a9f29044d3e774
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Mon Dec 4 17:46:28 2017 +0100

    tr_TR locale: Base collation on iso14651_t1 [BZ #22527]
    
    	[BZ #22527]
    	*  localedata/locales/tr_TR (LC_COLLATE): Base collation rules
    	on iso14651_t1. A test file localedata/tr_TR.UTF-8.in is already
    	available, this rewrite of the collation rules does reproduce
    	the test file in the same order.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1f6d91f328b7699610210d7d56d2cc49d60e1c27

commit 1f6d91f328b7699610210d7d56d2cc49d60e1c27
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Mon Dec 4 13:10:29 2017 +0100

    hr_HR locale: Don’t use single code points for the digraphs in LC_TIME
    
    	[BZ #10580]
    	* localedata/locales/hr_HR (LC_TIME): Use two letters for the
    	digraphs in the month and day names. Using single code points for
    	digraphs is deprecated.  While there are dedicated Unicode
    	codepoints, for the digraphs, these are included for backwards
    	compatibility and modern texts use a sequence of Basic Latin
    	characters. See: https://www.unicode.org/faq/ligature_digraph.html
    	This makes the month and day names agree exactly with CLDR now,
    	CLDR does not use the single code points for the digraphs either.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                |   20 +
 localedata/locales/hr_HR |   18 +-
 localedata/locales/tr_TR | 2112 ++--------------------------------------------
 3 files changed, 82 insertions(+), 2068 deletions(-)

Comment 40 Dragan Stanojevic - Nevidljivi 2017-12-04 18:27:32 UTC

Big thanks to Mike FABIAN for working on resolving this, and for being thorough with the final solution: brainstorming on digraph usage, bringing the locale more in line with CLDR, and making it more practical by avoiding single-code-point digraphs in LC_TIME...

Comment 41 Rafal Luzynski 2017-12-05 23:57:36 UTC

(In reply to cvs-commit@gcc.gnu.org from comment #39)
> [...]
> commit 1f6d91f328b7699610210d7d56d2cc49d60e1c27
> Author: Mike FABIAN <mfabian@redhat.com>
> Date:   Mon Dec 4 13:10:29 2017 +0100
> 
>     hr_HR locale: Don’t use single code points for the digraphs in LC_TIME
>     
>     	[BZ #10580]
>     	* localedata/locales/hr_HR (LC_TIME): Use two letters for the
>     	digraphs in the month and day names. Using single code points for
>     	digraphs is deprecated.  While there are dedicated Unicode
>     	codepoints, for the digraphs, these are included for backwards
>     	compatibility and modern texts use a sequence of Basic Latin
>     	characters. See: https://www.unicode.org/faq/ligature_digraph.html
>     	This makes the month and day names agree exactly with CLDR now,
>     	CLDR does not use the single code points for the digraphs either.
> [...]

Before this change all abmon items (abbreviated month names) were 3 letters long. Now all are 3 letters long except the second item (February, Feb), which is "velj", 4 letters long. Previously it was "veǉ", therefore 3 letters. Dragan, wouldn't you prefer it to be "vel", consequently 3 letters long? The page https://vlada.gov.hr/ uses "Vel". CLDR uses "velj", so if you'd like this change I suggest creating a new ticket in CLDR first: http://unicode.org/cldr/trac/newticket
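
The length difference described here comes down to code points: "veǉ" ends in the single code point U+01C9, while "velj" spells the same digraph with two Basic Latin letters. A minimal Python sketch of that difference follows; the variable and function names are illustrative and not taken from glibc or CLDR.

```python
import unicodedata

OLD_ABMON = "ve\u01c9"  # "veǉ": last character is the single code point U+01C9
NEW_ABMON = "velj"      # same abbreviation written with two Basic Latin letters


def code_point_names(text: str) -> list[str]:
    """Return the Unicode character name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]


def to_two_letter_digraphs(text: str) -> str:
    """Replace compatibility characters such as U+01C9 by their
    decomposition, using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)
```

Here len(OLD_ABMON) is 3 and len(NEW_ABMON) is 4, and to_two_letter_digraphs(OLD_ABMON) gives the same string as NEW_ABMON, because U+01C9 has a compatibility decomposition to "l" followed by "j".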

Comment 42 Dragan Stanojevic - Nevidljivi 2017-12-06 03:15:14 UTC

> Before this change all abmon items (abbreviated month names) were 3 letters
> long. Now all are 3 letters long except the second item (February, Feb)
> which is "velj", 4 letters long. Previously it was "veǉ" therefore 3
> letters. Dragan, wouldn't you prefer it to be "vel", consequently 3 letters
> long?

True, before this change all were 3 letters, but in discussion with Mike several arguments were made against using single-code-point digraphs in LC_TIME:
- Unicode has since moved away from promoting them.
- Unicode has had a lot of problems with digraphs and even tried to solve them with "U+034F COMBINING GRAPHEME JOINER", so that a digraph would be glued together by it but still written as two separate letters.
- Digraphs often look ugly in fonts, or are missing from them and get substituted from another font. Terminals in general don't have Unicode fonts, and in TUI apps it is better not to force digraphs. Examples would be `cal`, TUI mail clients, the shell prompt, tmux, ...
- Abbreviations in many glibc locales aren't 3 letters long. There is no rule that they need to be; they just need to be shorter than the full name.
- I wrongly assumed that all abbreviations needed to be of the same length; they don't. If that were the case I'd be more stubborn about digraphs, but as it is I'm more in favor of "Velj".
- Many applications and many programmers decided to avoid the glibc locale since it was ugly. They either made their own (LibreOffice, for example), or they do something like taking the first 3 letters of a month or day name, which gives them the wrong value "Vel". "lj" is a digraph and a distinct phoneme, sounding different from a simple "l". IMO "Vel" is more wrong than "Velj".
- Glibc and CLDR were once very stern in what they'd accept. Now they've become more pragmatic. One result is this issue with digraphs, but I hope it is clear that it was done with end users in mind. Not many Unicode digraphs are in use, and people will continue to type two letters for them since entering digraphs is still awkward.

In the end, this patch was done more than 8 years ago. It was a complete rewrite of the old locale, and the intention was to make it correct and easy to read and maintain. During those 8+ years several bugs were filed against hr_HR and all of them were dups of this one, since I had solved all of those issues back then. Yet so many maintainers avoided this patch for one reason or another. During the discussion with Mike, I really wasn't into forcing digraphs except in LC_COLLATE, since that would be awkward for end users, and most other locales avoid digraphs anyway. Even the Unicode FAQ notes that they're troublesome in so many practical ways.

In the end, I'm open to the thoughts and arguments of others, especially end users, but this patch, however you compare it to the previous state, is a huge push towards a maintainable and clear hr_HR locale.
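
The point above about applications that take the first 3 letters of a month name can be sketched in Python. Everything here is illustrative: the helper names are hypothetical, the digraph list is an assumption limited to "dž", "lj" and "nj", and the code treats every such letter pair as a digraph, which is a simplification. glibc itself does not compute abbreviations; they are written out by hand in the locale file.

```python
DIGRAPHS = ("dž", "lj", "nj")  # assumed set of Croatian digraphs


def split_letters(word: str) -> list[str]:
    """Split `word` into letters, keeping each digraph together as one letter."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)     # keep the digraph as a single unit
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


def naive_abbrev(word: str, count: int = 3) -> str:
    """What many applications do: cut after `count` characters."""
    return word[:count]


def digraph_aware_abbrev(word: str, count: int = 3) -> str:
    """Cut after `count` letters, counting a digraph as one letter."""
    return "".join(split_letters(word)[:count])
```

For "veljača" (February) the naive cut yields "vel", splitting the digraph, while the digraph-aware cut yields "velj", which matches the abbreviation now used in the locale and in CLDR.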

Comment 43 Mike FABIAN 2017-12-06 06:55:54 UTC

(In reply to Rafal Luzynski from comment #41)

> Before this change all abmon items (abbreviated month names) were 3 letters
> long. Now all are 3 letters long except the second item (February, Feb)
> which is "velj", 4 letters long. Previously it was "veǉ" therefore 3
> letters. Dragan, wouldn't you prefer it to be "vel", consequently 3 letters
> long?

No, I don’t think this makes sense because lj belongs
together, one should not cut this digraph in the middle.

Several other locales also have abbreviations
for the  month and day names longer than 3 characters.
I think that is OK if it makes no sense to cut off
after 3 characters.

Comment 44 Rafal Luzynski 2017-12-06 22:25:51 UTC

That's OK. If "lj" is a digraph which should not be split, so that "vel" is not correct and "velj" is the correct abbreviation, then let's leave it as is.