17750 – wrong collation order of diacritics in most locales

Summary: wrong collation order of diacritics in most locales

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	unspecified

Importance:	P2 normal
Target Milestone:	2.27
Assignee:	Alexandre Oliva

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-12-23 04:25 UTC by Alexandre Oliva
Modified:	2017-12-03 15:02 UTC
CC List:	5 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
attachment-71592-0.html (701 bytes, text/html) 2017-12-03 15:02 UTC, Chris Leonard

Description Alexandre Oliva 2014-12-23 04:25:27 UTC

http://www.unicode.org/reports/tr10/tr10-30.html states:

<quote>
Normally, all differences in sorting are assessed from the start to the end of the string. If all of the base letters are the same, the first accent difference determines the final order. In row 1 of Table 5, the first accent difference is on the o, so that is what determines the order. In some French dictionary ordering traditions, however, it is the last accent difference that determines the order, as shown in row 2.
</quote>

Table 5 says:

<pre>
Normal Accent Ordering  	cote < coté < côte < côté
Backward Accent Ordering 	cote < côte < coté < côté
</pre>
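
For illustration only (this sketch is not part of the original report, and it is not how glibc implements collation): the two rows of Table 5 can be reproduced with a simplified two-level sort key in Python. The helper name `accent_key` is hypothetical. It compares base letters first and accent marks second, reading the accents either left to right (forward) or right to left (backward).

```python
import unicodedata

def accent_key(word, backward=False):
    """Simplified two-level key: base letters, then accents (optionally reversed)."""
    base, accents = [], []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and base:
            accents[-1] += ch          # attach the mark to the preceding base letter
        else:
            base.append(ch)
            accents.append("")         # no accent seen (yet) on this letter
    if backward:
        accents.reverse()              # the last accent difference decides first
    return ("".join(base), accents)

words = ["côté", "coté", "cote", "côte"]
forward = sorted(words, key=accent_key)
backward = sorted(words, key=lambda w: accent_key(w, backward=True))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

The sketch ignores case, the relative weights of different accents, and every other level of the real algorithm; it only shows how reversing the accent level changes the result.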

However, glibc implements backward accent ordering for all locales except de_DE and lb_LU.  

Unicode CLDR 26 confirms this is wrong: the only file in http://unicode.org/cldr/trac/browser/tags/release-26/common/collation/ that has settings backwards="on" is fr_CA.xml.

Comment 1 Alexandre Oliva 2014-12-23 04:30:07 UTC

Mine.  I posted a patch at https://sourceware.org/ml/libc-alpha/2014-12/msg00524.html

Comment 2 keld@keldix.com 2014-12-23 18:11:54 UTC

On Tue, Dec 23, 2014 at 04:25:27AM +0000, aoliva at sourceware dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
>             Bug ID: 17750
>            Summary: wrong collation order of diacritics in most locales
>            Product: glibc
>            Version: unspecified
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: aoliva at sourceware dot org
>                 CC: libc-locales at sourceware dot org
> 
> http://www.unicode.org/reports/tr10/tr10-30.html states:
> 
> <quote>
> Normally, all differences in sorting are assessed from the start to the end of
> the string. If all of the base letters are the same, the first accent
> difference determines the final order. In row 1 of Table 5, the first accent
> difference is on the o, so that is what determines the order. In some French
> dictionary ordering traditions, however, it is the last accent difference that
> determines the order, as shown in row 2.
> </quote>
> 
> Table 5 says:
> 
> <pre>
> Normal Accent Ordering      cote < coté < côte < côté
> Backward Accent Ordering     cote < côte < coté < côté
> </pre>
> 
> However, glibc implements backward accent ordering for all locales except de_DE
> and lb_LU.  
> 
> Unicode CLDR 26 confirms this is wrong: the only file in
> http://unicode.org/cldr/trac/browser/tags/release-26/common/collation/ that has
> settings backwards="on" is fr_CA.xml.

This was probably done because if there is more than one accented letter in a string,
the word or name is probably French, and then the French rules should be followed.
This would mean that CLDR is wrong.

Best regards
Keld

Comment 3 Alexandre Oliva 2014-12-23 23:00:50 UTC

Even if your assumption that more than one diacritic in a word implied the word was in French were correct, there are various other points that make your suggestion flawed.

First of all, the forward or backward accent ordering doesn't even apply to all French speakers.

Second, there are words with more than one diacritic in other languages.  I happen to be a native speaker of one such language.

Third, you don't need more than one diacritic in a word to trigger the problem.  Consider Cortes, Córtes, and Cortés; pelo, pêlo, pelô; Schlagerforderung, Schlagerförderung, Schlägerforderung, Schlägerförderung.

Fourth, Unicode and CLDR are the result of a lot of work by a lot of people who study lots of languages and local customs.  It would take a lot more than groundless speculation to conclude they're wrong.  (Which is not to say they're perfect in all regards, of course ;-)

Comment 4 Carlos O'Donell 2014-12-24 14:04:37 UTC

(In reply to Alexandre Oliva from comment #3)
> Even if your assumption that more than one diacritic in a word implied the
> word was in French, there are various other points that make your suggestion
> flawed.
> 
> First of all, the forward or backward accent ordering doesn't even apply to
> all French speakers.
> 
> Second, there are words with more than one diacritic in other languages.  I
> happen to be a native speaker of one such language.
> 
> Third, you don't need more than one diacritic in a word to trigger the
> problem.  Consider Cortes, Córtes, and Cortés; pelo, pêlo, pelô;
> Schlagerforderung, Schlagerförderung, Schlägerforderung, Schlägerförderung.
> 
> Fourth, Unicode and CLDR are the result of a lot of work by a lot of people
> who study lots of languages and local customs.  It would take a lot more
> than groundless speculation to conclude they're wrong.  (Which is not to say
> they're perfect in all regards, of course ;-)

I agree with Alex. We would need a very detailed analysis of why CLDR is wrong to ignore their implementation and do something different.

Comment 5 keld@keldix.com 2014-12-25 11:38:32 UTC

On Tue, Dec 23, 2014 at 11:00:50PM +0000, aoliva at sourceware dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> --- Comment #3 from Alexandre Oliva <aoliva at sourceware dot org> ---
> Even if your assumption that more than one diacritic in a word implied the word
> was in French, there are various other points that make your suggestion flawed.
> 
> First of all, the forward or backward accent ordering doesn't even apply to all
> French speakers.
> 
> Second, there are words with more than one diacritic in other languages.  I
> happen to be a native speaker of one such language.
> 
> Third, you don't need more than one diacritic in a word to trigger the problem.
>  Consider Cortes, Córtes, and Cortés; pelo, pêlo, pelô; Schlagerforderung,
> Schlagerförderung, Schlägerforderung, Schlägerförderung.
> 
> Fourth, Unicode and CLDR are the result of a lot of work by a lot of people who
> study lots of languages and local customs.  It would take a lot more than
> groundless speculation to conclude they're wrong.  (Which is not to say they're
> perfect in all regards, of course ;-)

1. Which French speakers do not use the backward accent ordering?
I do have access to some of the sorting experts from the French community.

2. I see that for some languages, e.g. German, it makes sense to use forward ordering on accents.
   Which languages would that apply to?

3. Yes, I see that there may be just one accent in some strings, and then the
ordering depends on the position. I was involved in the current recommendation to
use backward ordering in the default tables, and I was not the only one:
the recommendation came out of the sorting experts in ISO and, I believe,
also in CEN.

4. Well, CLDR does not have more resources than we have. And they are known
not to listen to other expertise than their own.

Best regards
Keld

Comment 6 Florian Weimer 2015-01-29 13:17:16 UTC

Fixing this will change the sort order of existing data, which is quite risky.  Is it really worth it?

Comment 7 Carlos O'Donell 2015-01-29 14:35:11 UTC

(In reply to Florian Weimer from comment #6)
> Fixing this will change the sort order of existing data, which is quite
> risky.  Is it really worth it?

For the long-term support of locales it must change. Unless we get more maintainers, my plan is to continue to push that we match CLDR and Unicode, and thus exactly what libicu does, and reduce the "surprise" for developers going from Java to C/C++ or vice versa.

Comment 8 Florian Weimer 2015-01-29 14:37:41 UTC

(In reply to Carlos O'Donell from comment #7)
> For the long term support of locales it must change. Unless we get more
> maintainers my plan is to continue to push that we match CLDR, UNICODE and
> thus exactly what libicu does and reduce the "surprise" for developers going
> from java to C/C++ or vice-versa.

It would be possible to rename the locale each time the ordering changes (and change the environment settings), which might satisfy both needs (fixed locales for interactive use, predictable ordering for data at rest).

Comment 9 keld@keldix.com 2015-01-30 15:25:20 UTC

On Thu, Jan 29, 2015 at 02:35:11PM +0000, carlos at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> --- Comment #7 from Carlos O'Donell <carlos at redhat dot com> ---
> (In reply to Florian Weimer from comment #6)
> > Fixing this will change the sort order of existing data, which is quite
> > risky.  Is it really worth it?
> 
> For the long term support of locales it must change. Unless we get more
> maintainers my plan is to continue to push that we match CLDR, UNICODE and thus
> exactly what libicu does and reduce the "surprise" for developers going from
> java to C/C++ or vice-versa.

The fix is wrong, IMHO.

Best regards
Keld

Comment 10 Carlos O'Donell 2015-01-30 17:51:57 UTC

(In reply to keld@keldix.com from comment #9)
> On Thu, Jan 29, 2015 at 02:35:11PM +0000, carlos at redhat dot com wrote:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> > 
> > --- Comment #7 from Carlos O'Donell <carlos at redhat dot com> ---
> > (In reply to Florian Weimer from comment #6)
> > > Fixing this will change the sort order of existing data, which is quite
> > > risky.  Is it really worth it?
> > 
> > For the long term support of locales it must change. Unless we get more
> > maintainers my plan is to continue to push that we match CLDR, UNICODE and thus
> > exactly what libicu does and reduce the "surprise" for developers going from
> > java to C/C++ or vice-versa.
> 
> The fix is wrong, IMHO.

Thanks for stating that. In this case we'll need to discuss why it's wrong and try to come to a consensus, including talking to CLDR about it. Thus this issue is going to be more work, but not impossible.

Comment 11 Egmont Koblinger 2015-09-08 08:50:34 UTC

This change broke (among others) the Hungarian locales (see 18934).

I totally agree with Alexandre's opinion (the assumptions made by the patch being wrong on so many levels); extending with a fifth one:

Even if there are some French words present in a list, if you're using a certain language then the alphabetical rules of that language should apply, not the French one. This is what locale definitions are about. Define in the French locales the way to sort words on a French UI, but please leave the other locales alone.

I'm disappointed that such a change that was doomed to break so many locales managed to make it into glibc. But I think that in the end it boils down to the lack of proper unittest coverage.

In the above mentioned bug I created an extensive unittest for Hungarian, one that points to the official rules of alphabetical sorting and takes the examples from that (plus many more), and would have failed with this change.

I encourage maintainers of locale files to come up with similarly extensive unittests.

Comment 12 Egmont Koblinger 2015-09-08 08:52:47 UTC

Sorry, make it a link: bug 18934.

Comment 13 Sourceware Commits 2017-03-28 14:36:39 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  ea1898dded26316e2e73adfb409224e864ffaa8b (commit)
      from  78c05814320cdc3377347f8e5fdbaa7cf5abf5b5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ea1898dded26316e2e73adfb409224e864ffaa8b

commit ea1898dded26316e2e73adfb409224e864ffaa8b
Author: Egmont Koblinger <egmont@gmail.com>
Date:   Wed Mar 22 21:27:30 2017 -0400

    localedata: hu_HU: fix multiple sorting bugs (bug 18934)
    
    Fix the incorrect sorting order of a digraph and its geminated variant,
    regression introduced by a faulty fix to bug 13547 in commit
    b008d4c85619a753e441d7f473ba8af0db400bd6.
    
    Fix two inconsistencies in sorting unusual capitalization of digraphs
    (bug #18587).
    
    Enable DIACRIT_FORWARD to work around bug #17750.
    
    Sort foreign accents after the Hungarian ones.
    
    Add extensive unittests containing all the examples from The Rules of
    Hungarian Orthography and many more, including explanatory comments.

-----------------------------------------------------------------------

Summary of changes:
 NEWS                     |    4 +
 localedata/ChangeLog     |    7 +
 localedata/Makefile      |    4 +-
 localedata/hu_HU.in      |  560 ++++++++++++++++++++++++++++++++++++++++++++++
 localedata/locales/hu_HU |  286 ++++++++++++------------
 5 files changed, 716 insertions(+), 145 deletions(-)
 create mode 100644 localedata/hu_HU.in

Comment 14 Sourceware Commits 2017-11-29 10:57:48 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  8da25eec0aaf4d86a06088fff8d175989835e071 (commit)
      from  a55430cb0e261834ce7a4e118dd9e0f2b7fb14bc (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8da25eec0aaf4d86a06088fff8d175989835e071

commit 8da25eec0aaf4d86a06088fff8d175989835e071
Author: Alexandre Oliva <aoliva@redhat.com>
Date:   Tue Nov 28 16:23:02 2017 +0100

    Collation fix: make forward accent sorting the default [BZ #17750]
    
    	[BZ #17750]
    	* Makefile: add fr_CA.UTF-8 to test-input and LOCALES.
    	* localedata/fr_CA.UTF-8.in: New file with test data for backward
    	accents sorting.
    	* localedata/fr_FR.UTF-8.in: Fix test data for forward accents
    	sorting.
    	* localedata/locales/cs_CZ (LC_COLLATE): Remove “define DIACRIT_FORWARD”
    	* localedata/locales/de_DE (LC_COLLATE): Likewise.
    	* localedata/locales/hu_HU (LC_COLLATE): Likewise.
    	* localedata/locales/lb_LU (LC_COLLATE): Likewise.
    	* localedata/locales/yuw_PG (LC_COLLATE): Likewise.
    	* localedata/locales/fr_CA (LC_COLLATE): Add “define DIACRIT_BACKWARD”
    	* localedata/locales/iso14651_t1_common: Use “ifdef DIACRIT_FORWARD”
    	instead of “ifdef DIACRIT_BACKWARD”.
    
    The only locale which currently needs backward accents sorting is fr_CA.
    Therefore, forward accents sorting should be the default.
    
    Before this patch, backwards accent sorting was the default and all
    locales except fr_CA had to use
    
        define DIACRIT_FORWARD
    
    before
    
        copy "iso14651_t1"
    
    Most locales didn’t do that and thus got the inappropriate backwards accents sorting
    by accident. Now only the fr_CA locale needs to use
    
        define DIACRIT_BACKWARD
    
    before
    
        copy "iso14651_t1"
    
    Original patch slightly modified by: Mike FABIAN <mfabian@redhat.com>

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                     |   17 +++++++++++++++++
 localedata/Makefile                           |    4 ++--
 localedata/{fr_FR.UTF-8.in => fr_CA.UTF-8.in} |   18 +++++++++---------
 localedata/fr_FR.UTF-8.in                     |   22 +++++++++++-----------
 localedata/locales/cs_CZ                      |    2 --
 localedata/locales/de_DE                      |    2 --
 localedata/locales/fr_CA                      |    2 ++
 localedata/locales/hu_HU                      |    1 -
 localedata/locales/iso14651_t1_common         |    6 +++---
 localedata/locales/lb_LU                      |    2 --
 localedata/locales/yuw_PG                     |    1 -
 11 files changed, 44 insertions(+), 33 deletions(-)
 copy localedata/{fr_FR.UTF-8.in => fr_CA.UTF-8.in} (100%)
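
As a rough illustration of what the commit message describes (a simplified model written for this page, not code from glibc): the patch flips the default accent direction for locales that copy "iso14651_t1". The function name `accent_direction` and the two sets are hypothetical; the locale names in them are taken from the commit message above, and the model assumes every locale shown copies "iso14651_t1" without further overrides.

```python
# Locales that carried each define, as listed in the commit message.
FORWARD_BEFORE = {"cs_CZ", "de_DE", "hu_HU", "lb_LU", "yuw_PG"}  # had "define DIACRIT_FORWARD"
BACKWARD_AFTER = {"fr_CA"}                                       # now has "define DIACRIT_BACKWARD"

def accent_direction(locale_name, patched):
    """Return the accent direction a locale ends up with (simplified model)."""
    if patched:
        # New default: forward, unless the locale defines DIACRIT_BACKWARD.
        return "backward" if locale_name in BACKWARD_AFTER else "forward"
    # Old default: backward, unless the locale defines DIACRIT_FORWARD.
    return "forward" if locale_name in FORWARD_BEFORE else "backward"

for name in ("de_DE", "fr_FR", "fr_CA"):
    before = accent_direction(name, patched=False)
    after = accent_direction(name, patched=True)
    print(f"{name}: {before} -> {after}")
```

Under this model de_DE and fr_CA keep their behaviour, while a locale such as fr_FR, which never defined DIACRIT_FORWARD, moves from backward to forward ordering.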

Comment 15 Mike FABIAN 2017-11-29 12:09:39 UTC

Fixed in glibc master.

Comment 16 keld@keldix.com 2017-11-29 13:11:04 UTC

Well, all French-language locales should be diacrit backward:
fr_FR, fr_BE, fr_CH and others.

Also other languages, where French words and names are the biggest source
of multiple accented characters, should have diacrit backward.
This goes for Danish (my own language), Swedish, Norwegian, Finnish, Dutch.

Best regards
keld

On Wed, Nov 29, 2017 at 10:57:48AM +0000, cvs-commit at gcc dot gnu.org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> 
>     The only locale which currently needs backward accents sorting is fr_CA.
>     Therefore, forward accents sorting should be the default.
> 
>     Before this patch, backwards accent sorting was the default and all
>     locales except fr_CA had to use
> 
>         define DIACRIT_FORWARD
> 
>     before
> 
>         copy "iso14651_t1"
> 
>     Most locales didn’t do that and thus got the inappropriate backwards accents sorting
>     by accident. Now only the fr_CA locale needs to use
> 
>         define DIACRIT_BACKWARD
> 
>     before
> 
>         copy "iso14651_t1"
> 
>     Original patch slightly modified by: Mike FABIAN <mfabian@redhat.com>

Comment 17 keld@keldix.com 2017-11-29 13:19:16 UTC

Probably also for all English-language locales, and African-language locales
where French influence is big and the use of accented characters in the African
language in question, e.g. Swahili, is very limited.

Best regards
keld

Comment 18 Egmont Koblinger 2017-11-29 19:27:32 UTC

(In reply to keld@keldix.com from comment #16)

> Also other languages, where french words and names are the biggest source
> of multiple accented characters should have diacrit backward.
> This goes for Danish (my own language), Swedish, Norwegian, Finnish, Dutch.

I can't speak any of these languages, but looking at some random Finnish text I see tons of ä and ö letters, and a significant number of words containing 2 or more of them. Hence I seriously doubt the correctness of your claim.

Even if looking only at the foreign words within these languages, I'd _guess_ that they take words from each other or maybe German more often than from French. But even if we assume French is the most common source of foreign words, that's still not a strong enough reason to go for backwards diacrit ordering. In order for backwards diacrit ordering to even be a possibility to consider, I believe French accented words should outweigh all other local and foreign accented words combined.

IMO let's keep this unreasonable idea of backwards diacrit ordering to those languages only that explicitly have it; let's not force this stupid concept on more locales than necessary.

By the way, don't these languages have some "official" collation rules, or at least some established common practice?

Comment 19 Florian Weimer 2017-11-29 19:49:36 UTC

(In reply to Egmont Koblinger from comment #18)

> By the way, don't these language have some "official" collation rules, or at
> least some established common practice?

I expect that many languages/scripts have multiple collation rules, depending on use, particularly when it comes to sorting foreign languages using the same base script.

Comment 20 Egmont Koblinger 2017-11-29 20:14:33 UTC

(In reply to Florian Weimer from comment #19)

> I expect that many languages/scripts have multiple collation rules,
> depending on use, particularly when it comes to sorting foreign languages
> using the same base script.

Let's not forget that most languages with Latin scripts do use accents regularly. I don't think glibc allows different diacrit ordering for "own" accents and "foreign" accents, e.g. in case of Finnish to use forward diacrit ordering for ä and ö, and backward diacrit ordering for é and û (and what if they're mixed?).

So the question is not how to sort _foreign_ words within the language, the question is how to sort _own_ words of the language. This defines the diacrit sorting. Foreign words will follow.

If a list to be sorted is composed solely of foreign words from a particular language, e.g. solely French words in an otherwise Finnish environment, it might be reasonable to sort using the rules of that language, e.g. French in this case. This can be achieved by setting LC_COLLATE=fr_FR.UTF-8.
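
A hedged sketch of that suggestion (not part of the original comment): Python's standard `locale` module can apply one specific locale's collation rules to a single sort. The helper name `sort_with_locale` is hypothetical. The result depends on which locales are installed and on the C library version, so no expected output is shown.

```python
import locale

def sort_with_locale(words, locale_name):
    """Sort words with the collation rules of locale_name.

    Returns None if the C library refuses the locale name (e.g. not installed).
    """
    previous = locale.setlocale(locale.LC_COLLATE)   # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that sorts per LC_COLLATE.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore for the rest of the program

words = ["côté", "coté", "cote", "côte"]
print(sort_with_locale(words, "fr_FR.UTF-8"))
print(sort_with_locale(words, "fr_CA.UTF-8"))
```

Note that `setlocale` changes process-wide state, so this pattern is best kept out of multi-threaded code.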

In my opinion, the only valid question is what to do with English in territories where French is by far the second most popular language: is it reasonable to go with backward diacrits ordering there?

Comment 21 keld@keldix.com 2017-11-30 07:31:44 UTC

On Wed, Nov 29, 2017 at 07:27:32PM +0000, egmont at gmail dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> --- Comment #18 from Egmont Koblinger <egmont at gmail dot com> ---
> (In reply to keld@keldix.com from comment #16)
> 
> > Also other languages, where french words and names are the biggest source
> > of multiple accented characters should have diacrit backward.
> > This goes for Danish (my own language), Swedish, Norwegian, Finnish, Dutch.
> 
> I can't speak any of these languages, but looking at some random Finnish text I
> see tons of ä and ö letters, a significant amount of words containing 2 or more
> of them. Hence I seriously doubt the correctness of your claim.

Well, in Finnish and other Nordic languages like Danish, Swedish and Norwegian, ö and ä etc.
are not considered accented letters, but genuine separate letters, so that is why
there are few strings with more than one accented letter.

> Even if looking only at the foreign words within these languages, I'd _guess_
> that they take words from each other or maybe German more often than from
> French. But even if let's assume French is the most common source of foreign
> words, that's still not a strong enough reason to go for backwards diacrit
> ordering. In order for backwards diacrit ordering to even be a possibility to
> consider, I believe French accented words should outweigh all other local and
> foreign accented words combined.

German umlaut letters are much the same in Finnish (and Swedish) and ä and ö are
then the same as the genuine Finnish/Swedish letters.

Yes, I also think that the total number of French words with 2 or more accented letters
(according to the rules of the specific language) should outweigh the total
number of other occurrences, but I believe that this is the case in the examples that I
have given.

> By the way, don't these language have some "official" collation rules, or at
> least some established common practice?

There are specs from the official standards bodies specifying the backwards diacrit rules, yes.

Best regards
keld

Comment 22 keld@keldix.com 2017-11-30 07:40:18 UTC

On Wed, Nov 29, 2017 at 07:49:36PM +0000, fweimer at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> --- Comment #19 from Florian Weimer <fweimer at redhat dot com> ---
> (In reply to Egmont Koblinger from comment #18)
> 
> > By the way, don't these language have some "official" collation rules, or at
> > least some established common practice?
> 
> I expect that many languages/scripts have multiple collation rules, depending
> on use, particularly when it comes to sorting foreign languages using the same
> base script.

That is not my experience. For Danish (my language) there is only one standard, and that
takes care of many foreign characters. Then there is a spec from Danish Standard
that is more elaborate, in the form of a POSIX/Linux locale, covering all of ISO 10646/Unicode,
that builds on ISO 14651, with the backwards diacrit spec. For German, I know there are 2 sorting
specs, one where ä, ö and ü etc are considered accented versions of a, o and u, and one 
where they are interpreted as ae oe and ue. There are sorting standards for all of these
languages, that are well adhered to in the market place.

best regards
keld

Comment 23 keld@keldix.com 2017-11-30 07:48:25 UTC

On Wed, Nov 29, 2017 at 08:14:33PM +0000, egmont at gmail dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> --- Comment #20 from Egmont Koblinger <egmont at gmail dot com> ---
> (In reply to Florian Weimer from comment #19)
> 
> > I expect that many languages/scripts have multiple collation rules,
> > depending on use, particularly when it comes to sorting foreign languages
> > using the same base script.
> 
> Let's not forget that most languages with Latin scripts do use accents
> regularly. I don't think glibc allows different diacrit ordering for "own"
> accents and "foreign" accents, e.g. in case of Finnish to use forward diacrit
> ordering for ä and ö, and backward diacrit ordering for é and û (and what if
> they're mixed?).

I agree that glibc does not distinguish between "own" accented letters and foreign ones.
But ä and ö are not accented letters in Finnish; they are genuine separate letters with
their own place in the alphabet.

> In my opinion, the only valid question is what to do with English in
> territories where French is by far the second most popular language: is it
> reasonable to go with backward diacrits ordering there?

That is what I am suggesting, at least for Canada.
The same reasoning could be done for Dutch in Belgium, and then also the Netherlands.

Best regards
Keld

Comment 24 Egmont Koblinger 2017-11-30 09:09:25 UTC

(In reply to keld@keldix.com from comment #21)

> Well, in Finnish and other Nordic languages like Danish, Swedish and
> Norwegian, ö and ä etc
> are not considered accented letters, but genuine separated letters, so that
> is why 
> there are few strings with more than one accented letter.

Thanks for the explanation! (This actually should have occurred to me, as the famous Swedish yellow-blue furniture store offers framed pictures and bed linen showing the Swedish alphabet, with ÅÄÖ at the end.)

To clarify: If they sort German words containing ä and ö, they're sorted among the same letters of their own language, right? And what about French accents, are they on the other hand mixed together with their unaccented counterparts?

> German umlaut letters are much the same in Finnish (and Swedish) and ä and ö
> are
> then the same as the genuine Finnish/Swedish letters.

What about German ü?

(In reply to keld@keldix.com from comment #22)

> [...] Then there is a spec from Danish Standard
> that is more elaborate [...] with the backwards diacrit spec.

I'm shocked to hear that there's not only one language but more languages that use backwards diacritics, something that IMO no sane man with any tiny bit of common sense would ever decide on :-)

(In reply to keld@keldix.com from comment #23)

> That is what I am suggesting, at least for Canada.
> The same reasoning could be done for Dutch in Belgium, and then also the
> Netherlands.

If this is indeed what's correct for these languages / what people living there prefer then it's okay for me. I'm just hoping that the kinda de-facto standard en_US will stay with forward diacrits. I _guess_ Spanish is more frequently used there than French, plus again, I can't imagine how anyone ever could have come up with this braindamaged idea of backward diacrit sorting so I'd personally prefer en_US not to have this craziness :-)

Cheers!

Comment 25 keld@keldix.com 2017-12-03 13:16:10 UTC

On Thu, Nov 30, 2017 at 09:09:25AM +0000, egmont at gmail dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17750
> 
> --- Comment #24 from Egmont Koblinger <egmont at gmail dot com> ---
> (In reply to keld@keldix.com from comment #21)
> 
> > Well, in Finnish and other Nordic languages like Danish, Swedish and
> > Norwegian, ö and ä etc
> > are not considered accented letters, but genuine separated letters, so that
> > is why 
> > there are few strings with more than one accented letter.
> 
> To clarify: If they sort German words containing ä and ö, they're sorted among
> the same letters of their own language, right? And what about French accents,
> are they on the other hand mixed together with their unaccented counterparts?

Yes, German ö and ä are treated exactly as the Swedish letters.
And French accented letters like é and è are treated as 'e' but with an accent. é is actually
much used in Swedish proper.

> > German umlaut letters are much the same in Finnish (and Swedish) and ä and ö
> > are
> > then the same as the genuine Finnish/Swedish letters.
> 
> What about German ü?

ü is treated as a y AFAIK, but as if it had an accent. Danish æ and ø are treated as ä and ö,
but as if they had an accent.

> (In reply to keld@keldix.com from comment #22)
> 
> > [...] Then there is a spec from Danish Standard
> > that is more elaborate [...] with the backwards diacrit spec.
> 
> I'm shocked to hear that there's not only one language but more languages that
> use backwards diacritics, something that IMO no sane man with any tiny bit of
> common sense would ever decide on :-)

Well, it is because the last accented character in French is more important
when pronounced. I agree that it is a bit counter-intuitive, but I do favour
the actual habits in the real world over what is logical.

> (In reply to keld@keldix.com from comment #23)
> 
> > That is what I am suggesting, at least for Canada.
> > The same reasoning could be done for Dutch in Belgium, and then also the
> > Netherlands.
> 
> If this is indeed what's correct for these languages / what people living there
> prefer then it's okay for me. I'm just hoping that the kinda de-facto standard
> en_US will stay with forward diacrits. I _guess_ Spanish is more frequently
> used there than French, plus again, I can't imagine how anyone ever could have
> come up with this braindamaged idea of backward diacrit sorting so I'd
> personally prefer en_US not to have this craziness :-)

The kind-of de facto i18n locale has forward diacrits. i18n is the standard locale of ISO TR 30112.
I think both Spanish and German need forward diacrits, and Spanish being a bigger
language than French would suggest that we should use forward diacrit as the default.

Best regards
Keld

Comment 26 Chris Leonard 2017-12-03 15:02:39 UTC

Created attachment 10659
attachment-71592-0.html

>
> >
> > If this is indeed what's correct for these languages / what people
> living there
> > prefer then it's okay for me. I'm just hoping that the kinda de-facto
> standard
> > en_US will stay with forward diacrits. I _guess_ Spanish is more
> frequently
> > used there than French, plus again, I can't imagine how anyone ever
> could have
> > come up with this braindamaged idea of backward diacrit sorting so I'd
> > personally prefer en_US not to have this craziness :-)
>
> the kind of defacto i18n locale has forward diacrits. i18n is the standard
> locale of ISO TR 30112.
> I think both Spanish and German needs forward diacrits, and Spanish being
> a bigger
> language than French would give that we should use forward diacrit as the
> default.
>
>
Don't forget that the conquistadores brought Spanish orthography to many of
the indigenous languages of the Americas as well, small in Internet
footprint and speaker count, large in language count.

cjl