Bug 3405

Summary: pt_BR: sort ordering issues
Product: glibc Reporter: Walter Cruz <walter.php>
Component: localedata Assignee: GNU C Library Locale Maintainers <libc-locales>
Status: NEW ---    
Severity: normal CC: edurbs, glibc-bugs, maiku.fabian, pasky
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Walter Cruz 2006-10-21 01:39:18 UTC
Hi all.

In the pt_BR locale, glibc ignores spaces when determining sort order.

For example, take this list:

GABRIELA HELEDA DE SOUZA
GABRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIELA JACOBY NOS
GABRIEL ALEXANDRE DA SILVA MANICA
GÁBRIEL ALCIDES KLIM PERONDI
GÁBRIELA JACOBY NOS 

But the right order is:

GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIEL ALEXANDRE DA SILVA MANICA
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIELA LETICIA BATISTA NUNES
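
A minimal Python sketch of the difference between the two orderings, not glibc's actual algorithm. `key_ignoring_spaces` and `key_word_by_word` are hypothetical helper names, and breaking ties by raw code point (so that A comes before Á) is an assumption made only for this illustration.

```python
import unicodedata

NAMES = [
    "GABRIELA HELEDA DE SOUZA",
    "GABRIEL ALCIDES KLIM PERONDI",
    "GABRIELA LETICIA BATISTA NUNES",
    "GABRIELA JACOBY NOS",
    "GABRIEL ALEXANDRE DA SILVA MANICA",
    "GÁBRIEL ALCIDES KLIM PERONDI",
    "GÁBRIELA JACOBY NOS",
]


def strip_accents(text):
    """Return text with combining marks removed, so that Á becomes A."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_ignoring_spaces(name):
    # Spaces are dropped before the base letters are compared.
    # The second element only breaks ties between accented variants.
    return (strip_accents(name).replace(" ", ""),
            unicodedata.normalize("NFC", name))


def key_word_by_word(name):
    # Spaces are kept; a space (U+0020) compares lower than any letter,
    # so "GABRIEL ..." sorts before "GABRIELA ...".
    return (strip_accents(name), unicodedata.normalize("NFC", name))


if __name__ == "__main__":
    for name in sorted(NAMES, key=key_word_by_word):
        print(name)
```

Sorting with `key_word_by_word` gives the order listed above, while `key_ignoring_spaces` puts the GABRIELA H... and GABRIELA J... names ahead of GABRIEL ALCIDES.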


I found that I can fix this by editing the pt_BR file in /usr/share/i18n/locales and adding:

reorder-after <U00A0>
<U0020><CAP>;<CAP>;<CAP>;<U0020>
reorder-end

to the LC_COLLATE section. After regenerating the locale, I get the right
sort order.
Comment 1 eduardo 2007-01-30 16:17:23 UTC
The "sort" command produces this wrongly sorted list:
~$ sort list.txt
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIEL ALEXANDRE DA SILVA MANICA

Tested on Ubuntu 6.06, Fedora Core 3, Red Hat 9 and openSUSE 10.2 (all i386),
with the same wrong sort order.
Comment 2 Petter Reinholdtsen 2007-01-30 16:23:22 UTC
Can you provide any references specifying that space should be handled
as a letter when sorting in Brazilian Portuguese?  Because if not, I suspect
you are mistaken when you believe space should be sorted that way.
Comment 3 Walter Cruz 2007-01-30 17:18:23 UTC
(In reply to comment #2)
> Can you provide any references specifying that space should be handled
> as a letter when sorting in Brazilian Portuguese?  Because if not, I suspect
> you are mistaken when you believe space should be sorted that way.

The rules are defined by ABNT (Associação Brasileira de Normas Técnicas) in a
document called NBR 6033, but the document isn't publicly available.

But since edurbs and I are native speakers, I think that you should believe us :D

[]'s
- Walter
Comment 4 keld@dkuug.dk 2007-01-30 18:45:29 UTC
Subject: Re:  sort order on pt_BR

On Tue, Jan 30, 2007 at 04:23:22PM -0000, pere at hungry dot com wrote:
> 
> ------- Additional Comments From pere at hungry dot com  2007-01-30 16:23 -------
> Can you provide any references specifying that space should be handled
> as a letter when sorting in Brazilian Portuguese?  Because if not, I suspect
> you are mistaken when you believe space should be sorted that way.

In most languages using a script with letters, you have two ordering
schemes, the standard one, and the word-by-word one. In the latter, space
is significant on the first level. So both are correct, culturally.

I don't know of an easy way to make both schemes available
to the user, other than providing two locales, with a small delta
(reorder-after) to make the word-by-word locale, plus a general
naming scheme so the user can choose easily, like the @euro variants.

best regards
Keld
Comment 5 Daniel Cristian Cruz 2007-03-09 13:00:43 UTC
(In reply to comment #0)
> I found that I can fix this by editing the pt_BR file in /usr/share/i18n/locales and adding:
> 
> reorder-after <U00A0>
> <U0020><CAP>;<CAP>;<CAP>;<U0020>
> reorder-end
> 
> to the LC_COLLATE section. After regenerating the locale, I get the right
> sort order.

It didn't work on Fedora 5. After changing the pt_BR file and running
the following command, I still have the same problem:
localedef -i pt_BR -c -f ISO-8859-1 -A /usr/share/locale/locale.alias pt_BR

Did I do something wrong?

Kind regards...
Comment 6 Daniel Cristian Cruz 2007-03-22 19:23:24 UTC
(In reply to comment #5)
> Did I do something wrong?

Yes, I did. I put a space between <U0020> and <CAP>.

But the ordering is still strange: 'a', 'á', 'ã' and 'à' should be treated as
the same character, yet they are ordered as if they were different.

Sorry...
Comment 7 Luiz K. Matsumura 2007-04-19 06:09:38 UTC
Hi Daniel

How is it ordering them?
I ran some tests, and the behavior with and without the proposed change is the same
when ordering these characters.
Maybe this is another bug?

(In reply to comment #6)
> (In reply to comment #5)
> > Did I do something wrong?
> 
> Yes, I did. I put a space between <U0020> and <CAP>.
> 
> But the ordering is still strange: 'a', 'á', 'ã' and 'à' should be treated as
> the same character, yet they are ordered as if they were different.
> 
> Sorry...
> 

Comment 8 Pierre Habouzit 2007-04-25 23:01:57 UTC
(In reply to comment #0)
> Hi all.
> 
> In the pt_BR locale, glibc ignores spaces when determining sort order.

FWIW fr_FR is hit as well, and many other locales are too.

cat a; echo "==========="; LC_ALL=fr_FR sort a
GABRIELA HELEDA DE SOUZA
GABRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIELA JACOBY NOS
GABRIEL ALEXANDRE DA SILVA MANICA
GÁBRIEL ALCIDES KLIM PERONDI
GÁBRIELA JACOBY NOS 
===========
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS 
GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIEL ALEXANDRE DA SILVA MANICA

> I found that I can fix this by editing the pt_BR file in /usr/share/i18n/locales and adding:
> 
> reorder-after <U00A0>
> <U0020><CAP>;<CAP>;<CAP>;<U0020>
> reorder-end
> 
> to the LC_COLLATE section. After regenerating the locale, I get the right
> sort order.

Comment 9 Guilherme de S. Pastore 2007-11-03 12:49:43 UTC
Petter,

I can assure you that the proposed one is the behaviour any Brazilian would
expect since the age of 6, when they learn how to sort at school, right after
learning the alphabet.

If it is *really* necessary, I can pay for web access to the already mentioned
lousy 5-page document from ABNT which defines the technical norm for sorting
just to show you, but you may guess I'm not eager to :)
Comment 10 Daniel Henrique 2010-06-27 14:19:25 UTC
Hi, everybody. First of all, I apologize for my poor writing skills. English is
not my native language.

The pt_BR sort order seems odd to me. If this behavior is not a bug, I agree with
Keld's suggestion: define a new locale, like pt_BR@abnt, that uses the "right"
sort order.

Can the sample reorder statement handle lowercase and uppercase properly? A sort
without the suggested change to the locale definition file can't:

LC_ALL=pt_BR LANG=pt_BR LANGUAGE=pt_BR sort a.txt 
gabriela heleda de souza
GABRIELA HELEDA DE SOUZA
gabriela jacoby nos
GABRIELA JACOBY NOS
gábriela jacoby nos
GÁBRIELA JACOBY NOS 
gabriel alcides klim perondi
GABRIEL ALCIDES KLIM PERONDI
gábriel alcides klim perondi
GÁBRIEL ALCIDES KLIM PERONDI
gabriela leticia batista nunes
GABRIELA LETICIA BATISTA NUNES
gabriel alexandre da silva manica
GABRIEL ALEXANDRE DA SILVA MANICA


The expected output:
gabriel alcides klim perondi
gábriel alcides klim perondi
gabriel alexandre da silva manica
gabriela heleda de souza
gabriela jacoby nos
gábriela jacoby nos
gabriela leticia batista nunes
GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIEL ALEXANDRE DA SILVA MANICA
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIELA LETICIA BATISTA NUNES


This is "tricky" because we don't just perform a lexicographic comparison of
each character (a Portuguese-speaking Java user should know that
String.compareTo is not enough to produce the sorted result they expect, for
several reasons).
We first sort ignoring accents, then we use the accents as a
"tiebreaker/disambiguation criterion" (I don't know the correct term in English)
between otherwise equal full names. In the first step, a = á, but in the latter step, a < á.


Well, that is all I know. I will try to get a copy of the Norma NBR 6033:1989
(NB 106) from ABNT to confirm (or not :-)) these examples.

Thanks.
Comment 11 Daniel Henrique 2010-06-27 14:25:54 UTC
And I don't know if the Norma is "case sensitive" or "case insensitive".
Comment 12 keld@keldix.com 2010-06-27 15:51:51 UTC
Subject: Re:  sort order on pt_BR

On Sun, Jun 27, 2010 at 02:25:55PM -0000, email_daniel_h at yahoo dot com dot br wrote:
> 
> ------- Additional Comments From email_daniel_h at yahoo dot com dot br  2010-06-27 14:25 -------
> And I don't know if the Norma is "case sensitive" or "case insensitive".

All the European language sorting standards I know of are case insensitive on the first
level, case only counts on the 3rd level. I expect this also to be true for
Portuguese. That is: most important distinction is base letter, second
is accent, third is case.
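
The three levels can be sketched as a Python sort key. This is only an illustration, not glibc's implementation: `three_level_key` is a hypothetical name, and placing lowercase before uppercase at the third level is an assumption.

```python
import unicodedata


def three_level_key(text):
    """Build a (base letters, accents, case) key for sorting."""
    decomposed = unicodedata.normalize("NFD", text)
    base_chars = []  # the letters with combining marks removed
    accents = []     # combining marks attached to each base letter
    for ch in decomposed:
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += ch
        else:
            base_chars.append(ch)
            accents.append("")
    primary = "".join(base_chars).casefold()  # level 1: base letter only
    secondary = tuple(accents)                # level 2: accents
    # Level 3: case. False sorts before True, so lowercase comes first
    # (an assumption for this sketch).
    tertiary = tuple(ch.isupper() for ch in base_chars)
    return (primary, secondary, tertiary)


if __name__ == "__main__":
    words = ["GÁBRIEL", "gabriel", "GABRIEL", "gábriel", "gabriela"]
    print(sorted(words, key=three_level_key))
```

Because the key is a tuple, accents are only consulted when the base letters are equal, and case only when both base letters and accents are equal.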

best regards
keld
Comment 13 Daniel Henrique 2010-06-27 15:58:53 UTC
For those interested in a workaround, for a CentOS 5.5 box (use at your own risk):

1. Copy the base locale definition file

cp /usr/share/i18n/locales/pt_BR pt_BR\@abnt\.src

2. Edit pt_BR@abnt.src and add

reorder-after <U00A0>
<U0020><CAP>;<CAP>;<CAP>;<U0020>
reorder-end

before END LC_COLLATE

3. Create new directories

mkdir /usr/lib/locale/pt_BR\@abnt
mkdir /usr/lib/locale/pt_BR\.utf8\@abnt

4. Compile the new locales

localedef --verbose -c -i pt_BR\@abnt.src -f ISO-8859-1 /usr/lib/locale/pt_BR\@abnt
localedef --verbose -c -i pt_BR\@abnt.src -f UTF-8 /usr/lib/locale/pt_BR\.utf8\@abnt

5. Check the new locales

locale -a | grep pt_BR
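
Step 2 can also be scripted instead of done by hand. The following is a rough Python sketch that only manipulates text; `add_reorder_block` is a hypothetical helper, and it assumes the locale source contains a line reading exactly "END LC_COLLATE".

```python
REORDER_BLOCK = (
    "reorder-after <U00A0>\n"
    "<U0020><CAP>;<CAP>;<CAP>;<U0020>\n"
    "reorder-end\n"
)


def add_reorder_block(locale_source):
    """Return locale_source with REORDER_BLOCK inserted before END LC_COLLATE."""
    lines = locale_source.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.strip() == "END LC_COLLATE":
            return "".join(lines[:index]) + REORDER_BLOCK + "".join(lines[index:])
    raise ValueError("no END LC_COLLATE line found")


if __name__ == "__main__":
    # The file name matches the copy made in step 1.
    with open("pt_BR@abnt.src", encoding="utf-8") as handle:
        patched = add_reorder_block(handle.read())
    with open("pt_BR@abnt.src", "w", encoding="utf-8") as handle:
        handle.write(patched)
```

The function raises an error rather than appending the block at the end of the file, since the block has no effect outside the LC_COLLATE section.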


I don't know if this is the best way, but it is one way.

The directories may be different in other Linux distributions.

I think it would be better to create a new pt_BR@abnt.src with a "copy"
statement for each section than to copy the whole source from
/usr/share/i18n/locales/pt_BR
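
A rough Python sketch of what such a generated source could look like. It has not been tested against localedef: the category list, the `build_variant_source` name, and the assumption that the reorder block may follow a "copy" statement inside LC_COLLATE are all mine.

```python
# Locale categories assumed to be present in the base locale source.
CATEGORIES = [
    "LC_IDENTIFICATION", "LC_CTYPE", "LC_COLLATE", "LC_MONETARY",
    "LC_NUMERIC", "LC_TIME", "LC_MESSAGES", "LC_PAPER", "LC_NAME",
    "LC_ADDRESS", "LC_TELEPHONE", "LC_MEASUREMENT",
]

REORDER_LINES = [
    "reorder-after <U00A0>",
    "<U0020><CAP>;<CAP>;<CAP>;<U0020>",
    "reorder-end",
]


def build_variant_source(base="pt_BR"):
    """Return locale source text that copies every category from base
    and adds the reorder block to LC_COLLATE only."""
    lines = []
    for category in CATEGORIES:
        lines.append(category)
        lines.append(f'copy "{base}"')
        if category == "LC_COLLATE":
            lines.extend(REORDER_LINES)
        lines.append(f"END {category}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_variant_source())
```

This keeps the variant in sync with later changes to the base pt_BR file, since only the collation delta is stored.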
Comment 14 Daniel Henrique 2010-06-27 16:02:27 UTC
(In reply to http://sources.redhat.com/bugzilla/show_bug.cgi?id=3405#c12)

Thanks, Keld.
Comment 15 Daniel Henrique 2010-09-30 00:31:41 UTC
Hi, everybody. I've got a copy of the Norma NBR 6033:1989 (NB 106). Can I send
it (privately) to the person who will fix this bug? The "catch": the document
is a PDF file made of scanned images, in Portuguese.

Thanks.
Comment 16 eduardo 2020-10-04 15:39:13 UTC
So... here we are, 14 years after my first post. Maybe when I die it will be patched.
What is missing to fix it?