Bug 31205 - Inconsistent (mon_)grouping formats
Summary: Inconsistent (mon_)grouping formats
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 normal
Target Milestone: 2.39
Assignee: Mike FABIAN
Reported: 2024-01-02 13:28 UTC by Oscar Gustafsson
Modified: 2024-01-25 10:50 UTC

Last reconfirmed: 2024-01-18 00:00:00


Description Oscar Gustafsson 2024-01-02 13:28:25 UTC
I was looking into using number grouping for a project and realized that the formats used are not consistent. For reference, here is the documentation:

https://sourceware.org/glibc/manual/html_node/General-Numeric.html

These are the two issues I've found:

* Many locales have the same digit repeated, e.g., en_US https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/en_US;h=5cc518dff2fc1309e5cddd86950d6e9898a2d7e1;hb=refs/heads/master#l75
As far as I can tell, it should be enough to have a single 3 there, as is the case for, e.g., en_HK https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/en_HK;h=5f797e076099c4972d3c74fe92e5a6607c3bae95;hb=refs/heads/master#l84

* Some locales have 0;0 as grouping, e.g. el_GR https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/el_GR;h=285e1e009276476f2aa2d2745177944c7b34a78b;hb=HEAD
Not sure what this is supposed to mean, but, e.g., POSIX has -1 to indicate "no grouping"
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/POSIX;h=7ec7f1c5774ab1fb011c08e2e17d435923e48fe2;hb=refs/heads/master#l262 

Note that "The last member is either 0, in which case the previous member is used over and over again for all the remaining groups...", i.e., string termination. Here, however, the string consists of three string termination characters, so there is no previous member to repeat.
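To make the "no previous member" problem concrete, here is a minimal sketch of how a consumer might read the raw grouping string according to the manual's description (0 terminates and repeats the previous member, CHAR_MAX stops grouping). The function name `read_grouping` is hypothetical, and CHAR_MAX = 127 assumes a signed 8-bit char; this is an illustration, not glibc code.

```python
CHAR_MAX = 127  # assumption: `char` is signed and 8 bits wide


def read_grouping(raw: bytes):
    """Interpret a C-style grouping string.

    `raw` is the memory behind the char pointer, terminator included.
    Returns (sizes, repeat_last): the explicit group sizes, and whether
    the last size is repeated for all remaining digits.
    """
    sizes = []
    for value in raw:
        if value == 0:
            # Terminator: repeat the previous member, if there is one.
            return sizes, bool(sizes)
        if value == CHAR_MAX:
            # No further grouping after the sizes collected so far.
            return sizes, False
        sizes.append(value)
    # Ran off the end without a terminator: treat it like a terminator.
    return sizes, bool(sizes)


print(read_grouping(b"\x03\x03\x00"))  # two explicit members, then repeat
print(read_grouping(b"\x03\x00"))      # one explicit member, then repeat
print(read_grouping(b"\x00\x00\x00"))  # terminator first: nothing to repeat
```

With only terminators there are no sizes and nothing to repeat, so a careful reader ends up with "no grouping", but only by handling an edge case that the documentation does not spell out.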

To some extent this is also the case for mon_grouping, at least the first case.

I guess the impact of this issue depends on the situation. The first one will just waste a few bytes (and lead to confusion), but the second may lead to weird results, at least in code using the raw localedata information without noticing this.

If people agree that this should be made consistent and fixed (it is not so obvious what to replace 0;0 with, probably -1?), I'd be happy to provide a patch. (Even happier to do that using standard git access; I can provide some evidence that I know how to use it, etc.)
Comment 2 Mike FABIAN 2024-01-02 16:04:54 UTC
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html says:

7.3.4 LC_NUMERIC
...

grouping
    Define the size of each group of digits in formatted non-monetary quantities. The operand is a sequence of integers separated by semicolons. Each integer specifies the number of digits in each group, with the initial integer defining the size of the group immediately preceding the decimal delimiter, and the following integers defining the preceding groups. If the last integer is not -1, then the size of the previous group (if any) shall be repeatedly used for the remainder of the digits. If the last integer is -1, then no further grouping shall be performed.
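As an illustration of the rule quoted above, here is a minimal sketch that applies a locale-source grouping operand (such as "3", "3;3", "3;2" or "-1") to a string of digits. The name `group_digits` is hypothetical. Treating a leading 0 as "no grouping" and any negative value like -1 are assumptions chosen to match the behaviour observed in this bug, not something the POSIX text states.

```python
def group_digits(digits: str, spec: str, sep: str = ",") -> str:
    """Group `digits` (integer part only) according to `spec`."""
    sizes = [int(part) for part in spec.split(";")]
    groups = []      # collected right to left
    rest = digits    # digits not yet assigned to a group
    prev = None      # last group size actually used
    index = 0
    while rest:
        # Past the end of the list, behave like 0: repeat the previous size.
        size = sizes[index] if index < len(sizes) else 0
        index += 1
        if size < 0:
            break                # -1: no further grouping
        if size == 0:
            if prev is None:
                break            # nothing to repeat (assumption: no grouping)
            size = prev
        prev = size
        if len(rest) <= size:
            break                # remaining digits form the leading group
        groups.append(rest[-size:])
        rest = rest[:-size]
    groups.append(rest)
    return sep.join(reversed(groups))


for spec in ("3", "3;3", "3;2", "3;-1", "-1", "0;0"):
    print(spec, group_digits("12345678", spec))
```

Under these assumptions "3" and "3;3" give the same result, as do "-1" and "0;0".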
Comment 3 Mike FABIAN 2024-01-02 16:11:34 UTC
So in the el_GR locale, one could use

grouping -1


instead of 

grouping 0;0


But it does not seem to matter; both behave the same:


mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$ grep -E "grouping.*(0;0|-1)" *
C:mon_grouping        -1
C:grouping        -1
POSIX:mon_grouping        -1
POSIX:grouping        -1
aa_DJ:grouping               0;0
ar_SA:mon_grouping      -1
ar_SA:grouping  -1
bs_BA:grouping                  0;0
el_CY:grouping                  0;0
el_GR:grouping                  0;0
eo:grouping      0;0
es_CU:grouping             0;0
gl_ES:grouping             0;0
i18n:mon_grouping        -1
i18n:grouping        -1
mg_MG:grouping                  0;0
pap_AW:grouping                  0;0
pap_CW:grouping                  0;0
pt_PT:grouping                  0;0
rw_RW:grouping                  -1
sl_SI:grouping                  0;0
sr_RS:grouping                  0;0
ti_ER:grouping              0;0
wo_SN:grouping                  0;0
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$ LC_ALL=rw_RW.UTF-8 /usr/bin/printf "%'f\n" 12345678.9
12345678,900000
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$ LC_ALL=el_GR.UTF-8 /usr/bin/printf "%'f\n" 12345678.9
12345678,900000
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$
Comment 4 Mike FABIAN 2024-01-02 16:14:24 UTC
Also 

grouping 3

and

grouping 3;3

behave the same:

mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$ grep grouping en_US en_PH
en_US:mon_grouping        3;3
en_US:grouping        3;3
en_PH:mon_grouping          3
en_PH:grouping               3
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$ LC_ALL=en_US.UTF-8 /usr/bin/printf "%'f\n" 12345678.9
12,345,678.900000
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$ LC_ALL=en_PH.UTF-8 /usr/bin/printf "%'f\n" 12345678.9
12,345,678.900000
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (master $%)
$
Comment 5 Oscar Gustafsson 2024-01-02 16:37:55 UTC
Thanks for the reply.

Yes, they behave the same, but for consistency reasons I believe that one of them should be selected. 

Two reasons:

* When trying to understand how to specify these strings, the mix of formats (and redundant information) is rather confusing.

* There are other tools relying on these files and it would be better if there are fewer corner cases to handle/optimizations to be done.

I've later learnt that -1 is translated into "" by localeconv. Hence, one may suspect that 0;0 works because it translates into three(?) string termination characters. While this clearly works, one can hardly argue that it makes sense.

For the 3;3 case, it may make sense for user code to check whether there is a single digit and, in that case, take a fast path, which 3;3 will never trigger.
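A minimal sketch of what such a consumer might do today to get a canonical form before checking for the fast path. The helper names `normalize_grouping` and `has_fast_path` are hypothetical, and mapping an all-zero operand to -1 is an assumption based on the behaviour discussed in this bug.

```python
def normalize_grouping(spec: str) -> str:
    """Reduce a locale-source grouping operand to its shortest equivalent."""
    sizes = [int(part) for part in spec.split(";")]
    # Trailing zeros only mean "repeat the previous size", which is implied.
    while sizes and sizes[-1] == 0:
        sizes.pop()
    if not sizes:
        return "-1"  # assumption: only zeros means no grouping
    # A positive last size is repeated anyway, so drop duplicates of it.
    while len(sizes) >= 2 and sizes[-1] > 0 and sizes[-1] == sizes[-2]:
        sizes.pop()
    return ";".join(str(size) for size in sizes)


def has_fast_path(spec: str) -> bool:
    """True if every group has the same positive size."""
    canonical = normalize_grouping(spec)
    return ";" not in canonical and int(canonical) > 0


for spec in ("3;3", "3", "0;0", "-1", "3;2;2"):
    print(spec, normalize_grouping(spec), has_fast_path(spec))
```

If the locale files already used the canonical form, the normalization step would be unnecessary for them.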

Or put another way: what is the benefit of having inconsistent data that may lead to redundant storage and additional computations?
Comment 6 Mike FABIAN 2024-01-18 15:11:58 UTC
OK, then I’ll change 0;0 ➡️ -1 and 3;3 ➡️ 3.
Comment 8 Mike FABIAN 2024-01-19 14:21:40 UTC
(In reply to Oscar Gustafsson from comment #5)

> * There are other tools relying on these files and it would be better if
> there are fewer corner cases to handle/optimizations to be done.

These other tools nevertheless need to be able to parse '3;3' and '0;0', as both remain valid.
Comment 10 Sourceware Commits 2024-01-25 10:41:39 UTC
The master branch has been updated by Mike Fabian <mfabian@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5176a830e70140cb3390c62b7d41f75dbbf33c7c

commit 5176a830e70140cb3390c62b7d41f75dbbf33c7c
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Jan 18 16:52:03 2024 +0100

    localedata: Use consistent values for grouping and mon_grouping
    
    Resolves: BZ # 31205
    
    Adapt test cases in test-grouping_iterator.c
Comment 11 Mike FABIAN 2024-01-25 10:50:02 UTC
Fixed in glibc master.