Bug 23857 - Esperanto has no country
Summary: Esperanto has no country
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2018-11-04 13:26 UTC by Carmen Bianca Bakker
Modified: 2018-12-20 09:38 UTC

fweimer: security-


Description Carmen Bianca Bakker 2018-11-04 13:26:09 UTC
Since glibc 2.24, Esperanto has been available as the `eo.utf8` locale.  It was
added as more-or-less the only locale not to have an associated country.  For
translations, this works sufficiently well.  The problem, however, is that a lot
of projects don't handle the no-country locale very well.

- In GNOME's gnome-control-center, the user is given a choice to pick a language
  and a "format" (locale).  Esperanto is a language choice, but not a locale
  choice.  Instead, it defaults to "United States (English)".

- In Python's `locale` module, if you unset all LC_* variables and run `LANG=eo
  python3`, you get `locale.getlocale() == ('eo_XX', 'ISO8859-3')`.

- In a lot of packages, you'll see something like `*_*` to match all locales.
  Esperanto has to be separately mentioned such that the expression becomes `eo
  *_*`.  See <https://bugzilla.redhat.com/show_bug.cgi?id=1643756>.
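
The `*_*` case in the last item can be illustrated with a small sketch.  It assumes the pattern is applied as a shell-style glob, emulated here with Python's `fnmatch`; the locale names and the `matching` helper are only illustrative and are not taken from any particular package.

```python
from fnmatch import fnmatchcase

# Sample locale names; "eo" is the only one without a territory part.
locales = ["de_DE", "nl_NL", "ia_FR", "eo"]


def matching(patterns, names):
    """Return the names matched by at least one shell-style pattern."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# "*_*" requires an underscore, so the country-less "eo" is skipped.
only_glob = matching(["*_*"], locales)

# Listing "eo" explicitly next to the glob picks it up again.
with_eo = matching(["eo", "*_*"], locales)

print(only_glob)
print(with_eo)
```

Every additional country-less locale would need another explicit entry next to the glob, in every package that uses this idiom.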

I still need to file bug reports for the first two examples, and there are more
examples that I haven't recorded in long-term memory.  The recurring problem,
however, is that Esperanto is the exception.  It's a special case that a lot of
projects don't account for, because what language could possibly not have a
country?

A simple, satisfactory solution would be to no longer make Esperanto a special
case.  Make it the same as all the other locales, and the problems will sort of
go away.  There are a couple of approaches to this:

1. Create "eo_NL" just like Interlingua---an auxiliary conlang similar to
   Esperanto---has "ia_FR".  Separate locales might need to be created for
   different countries.

2. Create "eo_XX" or "eo_EO" as an exact copy of the current "eo" locale,
   excluding a lot of LC_ADDRESS information.

3. Create "eo_XX" or "eo_EO" with a fake "Esperantujo" country and currency.

4. Add a fake "Esperantujo" country and currency to the current "eo" locale,
   which might solve some problems, maybe?

5. Some combination of the above.

I have a slight preference for the first solution.  Users would be able to use
Esperanto while retaining their local currency, date formatting, etc.
It is also preferable in the sense that Interlingua already does this, thus
a precedent has been set.

Alternative #6 is to keep the status quo and fix all the bugs in third party
projects that do not account for the special case of Esperanto.  This doesn't
scale very well, though.  If another no-country language comes along, it will
have to be added as an exception to these other projects again.  It's also
cumulatively just a lot of work for a special case that not so many people use,
anyway.

I've briefly talked to Rafal about this issue on Fedora's trans list.  I think
we agree that it's not really a glibc bug, but rather a lot of tiny bugs in a
lot of projects that use glibc, which is why I felt hesitant about reporting it
here.
Comment 1 Florian Weimer 2018-11-05 13:44:17 UTC
I'm not really convinced this is a glibc bug.

Wouldn't it make sense to fix application bugs instead?

There are other artificial languages which may face the same issue once we add them to glibc.  Yiddish currently has a US locale, but isn't this a bit odd?
Comment 2 Carmen Bianca Bakker 2018-11-05 15:33:32 UTC
(In reply to Florian Weimer from comment #1)
> I'm not really convinced this is a glibc bug.
> 
> Wouldn't it make sense to fix applications bugs instead?

I agree that it isn't, and I agree that it would make sense to fix application bugs. The problem is that those application bugs happen because glibc presents a special case, and one could undo all these application bugs simultaneously by making sure that the special case isn't special anymore.

Even something so simple as my proposed solution 2 would get rid of a lot of bugs in programs that expect all locales to look like lang_COUNTRY.
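
As an illustration of the application-side alternative, here is a minimal sketch of a parser that treats the territory as optional.  It assumes the common `language[_TERRITORY][.codeset][@modifier]` naming scheme in a simplified form (names such as `C` or `POSIX` are deliberately not handled); `LOCALE_RE` and `parse_locale` are hypothetical names, not part of glibc or any existing project.

```python
import re

# language[_TERRITORY][.codeset][@modifier]; only the language is mandatory.
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale(name):
    """Split a locale name into parts; missing parts come back as None."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised locale name: {name!r}")
    return match.groupdict()


print(parse_locale("eo.utf8"))
print(parse_locale("nl_NL.UTF-8"))
```

Code that assumes a territory is always present fails on `eo`; code written like this simply gets `None` for the territory and can decide what to do with it.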

> There are other artificial languages which may face the same issue once we
> add it to glibc.  Yiddish currently has a US locale, but isn't this a bit
> odd?

If there's a sizeable population of Yiddish speakers in the US, then that probably makes sense. It wouldn't make sense for Yiddish speakers outside of the US, though.  Problem is: Do you want to create a glibc locale for every possible country where Yiddish is spoken in some capacity? That would ultimately be the best solution for users, but might cause an annoying maintenance burden on glibc.

Ideally I'd like to see language and country completely separated from each other instead of combined in locales, because that would ultimately make the most sense, but that would be a super big redesign that I am not comfortable with proposing.  I'm currently limiting my scope to making Esperanto (more) usable on Fedora Workstation, and I think some of my above suggestions could significantly improve the status of Esperanto with relatively little effort (compared to fixing all application bugs).
Comment 3 Dmitry V. Levin 2018-11-05 16:00:56 UTC
(In reply to Florian Weimer from comment #1)
> There are other artificial languages which may face the same issue once we
> add it to glibc.  Yiddish currently has a US locale, but isn't this a bit
> odd?

The comment is confusing.
Comment 4 Florian Weimer 2018-11-05 16:11:22 UTC
(In reply to Dmitry V. Levin from comment #3)
> (In reply to Florian Weimer from comment #1)
> > There are other artificial languages which may face the same issue once we
> > add it to glibc.  Yiddish currently has a US locale, but isn't this a bit
> > odd?
> 
> The comment is confusing.

Sorry, the two sentences are really separate.  I did not want to imply that Yiddish is an artificial language.  I think the majority of Yiddish speakers is *not* located in the United States.  I suspect the locale was added under “US” because there was no precedent for a locale without a country at the time.
Comment 5 Rafal Luzynski 2018-11-07 23:47:53 UTC
Hello Carmen, thank you for filing this bug report.

(In reply to Carmen Bianca Bakker from comment #0)
> [...]
> I still need to file bug reports for the first two examples, and there are
> more
> examples that I haven't recorded in long-term memory.  [...]

I encourage you to file those bug reports.  Are they maybe caused by the previous bug in glibc packaging in Fedora?

> [...]
> 1. Create "eo_NL" just like Interlingua---an auxiliary conlang similar to
>    Esperanto--- has "ia_FR".  Separate locales might need to be created for
>    different countries.

I was not aware of this case with Interlingua.  I would rather go for renaming "ia_FR" to "ia" so that "eo" would not be alone anymore :-) but my knowledge about Interlingua is too little to enforce it now.

> [...]
> Alternative #6 is to keep the status quo and fix all the bugs in third party
> projects that do not account for the special case of Esperanto.

This is my preferred choice and therefore I agree with Florian (comment 1) that this is not a bug here.  Also, I think it's good if we approach other projects and explain to them how to fix the issue correctly.

(In reply to Carmen Bianca Bakker from comment #2)
> (In reply to Florian Weimer from comment #1)
> > There are other artificial languages which may face the same issue once we
> > add it to glibc.  Yiddish currently has a US locale, but isn't this a bit
> > odd?
> 
> If there's a sizeable population of Yiddish speakers in the US, then that
> probably makes sense.

As far as I know, yes: there is a large population of Yiddish speakers in the US, about 160,000 people, and although I'm not sure, they are likely the largest Yiddish-speaking population in the world.

> It wouldn't make sense for Yiddish speakers outside of
> the US, though.  Problem is: Do you want to create a glibc locale for every
> possible country where Yiddish is spoken in some capacity? [...]

Most of the time this makes sense if two (or more) populations speaking the same language in two countries develop their languages to the extent that they differ a little and actually form two variants of a language.  Good examples are US English vs. British English or Brazilian Portuguese vs. European Portuguese.

A secondary reason is when we want to provide other locale-dependent settings for multiple countries speaking the same language.

So adding a locale makes sense if there is a population needing that.  The existence of a locale in CLDR and official recognition of a language by the local authorities are good arguments for adding a locale variant.

> Ideally I'd like to see language and country completely separated from each
> other instead of combined in locales, because that would ultimately make the
> most sense,

Multiple environment variables (LC_MESSAGES, LC_MEASUREMENT, etc.) solve this problem to some extent.  That means you don't have every combination of language/country, but you can choose a separate locale for different purposes, and that should usually be sufficient.
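
A rough sketch of how such a mix is resolved, assuming the usual POSIX lookup order (LC_ALL first, then the category's own variable, then LANG).  The `effective_locale` helper and the sample environment are hypothetical, and the GNU-specific LANGUAGE variable is not modelled.

```python
def effective_locale(category, env):
    """Resolve one locale category using the usual POSIX lookup order."""
    # LC_ALL overrides everything, then the specific LC_* variable,
    # then LANG; empty values count as unset.
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"


# Hypothetical setup: Esperanto messages, Dutch formats for the rest.
env = {
    "LANG": "nl_NL.UTF-8",
    "LC_MESSAGES": "eo.UTF-8",
}

messages = effective_locale("LC_MESSAGES", env)
monetary = effective_locale("LC_MONETARY", env)

print("LC_MESSAGES ->", messages)
print("LC_MONETARY ->", monetary)
```

With this setup only the messages come from the Esperanto locale, while currency, measurement and the other categories fall back to the locale named in LANG.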

> but that would be a super big redesign that I am not comfortable
> with proposing.

+1

> I'm currently limiting my scope to making Esperanto (more)
> usable on Fedora Workstation, and I think some of my above suggestions could
> significantly improve the status of Esperanto with relatively little effort
> (i.e., fixing all application bugs).

Thank you for your effort, please continue.

Again, I think this is not a bug but I don't mind if we discuss this here.
Comment 6 Carmen Bianca Bakker 2018-11-08 07:48:28 UTC
Hi Rafal,

(In reply to Rafal Luzynski from comment #5)
> I encourage you to file those bug reports.  Are they maybe caused by the
> previous bug in glibc packaging in Fedora?

https://gitlab.gnome.org/GNOME/gnome-control-center/issues/260 - Appears glibc-related, because the languages and locales/formats map directly to glibc options.  I wish I was more competent with C, and I'd try to fix it up myself.

https://bugs.python.org/issue35163 - Some weird obsolete configuration.

> I was not aware of this case with Interlingua.  I would rather go for
> renaming "ia_FR" to "ia" so that "eo" would not be alone anymore :-) but my
> knowledge about Interlingua is too little to enforce it now.

Is it okay to add the author of the original Interlingua bug report to this bug report?  Perhaps they can add an original insight, and perhaps their motivation for choosing "ia_FR" over "ia".

> > It wouldn't make sense for Yiddish speakers outside of
> > the US, though.  Problem is: Do you want to create a glibc locale for every
> > possible country where Yiddish is spoken in some capacity? [...]
> 
> Most of the time this makes sense if two (or more) populations speaking the
> same language in two countries develop their languages to the extent that
> they differ little and actually make two variants of a language.  Good
> examples are US English vs. British English or Brazilian Portuguese vs.
> European Portuguese.
> 
> A secondary reason is when we want to provide other locale-dependent
> settings for multiple countries speaking the same language.
> 
> So adding a locale makes sense if there is a population needing that. 
> Existence of a locale in CLDR and an official recognition of a language by
> the local authorities are good argument for adding a locale variant.

CLDR has "Unknown Region" listed under ZZ, which would work sufficiently well for country-less languages.  i.e., proposed solution 2, or solution 3 with "Unknown Region" as country (and "XXX" as currency).

https://unicode.org/cldr/charts/34/summary/root.html

It could also work for Yiddish, where "yi_US" is for the Yiddish population inside the US, and "yi_ZZ" could be used by non-US Yiddish populations who are spread across many other countries.  Though in the case of Yiddish specifically, it would probably make sense to add an Israel entry, but that will likely depend on a qualified volunteer doing the work.
Comment 7 Rafal Luzynski 2018-11-17 00:11:29 UTC
Hi,

I'm sorry for the delayed reply.

(In reply to Carmen Bianca Bakker from comment #6)
> [...]
> https://gitlab.gnome.org/GNOME/gnome-control-center/issues/260 - Appears
> glibc-related, because the languages and locales/formats map directly to
> glibc options.  I wish I was more competent with C, and I'd try to fix it up
> myself.

Thank you.  I have not looked at the source code yet but my guess is that the list of territories comes from the list of locales with the language part stripped.  This makes some sense to me: formats, units, etc. depend on the territory rather than the language.  For example, an English locale may have different units, currency, country name etc. for USA, UK, Australia, India, Ireland, and so on.  On the other hand, people living in one country probably use the same formats, units, and currency even if they speak different languages.  Therefore, if you want to select "Esperanto" as the locale for formats then... actually what would you expect?  Currency, country name, address format, car plate - "as used in (where?)"  Why would "Netherlands" not work better for you, for example?

I understand you may have some good reasons to select Esperanto formats but I'm trying to reflect the reasoning of the GNOME designers.

> https://bugs.python.org/issue35163 - Some weird obsolete configuration.

My first suggestion is that Python should not map ambiguous locales to more detailed ones that are not supported by the current operating system.

Would adding "eo.ISO8859-3" help to fix this issue?  I think the reason is that historically the locales without the encoding specified used 8-bit encoding like ISO 8859-1 or ISO 8859-3.  Therefore often the locales map to 8-bit encodings unless you specify "utf8" explicitly.  Later when Unicode became popular and widely used, newly added locales in glibc used UTF-8 as their only encoding.  This is the case of Esperanto: "eo" is an alias of "eo.UTF-8".  Somehow Python treats it as an alias of "eo_XX.ISO8859-3".

On the other hand I am not sure if adding the old encodings makes sense nowadays.  Old encodings are preserved only in order not to break existing systems.  Does any existing Linux system use "eo.ISO8859-3" and rely on it?  Is it likely to be true if this locale has never existed?
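
The mapping discussed above can be inspected with `locale.normalize()`, which consults Python's built-in alias table without touching the system locale.  This is only a way to observe the behaviour; the result depends on the Python version, so no output is shown here.

```python
import locale

# normalize() only looks at Python's own alias table; it does not check
# which locales glibc actually provides on this system.
for name in ("eo", "eo.utf8", "eo.UTF-8"):
    print(name, "->", locale.normalize(name))

normalized = locale.normalize("eo")
```

If the bare name `eo` comes back with a territory and an 8-bit codeset attached, that reflects Python's alias table rather than anything glibc ships.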

> (In reply to Rafal Luzynski from comment #5)
> > I was not aware of this case with Interlingua.  I would rather go for
> > renaming "ia_FR" to "ia" so that "eo" would not be alone anymore :-) but my
> > knowledge about Interlingua is too little to enforce it now.
> 
> Is it okay to add the author of the original Interlingua bug report to this
> bug report?  Perhaps they can add an original insight, and perhaps their
> motivation for choosing "ia_FR" over "ia".

The bug report is https://sourceware.org/bugzilla/show_bug.cgi?id=14879 but I wouldn't like to bother the authors of the Interlingua patch with the issues of Esperanto.

By the way, it has been recently considered a bug by CLDR to assign Interlingua to France:

http://unicode.org/cldr/trac/ticket/11164

This raises my motivation to rename "ia_FR" to "ia" but not to the level sufficient to actually do it.

> [...]
> CLDR has "Unknown Region" listed under ZZ, which would work sufficiently
> well for country-less languages.  i.e., proposed solution 2, or solution 3
> with "Unknown Region" as country (and "XXX" as currency).
> 
> https://unicode.org/cldr/charts/34/summary/root.html

It is possible as a workaround but I still believe we are able to handle "eo" without a country name.  Even more: we (the glibc project) are able to handle it and as there are projects which do not (yet) handle it correctly I think we should rather approach them and tell them how to fix it.  So far I don't think we have found any project where the issue exists and cannot be fixed.

> It could also work for Yiddish, where "yi_US" is for the Yiddish population
> inside the US, and "yi_ZZ" could be used by non-US Yiddish populations who
> are spread across many other countries.  Though in the case of Yiddish
> specifically, it might probably make sense to add an Israel entry, but that
> will likely depend on a qualified volunteer doing the work.

Definitely not: Yiddish is not an artificial language and is definitely tied to specific territories where it is actually spoken.  It seems to me that Israel could make sense and I don't mind adding it if needed; the USA probably makes sense too.  I don't think that calling Yiddish "worldwide" or "non-US" or "unknown" (in terms of territory) makes sense, because we could say the same about any random language.

And please, if possible let's focus on Esperanto here rather than discussing possible changes in other languages.
Comment 8 Carmen Bianca Bakker 2018-11-20 12:02:45 UTC
(In reply to Rafal Luzynski from comment #7)
> Thank you.  I have not looked at the source code yet but my guess is that
> the list of territories comes from the list of locales with language part
> stripped.  This makes some sense to me: formats, units, etc. depend on the
> territory rather than language.  For example, English locale may have
> different units, currency, country name etc. for USA, UK, Australia, India,
> Ireland, and so on.  On the other hand, people living in one country
> probably use the same formats, units, and currency even if they speak
> different languages.  Therefore, if you want to select "Esperanto" as the
> locale for formats then... actually what would you expect?  Currency,
> country name, address format, car plate - "as used in (where?)"  Why
> "Netherlands" would not work better for you, for example?

The chief problem in selecting "Netherlands" is that LC_TIME won't have the correct language.  I would much rather individually select each LC_* option, but GNOME does not support that in its graphical interface.

> > https://bugs.python.org/issue35163 - Some weird obsolete configuration.
> 
> My first suggestion is that Python should not map ambiguous locales into
> detailed ones but not supported by the current operating system.
> 
> Would adding "eo.ISO8859-3" help to fix this issue?  I think the reason is
> that historically the locales without the encoding specified used 8-bit
> encoding like ISO 8859-1 or ISO 8859-3.  Therefore often the locales map to
> 8-bit encodings unless you specify "utf8" explicitly.  Later when Unicode
> became popular and widely used, newly added locales in glibc used UTF-8 as
> their only encoding.  This is the case of Esperanto: "eo" is an alias of
> "eo.UTF-8".  Somehow Python treats it as an alias of "eo_XX.ISO8859-3".
> 
> On the other hand I am not sure if adding the old encodings makes sense
> nowadays.  Old encodings are preserved only in order not to break existing
> systems.  Does any existing Linux system use "eo.ISO8859-3" and rely on it? 
> Is it likely to be true if this locale has never existed?

I don't think anything needs to be changed from glibc's end for this bug.  This appears to be a Python-only oddity---I have never encountered eo.ISO8859-3 anywhere else.

> It is possible as a workaround but I still believe we are able to handle
> "eo" without a country name.  Even more: we (the glibc project) are able to
> handle it and as there are projects which do not (yet) handle it correctly I
> think we should rather approach them and tell them how to fix it.  So far I
> don't think we have found any project where the issue exists and cannot be
> fixed.

I don't disagree, but wouldn't changing this in glibc be a much easier solution compared to the laborious process of opening bug reports everywhere to handle a special case?

For instance, if we assume for a moment that "ia_FR" will become "ia", then a lot of packages in a lot of distributions will need to change their for-loops to `for eo ia *_*`.  This is cumulatively a lot of work for minority languages.  A simple "ia_ZZ/eo_ZZ" would remove the special case and save a lot of work.

> Definitely no, Yiddish is not an artificial language and definitely is
> related with some territories where it is actually spoken.  It seems to me
> that Israel could make sense and I don't mind adding it if needed, probably
> also USA makes sense.  I don't think that calling Yiddish "worldwide" or
> "non-US" or "unknown" (in terms of territory) makes sense because we can
> tell the same about any random language.

I didn't imply that Yiddish is an artificial language.  I implied that having a catch-all "yi_ZZ" would save a lot of work over creating individual locales for all the countries in the world where Yiddish is spoken in some capacity, which is a lot of countries.  In that capacity, Yiddish is an excellent comparison to Esperanto, because both languages have a diaspora across the globe rather than a defined nation state.
Comment 9 Florian Weimer 2018-11-20 12:17:23 UTC
(In reply to Carmen Bianca Bakker from comment #8)
> The chief problem in selecting "Netherlands" is that LC_DATE won't have the
> correct language.  I would much rather individually select each LC_* option,
> but GNOME does not support that in its graphical interface.

The problem is that GNOME (and KDE) removed the previously existing functionality for separate category selection, without considering its implications.  I don't think we can work around the lack of such configuration options in glibc because it would increase the number of locales to a ridiculous amount.  This is less of an issue for systems like Debian which generate locale data files on demand, but it's hard on those systems which use a pre-computed locale archive, such as Fedora.
Comment 10 Pander 2018-12-20 09:38:14 UTC
As long as locales for very local languages such as fy_DE, fy_NL, li_NL, nds_NL, nds_DE et cetera are being supported here, while these don't even have a spell checker and lack translations for many major applications, I think support for Esperanto and English locales for specific countries is justified.

Mixing of locale categories simply doesn't work as it should. Configuration tools of operating systems don't support it, and doing it manually in configuration files or start-up scripts is way too complex for most users. But it is even more subtle:

Looking at existing locales such as en_DK, en_SE, en_DE and en_NL, about 50% to 75% of such a locale can be accomplished by reusing existing definitions via copy. The remaining part, however, consists of custom definitions for certain categories that cannot be realized by copy alone. The mixing of definitions takes place within the specific category.