Locales in GLIBC

Introduction

This page provides an introduction to locales in general, a more detailed description of how to use locales (both as a user and as a developer), and instructions for creating and updating locale definition files for the GNU C Library.

Overview

In short, locales are collections of language- and country-specific conventions that allow software to adapt to the user's preferences.

Each locale has the following categories, each specifying a set of related conventions that the user can select:

  • LC_IDENTIFICATION - this is not a user-visible category; it contains information about the locale itself and is rarely useful for users or developers (but is listed here for completeness' sake).
  • LC_CTYPE - this category applies to classification and conversion of characters and is relevant when dealing with non-ASCII characters.
  • LC_COLLATE - this category applies to collation of strings and is relevant when processing strings or text in a certain language.
  • LC_TIME - this category applies to formatting date and time values, for example whether to present the current time in 12 or 24 hour format.
  • LC_NUMERIC - this category applies to formatting numeric values that are not monetary, for example the decimal separator.
  • LC_MONETARY - this category applies to formatting monetary values, for example the currency symbol.
  • LC_MESSAGES - this category applies to selecting the language used in the user interface for message translations and defines expressions for affirmative and negative responses. The software providing the user interface must provide the actual message translations for the selected language for LC_MESSAGES to have the desired effect.
  • LC_PAPER - this category applies to selecting the paper size (A4 or US Letter).
  • LC_MEASUREMENT - this category applies to selecting the measurement system (metric or US).
  • LC_NAME - this category applies to formatting a person's name and title.
  • LC_ADDRESS - this category applies to postal addresses and country and language names.
  • LC_TELEPHONE - this category applies to formatting telephone numbers.

The locales of the GNU C Library are named using the pattern ll_CC.SSS, where ll refers to the language, CC refers to the country, and SSS refers to the character set to use. For example, the English language as used in Canada with the UTF-8 character set is referred to as en_CA.UTF-8. (For more information on character sets, please see http://www.joelonsoftware.com/articles/Unicode.html.)
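
The naming pattern can be taken apart with plain shell parameter expansion. The following is only a minimal sketch: the function name split_locale is hypothetical, and it assumes a name that contains all three parts.

```shell
# split_locale NAME
# Split a locale name of the form ll_CC.SSS into its three parts.
# Hypothetical helper; assumes NAME contains both "_" and ".".
split_locale() {
    name=$1
    language=${name%%_*}    # text before the first "_"
    rest=${name#*_}         # remainder, "CC.SSS"
    country=${rest%%.*}     # text before the first "."
    charset=${rest#*.}      # text after the first "."
    printf 'language=%s country=%s charset=%s\n' "$language" "$country" "$charset"
}

split_locale en_CA.UTF-8    # prints: language=en country=CA charset=UTF-8
```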

Using Locales

A user who wishes to use one or more of the above categories can enable them by defining the corresponding environment variables, for example in the ~/.i18n file, which many systems read automatically.

Two additional environment variables are considered when selecting locales: LANG and LC_ALL. LANG sets the default locale for all categories, which can then be overridden by defining individual category variables. LC_ALL forces the locale for all categories, and the selection cannot be overridden. LC_ALL is mostly recommended for scripts only (for example, to make sure characters and collation behave as the developer expects regardless of the user's preferences), while LANG should generally be preferred for the flexibility it provides.

The following example sets the default locale to Mexican Spanish, then sets collation rules to be based on the standard C locale, and finally sets monetary formatting to follow US conventions.

  LANG=es_MX.UTF-8
  LC_COLLATE=C
  LC_MONETARY=en_US.UTF-8

Since LANG is defined above but LC_MESSAGES is left undefined, message translations will use Mexican Spanish. The same logic applies to the other undefined categories, such as LC_TIME.
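
The precedence between LC_ALL, the category variables and LANG can be sketched as a small shell function. This is an illustration of the selection order only, not how the C library implements it; the function name resolve_locale is hypothetical, and it takes the three values as arguments so that the calling shell's own locale is left untouched.

```shell
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Print the locale that applies to one category:
# LC_ALL wins, then the category's own variable, then LANG.
# Hypothetical helper; an empty argument stands for an unset variable.
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        printf 'C\n'    # nothing set at all: the default C locale applies
    fi
}

# Settings from the example: LANG=es_MX.UTF-8, LC_COLLATE=C,
# LC_MONETARY=en_US.UTF-8, LC_ALL unset.
resolve_locale "" ""            "es_MX.UTF-8"    # LC_MESSAGES -> es_MX.UTF-8
resolve_locale "" "C"           "es_MX.UTF-8"    # LC_COLLATE  -> C
resolve_locale "" "en_US.UTF-8" "es_MX.UTF-8"    # LC_MONETARY -> en_US.UTF-8
```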

Developing with Locales

The online GNU C Library manual provides a good starting point for developers creating applications supporting locales. At least the following pages are relevant:

The Linux manual pages were greatly improved during 2014 and are now almost complete; the following pages serve as a good starting point:

Application developers should note that while some of the resulting strings (like int_curr_symbol of LC_MONETARY) are required to be of a certain length, other resulting strings may vary in length, especially between locales (see https://sourceware.org/ml/libc-locales/2014-q2/msg00046.html and https://sourceware.org/ml/libc-locales/2014-q2/msg00051.html for examples). Testing applications with several locales is therefore recommended, to make sure that strings of different lengths do not cause inconsistencies in the user experience.
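
One rough way to get a feel for such differences is to compare the longest day name across a few locales. The sketch below is only an illustration: longest_item is a hypothetical helper, the three locale names are just the ones used as examples on this page, and a locale that is not installed silently falls back to the C locale values. Depending on the awk implementation, lengths may be counted in bytes rather than characters.

```shell
# longest_item LIST
# Print the length of the longest ";"-separated item in LIST.
# Hypothetical helper for illustration.
longest_item() {
    printf '%s\n' "$1" | tr ';' '\n' |
        awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Compare the day keyword of LC_TIME in a few locales.
for loc in en_US.UTF-8 es_MX.UTF-8 fi_FI.UTF-8; do
    days=$(LC_ALL=$loc locale day 2>/dev/null) || continue
    printf '%s: longest day name is %s long\n' "$loc" "$(longest_item "$days")"
done
```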

GLIBC Locale Internals

The locale API and the definitions of concrete locales form a fairly self-contained part of the GNU C Library; related to this is the subsystem that deals with various charsets and converts between them.

  • locale/ directory contains the source for the locale API and support tools (localedef, locale, ...).

  • localedata/ directory (semi-independent with its own README and ChangeLog) contains the localedata/locales/ definitions of the set of locales available on a default GNU system, the localedata/charmaps/ character maps for available charsets, a testsuite, and a few helper files.

  • iconv/ directory contains the source for the iconv charset conversion API and duct-tape for gconv modules implementing concrete charsets.

  • iconvdata/ directory contains the modules for the concrete charsets themselves.

Creating and Updating Locale Data

Please keep the following in mind before starting to work on glibc locales. If in doubt, send a question to the libc-alpha or libc-locales mailing list.

Qualifications

The glibc maintainers cannot easily judge on their own if your new version is correct.

  • If your proposed contribution is a non-trivial change, you should get approval from the locale maintainer or the original creator.
  • In case they are unreachable or there is a dispute, you should cite authoritative references and/or multiple widely used websites (government pages or websites of major national newspapers are good choices).
  • If the change is not clear-cut, for example if you want to switch between two common usages, it is good to gather (and link to) feedback from the local user community.
  • If at all possible, the proposed changes should be aligned with the data in the Unicode Common Locale Data Repository, CLDR - http://cldr.unicode.org/.

The locale definitions are in a specific file format, which is described below and in the manual pages; please read both for a complete understanding.

Locale File Format

Locale definitions use a specific file format; the most relevant notes can be found in the locale(5) manual page. POSIX also describes the format and some fields, but not all that are commonly used in glibc (e.g. week start definitions). All strings in the glibc locale files use Unicode entity specifications instead of plain characters. To quickly inspect a locale file, run gcc -o show-ucs-data localedata/show-ucs-data.c (no build preparation is needed for this, not even ./configure) and then ./show-ucs-data localedata/locales/en_US.

One additional resource describing locale categories and category members is the ISO/IEC TR 14652:2002(E) Technical Report (PDF). However, some members are described below to address some glibc-specific requirements and formatting issues. In case of doubt, please refer to the manual pages, look at other glibc locales for examples, or send a question to the libc-alpha or libc-locales mailing list.

Comments

You should provide plenty of comments in the locale file, both about the individual members of each category and also any relevant references. If something is left undefined on purpose, the reason should be stated.

LC_IDENTIFICATION

This category is largely self-explanatory. You should be able to fill it in by using other locales as examples.

LC_CTYPE

This category deals with character sets and transliteration rules, both of which are most often based on Unicode standards. It defines, for example, which characters are considered alphabetic and how to transliterate characters and text from one encoding to another. See other locales for examples of how to implement these rules, and the Testing Locales section below for tips on how to verify the implementation. Note also bug 14094 and bug 16061.

LC_COLLATE

Most countries and languages have official guidelines and standards for collation rules. Often these are based on the well-known ISO 14651 standard or on the Unicode collation algorithm. See other locales for examples of how to implement the required collation rules, and the Testing Locales section below for tips on how to verify that the implementation matches the relevant standards. Note also bug 14095.

LC_TIME

This is one of the most often used categories. The non-obvious members of this category are as follows:

  • am_pm and t_fmt_ampm - should be empty if using 24 hour time

  • week DAYSINWEEK;WEEKSTARTDATE;MINWEEKLEN - DAYSINWEEK is usually 7; MINWEEKLEN is the minimal length of the first week of the year (usually 4). WEEKSTARTDATE is the most confusing: it should be a date that corresponds to the beginning of a week. It is typically either 19971130 (Sunday) or 19971201 (Monday), the former being used by almost all glibc locales (see below).

  • first_weekday N - number of the day of the week to be shown in the first column of a calendar. Defaults to 1.

  • first_workday N - number of the first working day in the week. Defaults to 2.

Furthermore, there is the question of which day of the week the abday and day lists should start with. The specifications say Sunday, but they do not mention any of the week start specifiers. Applications aware of these specifiers tend to interpret the abday and day lists in a more complicated way.

The tricky part is reconciling the information from WEEKSTARTDATE and first_weekday. PetrBaudis wrote some lengthy treatises about this on libc-locales; we present the outcome, which is our de facto current interpretation:

  • WEEKSTARTDATE specifies the base of the abday and day lists

  • first_weekday specifies the offset of the first day-of-week in the abday and day lists

  • For compatibility reasons, all locales should set WEEKSTARTDATE to 19971130 (Sunday), base the abday and day lists accordingly, and set first_weekday to 1 or 2 depending on whether their week actually starts on Sunday or Monday.

Thus, for example, the en_GB definition (an English locale with the week starting on Monday) is:

  week          7;19971130;4
  first_weekday 2
  first_workday 2
  day           "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
  abday         "Sun;Mon;Tue;Wed;Thu;Fri;Sat"

Once your locale is compiled, you can use a simple first_weekday test tool to check that the day definitions are correct.
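
The interpretation described above can be sketched in shell: with the day list based on Sunday, first_weekday is simply a 1-based index into that list. This is not the test tool mentioned above; first_day_name is a hypothetical helper, shown only to illustrate the offset logic.

```shell
# first_day_name DAY_LIST FIRST_WEEKDAY
# Print the entry of the ";"-separated DAY_LIST selected by FIRST_WEEKDAY
# (1-based), i.e. the day shown in the first column of a calendar.
# Hypothetical helper for illustration.
first_day_name() {
    printf '%s\n' "$1" | cut -d ';' -f "$2"
}

# Values from the en_GB definition above.
first_day_name "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday" 2    # prints: Monday
```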

LC_NUMERIC

This category has only three members. See the manual page, POSIX, and ISO TR 14652 references and e.g. http://h71000.www7.hp.com/doc/73final/6494/6494pro_003.html for more information on grouping.

LC_MONETARY

This category is well described in the manual page, POSIX, and ISO TR 14652 references.

LC_MESSAGES

This category defines the regular expressions to be accepted as positive or negative responses, and the equivalents of yes and no.

In yesstr and nostr, the first letter should reflect whether the word is normally capitalized or written in lowercase. This is in line with other specifications in glibc locales, where e.g. day names and month names reflect the use of lower and upper case as prescribed in dictionaries, which record the canonical form of the word or phrase. This is also in accordance with use in POSIX and ISO TR 14652, and general guidelines for definitions in ISO/IEC.
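
As a rough illustration of how an application might use such an expression, the sketch below matches a user's answer against an affirmative pattern. The helper name is_affirmative is hypothetical, and the pattern is only an illustrative value; a real script could obtain the pattern of the current locale with locale yesexpr.

```shell
# is_affirmative ANSWER YESEXPR
# Succeed if ANSWER matches the extended regular expression YESEXPR.
# Hypothetical helper for illustration.
is_affirmative() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

yesexpr='^[yY]'    # illustrative value; see "locale yesexpr" for the real one
if is_affirmative "Yes" "$yesexpr"; then
    echo "affirmative"
else
    echo "not affirmative"
fi
```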

LC_PAPER

Here A4 is 297x210 mm and US Letter is 279x216 mm (height x width).

LC_MEASUREMENT

Here 1 means metric, 2 means US.

LC_NAME

This category is well explained in the manual page, POSIX, and ISO TR 14652.

name_fmt should always be defined; other members should be defined only if they are commonly used.

LC_ADDRESS

This category is explained in the manual page, POSIX, and ISO TR 14652. The following notes apply to glibc locales:

  • country_ab2, country_ab3 - two/three-letter ISO 3166 country code

  • country_name, lang_name - name of the country and of the language, written in this language

  • country_num - ISO 3166 numeric code in simple numbers without quotes (others like country_isbn are quoted Unicode points as usual)

  • lang_ab - two-letter ISO 639 code if available, empty if not (when only three-letter code is available)

  • lang_term - three-letter ISO 639-2/T (Terminology) code

  • lang_lib - three-letter ISO 639-2/B (Bibliographic) code

Applications should prefer lang_term over lang_lib. There are 20 specific ISO 639-2/B codes; both ISO 639-2/T and ISO 639-2/B codes are listed at http://www.loc.gov/standards/iso639-2/langhome.html.

LC_TELEPHONE

This category is well explained in the manual page, POSIX, and ISO TR 14652.

Testing Locales

After modifying a locale, make sure it compiles, and install it to a temporary directory for testing. The following example is run from the root of the glibc source tree:

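  # Locale to test and a private directory to install it into
  LOCALE=fi_FI
  export LOCPATH=$HOME/locale-test/
  mkdir -p "$LOCPATH"
  # Compile the locale from the source tree and install it under $LOCPATH
  I18NPATH=./localedata/ localedef -f UTF-8 -i "$LOCALE" "$LOCPATH/$LOCALE.UTF-8"
  # Show the contents of a category, then the value of a single keyword
  LC_ALL=$LOCALE.UTF-8 locale -ck LC_TIME
  LC_ALL=$LOCALE.UTF-8 locale -ck date_fmt
  # Run commands that exercise date formatting, transliteration and collation
  LC_ALL=$LOCALE.UTF-8 date
  LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt
  LC_ALL=$LOCALE.UTF-8 sort < sorting-test-input.txt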

These commands prepare a user directory for testing a locale, compile and install the locale, show the contents of a category and the value of a certain keyword, and run commands which use different categories.

If you have set up a glibc build environment, you can mass-test compilation of all locales:

  • make localedata/install-locales install_root=(configure's --prefix)

Invoking (configure's --prefix)/bin/localedef manually will likewise install under configure's --prefix.

See localedata/README and localedef(1) for more information about localedef.

Contributing

Submit locale updates as bug reports in the glibc Bugzilla. Before doing so, always try to get in touch with the locale maintainer; if this is unsuccessful, describe the changes you have made and (this is important) provide evidence that they reflect common usage, e.g. local government or major newspaper sites, references to language norms, etc.

Please test your changes before submitting them; see above for testing instructions.

See the Contribution checklist for complete contributing instructions.

Miscellaneous Information

Charsets

Internally, the iconv conversion always works by converting from the source charset to UCS-4 and then from UCS-4 to the target charset. This implies that the charset modules need to implement only the mapping to and from Unicode, and that characters not in Unicode are not convertible (luckily, this currently seems to be the case only for a few obscure ancient kanji characters).

Most of the charsets are simple (single-byte with a direct 1-1 Unicode mapping). The .c files for these are trivial; they depend on data provided by .h files, which are autogenerated by iconvdata/gen-8bit.sh from the localedata/charmaps/ files at build time.
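
The two-step conversion can be mimicked on the command line by chaining two iconv calls with UCS-4 in between. This only illustrates the idea from the outside and says nothing about the internal code paths; convert_via_ucs4 is a hypothetical helper, and the explicit big-endian variant UCS-4BE is chosen here simply to keep the intermediate byte order unambiguous.

```shell
# convert_via_ucs4 FROM TO
# Convert standard input from charset FROM to charset TO in two explicit
# steps, using UCS-4 as the intermediate form.
# Hypothetical helper; a single "iconv -f FROM -t TO" gives the same result.
convert_via_ucs4() {
    iconv -f "$1" -t UCS-4BE | iconv -f UCS-4BE -t "$2"
}

# "café" encoded in ISO-8859-1 (\351 is the byte 0xE9), converted to UTF-8.
printf 'caf\351\n' | convert_via_ucs4 ISO-8859-1 UTF-8
```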

None: Locales (last edited 2014-06-24 15:52:04 by MarkoMyllynen)