Locales in GLIBC
1. Introduction
This page provides an introduction to locales in general, a more detailed description of how to use locales (both as a user and as a developer), and instructions for creating and updating locale definition files with the GNU C Library.
2. Overview
Locales are, in short, collections of language- and country-specific conventions that allow software to adapt to the user's preferences.
Each locale has the following categories, each specifying a related set of conventions that the user can select:
- LC_IDENTIFICATION - this is not a user-visible category; it contains information about the locale itself and is rarely useful for users or developers (but is listed here for the sake of completeness).
- LC_CTYPE - this category applies to classification and conversion of characters and is relevant when dealing with non-ASCII characters.
- LC_COLLATE - this category applies to collation of strings and is relevant when processing strings or text in a certain language.
- LC_TIME - this category applies to formatting date and time values, for example whether to present the current time in 12 or 24 hour format.
- LC_NUMERIC - this category applies to formatting numeric values that are not monetary, for example the decimal separator.
- LC_MONETARY - this category applies to formatting monetary values, for example the currency symbol.
- LC_MESSAGES - this category applies to selecting the language used in the user interface for message translations and defines expressions for affirmative and negative responses. The software providing the user interface must provide the actual message translations for the selected language for LC_MESSAGES to have the desired effect.
- LC_PAPER - this category applies to selecting the paper size (A4 or US Letter).
- LC_MEASUREMENT - this category applies to selecting the measurement system (metric or US).
- LC_NAME - this category applies to formatting a person's name and title.
- LC_ADDRESS - this category applies to postal addresses and country and language names.
- LC_TELEPHONE - this category applies to formatting telephone numbers.
The locales of the GNU C Library are named using the pattern ll_CC.SSS, where ll refers to the language, CC refers to the country, and SSS refers to the character set to use. For example, the English language as used in Canada with the UTF-8 character set is referenced as en_CA.UTF-8. (For more information on character sets, see charsets(7) and http://www.joelonsoftware.com/articles/Unicode.html.)
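The naming pattern can be illustrated with a small shell sketch that splits a name into its parts. The function name parse_locale_name is made up for this example, and the sketch assumes plain ll_CC.SSS names without any further suffix.

```shell
# Split a locale name of the form ll_CC.SSS into language, territory and
# character set using POSIX parameter expansion.
# parse_locale_name is a hypothetical helper, not part of glibc.
parse_locale_name() {
    name=$1
    case $name in
        *.*) charset=${name#*.}; base=${name%%.*} ;;
        *)   charset=;           base=$name ;;
    esac
    case $base in
        *_*) language=${base%%_*}; territory=${base#*_} ;;
        *)   language=$base;       territory= ;;
    esac
    printf 'language=%s territory=%s charset=%s\n' \
        "$language" "$territory" "$charset"
}

parse_locale_name en_CA.UTF-8
# prints: language=en territory=CA charset=UTF-8
```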
3. Using Locales
A user can select a locale for one or more of the above categories by defining the corresponding environment variables, for example in the ~/.i18n file, which many systems read automatically.
Two additional environment variables are considered when selecting locales: LANG and LC_ALL. LANG sets the default locale for all categories, which can then be overridden by defining individual category variables. LC_ALL forces the locale for all categories, and the selection cannot be overridden. Using LC_ALL is recommended mostly in scripts (for example, to make sure characters and collation behave as the developer expected regardless of the user's preferences), while LANG should in general be preferred for the flexibility it provides.
The following example sets the default locale as Mexican Spanish, then sets collation rules to be based on the standard C locale, and finally sets monetary formatting to follow US conventions.
    LANG=es_MX.UTF-8
    LC_COLLATE=C
    LC_MONETARY=en_US.UTF-8
Since LANG is defined above but LC_MESSAGES is left undefined, message translations will use Mexican Spanish. The same logic applies to the other undefined categories, such as LC_TIME.
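The selection order described above (LC_ALL first, then the category variable, then LANG) can be sketched as a small shell function. The helper effective_locale is hypothetical and takes the settings as arguments instead of reading the real environment. The final fallback to the C locale is an assumption about the behaviour when nothing is set.

```shell
# Work out which locale applies to one category, given a list of
# NAME=value settings. effective_locale is a hypothetical helper.
effective_locale() {
    category=$1
    shift
    lc_all=
    lang=
    specific=
    for setting in "$@"; do
        case $setting in
            LC_ALL=*)      lc_all=${setting#*=} ;;
            LANG=*)        lang=${setting#*=} ;;
            "$category"=*) specific=${setting#*=} ;;
        esac
    done
    if [ -n "$lc_all" ]; then
        printf '%s\n' "$lc_all"       # LC_ALL overrides everything
    elif [ -n "$specific" ]; then
        printf '%s\n' "$specific"     # then the category's own variable
    elif [ -n "$lang" ]; then
        printf '%s\n' "$lang"         # then the LANG default
    else
        printf 'C\n'                  # assumed fallback when nothing is set
    fi
}

# The settings from the example; left unquoted below so that the shell
# splits them into separate arguments.
settings='LANG=es_MX.UTF-8 LC_COLLATE=C LC_MONETARY=en_US.UTF-8'
for c in LC_COLLATE LC_MONETARY LC_MESSAGES; do
    printf '%s -> %s\n' "$c" "$(effective_locale "$c" $settings)"
done
# prints:
# LC_COLLATE -> C
# LC_MONETARY -> en_US.UTF-8
# LC_MESSAGES -> es_MX.UTF-8
```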
The user can use custom locales by compiling them with localedef(1) and loading them from a directory pointed to by LOCPATH; see the GNU C Library online manual at Locale Names, the locale(1) manual page, and the Testing Locales section below for details and examples.
4. Developing with Locales
The online GNU C Library manual provides a good starting point for developers creating applications supporting locales. At least the following pages are relevant:
http://www.gnu.org/software/libc/manual/html_node/Locales.html
http://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html
http://www.gnu.org/software/libc/manual/html_node/Locale-Information.html
http://www.gnu.org/software/libc/manual/html_node/Character-Set-Handling.html
The Linux manual pages were greatly improved during 2014 and are now more or less complete; pages cited elsewhere on this page, such as locale(1), localedef(1), locale(5), and charsets(7), serve as a good starting point.
Application developers should note that while some of the resulting strings (like int_curr_symbol of LC_MONETARY) are required to be of a certain length, other resulting strings may vary in length, especially between locales. Thus, testing applications with several locales is recommended to make sure strings of different lengths do not cause inconsistencies in the user experience.
5. GLIBC Locale Internals
The locale API and the definitions of concrete locales form a fairly self-contained part of the GNU C Library; related to this is the subsystem dealing with various charsets and converting between them.
- The locale/ directory contains the source for the locale API and support tools (localedef, locale, ...).
- The localedata/ directory (semi-independent, with its own README) contains the localedata/locales/ definitions of the set of locales available on a default GNU system, the localedata/charmaps/ character maps for available charsets, a testsuite, and a few helper files.
- The iconv/ directory contains the source for the iconv charset conversion API and the glue for the gconv modules implementing concrete charsets.
- The iconvdata/ directory contains the modules for the concrete charsets themselves.
6. Creating and Updating Locale Data
Please keep the following in mind before starting to work on glibc locales. If in doubt, send a question to the libc-alpha/libc-locales mailing lists.
6.1. Qualifications
The glibc maintainers cannot easily judge on their own if your new version is correct.
- If your proposed contribution is a non-trivial change, you should get approval from the locale maintainer or the original creator.
- In case they are unreachable or there is a dispute, you should cite authoritative references and/or multiple widely used websites (government pages or websites of major national newspapers are good choices).
- If the change is not clear-cut, for example if you want to switch between two common usages, it is good to gather (and link to) feedback from the local user community.
If at all possible, the proposed changes should be aligned with the data in the Unicode Common Locale Data Repository, CLDR - http://cldr.unicode.org/.
It is probably most convenient to review CLDR data via the web interface. For example, the CLDR version 26 Finnish data is available at http://www.unicode.org/repos/cldr-aux/charts/26/summary/fi.html - change the version number and the language code as needed to see the data relevant to you.
The locale definitions are in a specific file format, which is described below and in the manual pages; please read both for a complete understanding.
6.2. Locale File Format
The locale definitions are in a specific file format; the most relevant notes are in the locale(5) manual page. POSIX also describes the format and some fields, but not all that are commonly used in glibc (e.g. week start definitions). Most numbers and strings in the glibc locale files use Unicode entity specifications instead of plain characters; for the exceptions, refer to other locale files and locale(5). To quickly inspect a locale file, run gcc -o show-ucs-data localedata/show-ucs-data.c (no build preparation is needed, not even ./configure) and then ./show-ucs-data localedata/locales/en_US to print the data in a more readable form.
One additional resource describing locale categories and category members is the ISO/IEC TR 14652:2002(E) Technical Report (PDF). However, some members are described below to address some glibc-specific requirements and formatting issues. In case of doubt, please refer to the manual pages, look at other glibc locales for examples, or send a question to the libc-alpha/libc-locales mailing lists.
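To give a feel for what the Unicode entity notation encodes, the following sketch decodes strings written as <UXXXX> code points into UTF-8 text. It is an illustration only and not a replacement for show-ucs-data. The helper names emit_byte and decode_ucs are made up for this example, and the sketch assumes that every <U...> sequence in the input holds valid hexadecimal digits.

```shell
# Print one byte given its numeric value, using an octal escape.
emit_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# Decode <UXXXX> code point notation into UTF-8; any other text is passed
# through unchanged. Hypothetical helper, assuming well-formed input.
decode_ucs() {
    rest=$1
    while [ -n "$rest" ]; do
        case $rest in
            "<U"*">"*)
                hex=${rest#"<U"}
                hex=${hex%%">"*}
                rest=${rest#*">"}
                cp=$((0x$hex))
                if [ "$cp" -lt 128 ]; then
                    emit_byte "$cp"
                elif [ "$cp" -lt 2048 ]; then
                    emit_byte $((192 + cp / 64))
                    emit_byte $((128 + cp % 64))
                elif [ "$cp" -lt 65536 ]; then
                    emit_byte $((224 + cp / 4096))
                    emit_byte $((128 + cp / 64 % 64))
                    emit_byte $((128 + cp % 64))
                else
                    emit_byte $((240 + cp / 262144))
                    emit_byte $((128 + cp / 4096 % 64))
                    emit_byte $((128 + cp / 64 % 64))
                    emit_byte $((128 + cp % 64))
                fi
                ;;
            *)
                # Copy the first character through and continue after it.
                printf '%s' "${rest%"${rest#?}"}"
                rest=${rest#?}
                ;;
        esac
    done
    echo
}

decode_ucs '<U0053><U0075><U006E>'
# prints: Sun
```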
6.2.1. Comments
You should provide plenty of comments in the locale file, both about the individual members of each category and also any relevant references. If something is left undefined on purpose, the reason should be stated.
6.2.2. LC_IDENTIFICATION
This category is pretty much self-explanatory. You should be able to fill in this category by using other locales as examples.
6.2.3. LC_CTYPE
This category deals with character sets and transliteration rules, both most often based on Unicode standards. It defines, for example, which characters are considered alphabetic and how to transliterate characters from one encoding to another. See locale(5) and other locales for inspiration on how to implement these rules. See the Testing Locales section below for tips on how to verify the implementation.
6.2.4. LC_COLLATE
In many countries and languages there are official guidelines and standards on collation rules. Often these are based on the well-known ISO 14651 standard or on the Unicode collation algorithm. See other locales for inspiration on how to implement the required collation rules. See the Testing Locales section below for tips on how to verify that the implementation matches the relevant standards. Note also bug 14095.
6.2.5. LC_TIME
This is one of the most often used categories. The non-obvious members of this category are as follows:
- am_pm and t_fmt_ampm - should be empty if using 24 hour time
- week DAYSINWEEK;WEEKSTARTDATE;MINWEEKLEN - DAYSINWEEK is usually 7; MINWEEKLEN is the minimal length of the first week of the year (usually 4). WEEKSTARTDATE is the most confusing: it should be a date that corresponds to the beginning of a week. It is typically either 19971130 (Sunday) or 19971201 (Monday), the former being used by almost all glibc locales (see below).
- first_weekday N - number of the day in the week to be shown in the first column of a calendar. Defaults to 1 (Sunday).
- first_workday N - number of the first working day in the week. Defaults to 2 (Monday).
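The weekdays stated for the two typical WEEKSTARTDATE values can be double-checked with a few lines of shell arithmetic. The sketch below uses Zeller's congruence, which is a technique chosen for this example and not something glibc itself is claimed to use; the function name weekday_of is likewise made up.

```shell
# Day of the week for a YYYYMMDD date in the Gregorian calendar, computed
# with Zeller's congruence (0 = Saturday, 1 = Sunday, ..., 6 = Friday).
# weekday_of is a hypothetical helper for checking WEEKSTARTDATE values.
weekday_of() {
    ymd=$1
    year=${ymd%????}
    monthday=${ymd#????}
    month=${monthday%??}
    day=${monthday#??}
    # Strip a leading zero so the values are not read as octal numbers.
    month=${month#0}
    day=${day#0}
    # January and February count as months 13 and 14 of the previous year.
    if [ "$month" -lt 3 ]; then
        month=$((month + 12))
        year=$((year - 1))
    fi
    k=$((year % 100))
    j=$((year / 100))
    h=$(( (day + 13 * (month + 1) / 5 + k + k / 4 + j / 4 + 5 * j) % 7 ))
    set -- Saturday Sunday Monday Tuesday Wednesday Thursday Friday
    shift "$h"
    printf '%s\n' "$1"
}

weekday_of 19971130   # prints: Sunday
weekday_of 19971201   # prints: Monday
```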
Furthermore, there is the question of the abday and day keywords and which day of the week the lists should start with. The specifications say Sunday, but they do not mention any of the week start specifiers. Applications aware of these tend to interpret the abday and day lists in a more complicated way.
The tricky thing is how to reconcile information from WEEKSTARTDATE and first_weekday. Petr Baudis wrote some lengthy treatises about this on libc-locales; we present the outcome and thus our de facto current interpretation:
- WEEKSTARTDATE specifies the base of the abday and day lists
- first_weekday specifies the offset of the first day-of-week in the abday and day lists
For compatibility reasons, all locales should set WEEKSTARTDATE to 19971130 (Sunday) and base the abday and day lists accordingly, and set first_weekday to 1 or 2 depending on whether their week actually starts on Sunday or Monday.
Thus, for example, the en_GB definition (English locale with the week starting on Monday) is:
    week 7;19971130;4
    first_weekday 2
    first_workday 2
    day "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
    abday "Sun;Mon;Tue;Wed;Thu;Fri;Sat"
When your locale is compiled, you can use a simple first_weekday test tool to check the day definitions are correct.
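The interpretation above, with first_weekday acting as a 1-based offset into a day list that starts on Sunday, can be sketched in shell as follows. The helper nth_field is hypothetical, and the values are copied from the en_GB example.

```shell
# Pick the N-th entry (1-based) from a semicolon-separated list.
# The body runs in a subshell so the IFS change stays local.
# nth_field is a hypothetical helper for this illustration.
nth_field() (
    n=$2
    IFS=';'
    set -f              # no pathname expansion while splitting
    set -- $1
    shift $((n - 1))
    printf '%s\n' "$1"
)

# Values from the en_GB example.
day="Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
first_weekday=2
first_workday=2

printf 'calendar starts on %s\n' "$(nth_field "$day" "$first_weekday")"
printf 'work week starts on %s\n' "$(nth_field "$day" "$first_workday")"
# prints:
# calendar starts on Monday
# work week starts on Monday
```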
6.2.6. LC_NUMERIC
This category has only three members. See the manual page, POSIX, and ISO TR 14652 references and e.g. http://h71000.www7.hp.com/doc/73final/6494/6494pro_003.html for more information on grouping.
6.2.7. LC_MONETARY
This category is well described in the manual page, POSIX, and ISO TR 14652 references.
Note that if a locale uses a new, previously undefined currency, it should be added to locale/iso-4217.def.
6.2.8. LC_MESSAGES
This category defines the regular expressions that are accepted as affirmative or negative responses, as well as the equivalents of the words yes and no.
In yesstr and nostr, the first letter should reflect whether the word is normally capitalized or written in lowercase. This is in line with other specifications in glibc locales, where e.g. day names and month names reflect the use of lower and upper case as prescribed in dictionaries, which record the canonical form of the word or phrase. This is also in accordance with the use in POSIX and ISO TR 14652, and with the general guidelines for definitions in ISO/IEC.
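How an application might use the response expressions can be sketched in shell. The yesexpr and noexpr values below are assumed samples in the style of a simple English locale and are not taken from any particular locale file; matches_expr and classify_response are hypothetical helper names.

```shell
# Succeed when a response matches an extended regular expression such as
# the yesexpr or noexpr member of LC_MESSAGES. Hypothetical helper.
matches_expr() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Assumed sample values for illustration.
yesexpr='^[yY]'
noexpr='^[nN]'

classify_response() {
    if matches_expr "$yesexpr" "$1"; then
        echo affirmative
    elif matches_expr "$noexpr" "$1"; then
        echo negative
    else
        echo unrecognized
    fi
}

classify_response Yes     # prints: affirmative
classify_response no      # prints: negative
classify_response maybe   # prints: unrecognized
```

On a system that provides the locale utility, running locale yesexpr prints the expression of the currently selected locale, which could be used in place of the sample values.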
6.2.9. LC_PAPER
Here A4 is 297x210 mm and US Letter is 279x216 mm (height x width).
Rather than define height and width explicitly, locales should copy either the locale of the main language for that territory (e.g. most locales in India copy hi_IN), or they should copy i18n (for A4) or en_US (for US Letter).
6.2.10. LC_MEASUREMENT
Here 1 means metric, 2 means US.
Rather than define measurement explicitly, locales should copy either the locale of the main language for that territory (e.g. most locales in India copy hi_IN), or they should copy i18n (for metric) or en_US (for US).
Keep in mind that the measurement field reflects the "main" unit system. Some territories use metric for distances and weights, but Fahrenheit for temperature. In these cases, it should be configured for metric.
6.2.11. LC_NAME
This category is well explained in the manual page, POSIX, and ISO TR 14652.
name_fmt should always be defined; the other members should be defined only if they are commonly used.
6.2.12. LC_ADDRESS
This category is explained in the manual page, POSIX, and ISO TR 14652. The following notes apply to glibc locales:
- country_ab2, country_ab3 - two/three-letter ISO 3166 country code
- country_name, lang_name - country, language name in this language
- country_num - ISO 3166 numeric code in simple numbers without quotes (others like country_isbn are quoted Unicode points as usual)
- country_car - international licence plate country code
- lang_ab - two-letter ISO 639 code if available, empty if not (when only three-letter code is available)
- lang_term - three-letter ISO 639-2/T (Terminology) code
- lang_lib - three-letter ISO 639-2/B (Bibliographic) code
Note that if a locale uses a new, previously undefined country and/or language code, they should be added to locale/iso-3166.def and/or to locale/iso-639.def.
Applications should prefer lang_term over lang_lib. Only 20 languages have a specific ISO 639-2/B code; both the ISO 639-2/T and ISO 639-2/B codes are listed at http://www.loc.gov/standards/iso639-2/langhome.html.
6.2.13. LC_TELEPHONE
This category is well explained in the manual page, POSIX, and ISO TR 14652.
6.3. Testing Locales
After modifying a locale, make sure it compiles, and install it to a temporary directory for testing. The following example is run from the root of the glibc source tree:
    LOCALE=fi_FI
    export LOCPATH=$HOME/locale-test/
    mkdir -p $LOCPATH
    I18NPATH=./localedata/ localedef -f UTF-8 -i $LOCALE $LOCPATH/$LOCALE.UTF-8
    LC_ALL=$LOCALE.UTF-8 locale -ck LC_TIME
    LC_ALL=$LOCALE.UTF-8 locale -ck date_fmt
    LC_ALL=$LOCALE.UTF-8 date
    LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt
    LC_ALL=$LOCALE.UTF-8 sort < sorting-test-input.txt
These commands prepare a user directory for testing a locale, compile and install the locale, show the contents of a category and the value of a certain keyword, and run commands which use different categories. Note that if the locale uses a new country, currency, or language code (see the LC_MONETARY and LC_ADDRESS sections above for details), then it needs to be compiled with the localedef utility built in the glibc build directory.
If you have set up a glibc build environment, you can mass-test compilation of all locales by installing to the locale archive and then installing the individual files into their respective directories (this exercises both kinds of installs):
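The two input files used by the iconv and sort commands are not described further, so the following sketch shows one way to create small samples. The file names match the example, while the contents and the temporary directory are arbitrary choices made for this illustration.

```shell
# Create small sample inputs for the iconv and sort test commands.
# Octal escapes spell out the UTF-8 bytes so this script stays ASCII:
# \303\244 is U+00E4 (a with diaeresis), \303\266 is U+00F6 (o with diaeresis).
workdir=$(mktemp -d)
printf 'p\303\244iv\303\244\ny\303\266\n' > "$workdir/translit-test-input.txt"
printf 'Zebra\napple\n\303\244iti\nomena\n' > "$workdir/sorting-test-input.txt"

# Baseline for comparison: the C locale sorts by byte value, so the
# capitalized word comes first and the non-ASCII word comes last.
LC_ALL=C sort < "$workdir/sorting-test-input.txt"
```

Comparing this baseline with the output produced under the locale being tested is a quick way to see whether its collation rules take effect; a language-aware locale would normally not place Zebra before apple.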
    make localedata/install-locales DESTDIR=<PATH>
    make localedata/install-locale-files DESTDIR=<PATH>
The data in the locale-archive file will be used preferentially. If you want to test the installed files you need to remove the locale-archive or install the files in an alternate location and use LOCPATH to locate them. Lastly, note that invoking <prefix>/bin/localedef manually will of course install to the configured --prefix path (unless you use an absolute path as the output).
See localedata/README and localedef(1) for more information about localedef.
6.4. Contributing
See Contribution checklist for complete contributing instructions.
When contributing locale updates, always try to get in touch with the locale maintainer first; if this is unsuccessful, describe the changes you have made, and (this is important) provide some evidence that they reflect common usage - e.g. local government or major newspaper sites, references to language norms, etc.
Please test your changes before submitting them; see above for testing instructions.
6.5. Miscellaneous Information
6.5.1. Charsets
The iconv conversion internally always works by converting from the source charset to UCS-4 and then from UCS-4 to the target charset. This implies that the charset modules need to implement only the to/from Unicode mapping, and that characters not in Unicode are not convertible (luckily, this currently seems to be the case only for a few obscure ancient kanji characters).
Most of the charsets are simple (single-byte with direct 1-1 Unicode mapping). The .c files for these are trivial, depending on data provided by .h files that are autogenerated by iconvdata/gen-8bit.sh from the localedata/charmaps/ files at build time.
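The two-step route can be imitated from the command line by naming UCS-4 explicitly as an intermediate encoding. This is only an illustration of the idea using the iconv utility, not a view into the library internals; the function name convert_via_ucs4 is made up, and the sketch assumes an iconv program that knows the UCS-4BE and ISO-8859-1 charset names.

```shell
# Convert stdin in two explicit steps with UCS-4 as the pivot encoding.
# convert_via_ucs4 is a hypothetical helper: $1 = source, $2 = target.
convert_via_ucs4() {
    iconv -f "$1" -t UCS-4BE | iconv -f UCS-4BE -t "$2"
}

if command -v iconv >/dev/null 2>&1; then
    # \351 is the ISO-8859-1 byte for U+00E9 (e with acute accent).
    direct=$(printf 'caf\351\n' | iconv -f ISO-8859-1 -t UTF-8)
    pivoted=$(printf 'caf\351\n' | convert_via_ucs4 ISO-8859-1 UTF-8)
    if [ "$direct" = "$pivoted" ]; then
        echo "direct and two-step conversion agree"
    else
        echo "direct and two-step conversion differ"
    fi
fi
```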
6.5.2. Transliteration
LC_CTYPE specifies the rules for transliterating characters from one encoding to another; see the LC_CTYPE section above for more details.
7. References
https://www.gnu.org/software/libc/manual/html_node/Locales.html
http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
http://www.open-std.org/jtc1/SC22/WG20/docs/n972-14652ft.pdf
https://github.com/googlei18n/libaddressinput/wiki/AddressValidationMetadata [https://i18napis.appspot.com/address]