Locales in GLIBC

Introduction

This page provides an introduction to locales in general and a more detailed description of how to use locales (both as a user and as a developer) and how to create and update actual locale definition files with the GNU C Library.

Overview

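In short, locales are collections of language- and country-specific conventions that allow software to adapt to the user's preferences.

Each locale is divided into the following categories, each specifying a related set of conventions that the user can select individually: LC_IDENTIFICATION, LC_CTYPE, LC_COLLATE, LC_TIME, LC_NUMERIC, LC_MONETARY, LC_MESSAGES, LC_PAPER, LC_MEASUREMENT, LC_NAME, LC_ADDRESS, and LC_TELEPHONE.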

The locales of the GNU C Library are named using the pattern ll_CC.SSS, where ll refers to the language, CC refers to the country, and SSS refers to the character set to use. For example, the English language as used in Canada with the UTF-8 character set is referred to as en_CA.UTF-8. (For more information on character sets, please see http://www.joelonsoftware.com/articles/Unicode.html.)
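
The naming pattern can be taken apart with plain shell parameter expansion. The following is only a sketch: the helper name split_locale is hypothetical, and it assumes the name follows the ll_CC.SSS pattern exactly (names with additional parts are not handled).

```shell
# Split a locale name of the form ll_CC.SSS into its three parts.
# split_locale is a hypothetical helper, not part of glibc.
split_locale() {
    sl_name=$1
    sl_lang=${sl_name%%_*}      # text before the first "_"
    sl_rest=${sl_name#*_}       # text after the first "_"
    sl_country=${sl_rest%%.*}   # text before the first "."
    sl_charset=${sl_rest#*.}    # text after the first "."
    printf 'language=%s country=%s charset=%s\n' \
        "$sl_lang" "$sl_country" "$sl_charset"
}

split_locale en_CA.UTF-8
```

For the example name above this prints language=en country=CA charset=UTF-8.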

Using Locales

A user wishing to use one or more of the above categories can enable them by setting the corresponding environment variables, for example in the ~/.i18n file, which many systems read automatically.

Two additional environment variables are considered when selecting locales: LANG and LC_ALL. LANG sets the default locale for all categories, which can then be overridden for individual categories by setting their own variables. LC_ALL forces the locale for all categories, and the selection cannot be overridden. LC_ALL is recommended mainly for scripts (for example, to make sure characters and collation behave as the developer expected regardless of the user's preferences), while LANG should in general be preferred for the flexibility it provides.

The following example sets the default locale to Mexican Spanish, then sets the collation rules to those of the standard C locale, and finally sets monetary formatting to follow US conventions.
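
As a small illustration of the script use case, the sketch below forces the C locale for a single sort invocation so that the result does not depend on the caller's environment. The input data is made up for the example.

```shell
# Force the C locale for one command only, so the ordering is byte-wise
# and independent of the caller's LANG or LC_* settings.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

In the C locale all uppercase ASCII letters sort before the lowercase ones, so the output lines are A, B, a, b.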

  LANG=es_MX.UTF-8
  LC_COLLATE=C
  LC_MONETARY=en_US.UTF-8

Since LANG is defined above but LC_MESSAGES is left undefined, message translations will use Mexican Spanish. The same logic applies to the other undefined categories, such as LC_TIME.
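
The selection order described above can be sketched as a small shell function. The name effective_locale is hypothetical, and the sketch only inspects environment variables: it does not check whether the named locale is actually installed. The final fallback to C and the treatment of empty values as unset are assumptions of this sketch.

```shell
# effective_locale CATEGORY: print the locale name that applies to CATEGORY.
# Order: LC_ALL, then the category's own variable, then LANG, then "C".
# Empty values are treated like unset ones (the ":-" expansions).
effective_locale() {
    eval "el_cat=\${$1:-}"
    printf '%s\n' "${LC_ALL:-${el_cat:-${LANG:-C}}}"
}

(
    # Reproduce the example settings in a subshell so that the
    # caller's environment is left untouched.
    unset LC_ALL LC_MESSAGES LC_TIME
    LANG=es_MX.UTF-8
    LC_COLLATE=C
    LC_MONETARY=en_US.UTF-8
    for category in LC_COLLATE LC_MONETARY LC_MESSAGES LC_TIME; do
        printf '%s=%s\n' "$category" "$(effective_locale "$category")"
    done
)
```

With the example settings, LC_COLLATE and LC_MONETARY report their own values, while LC_MESSAGES and LC_TIME fall back to es_MX.UTF-8 from LANG.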

Developing with Locales

The online GNU C Library manual provides a good starting point for developers creating applications supporting locales. At least the following pages are relevant:

The Linux manual pages were greatly improved during 2014 and are now almost complete; the following pages serve as a good starting point:

Application developers should note that while some of the resulting strings (like int_curr_symbol of LC_MONETARY) are required to be of a certain length, other resulting strings may vary in length, especially between locales (see this and this email for examples). Thus, testing applications with several locales is recommended to make sure that strings of different lengths do not cause inconsistencies in the user experience.
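
One rough way to see how much such strings vary is to measure the longest entry of a list keyword. The helper below is a hypothetical sketch; it assumes the input is a single semicolon-separated list, and the sample list is typed in by hand rather than read from a locale.

```shell
# longest_field: read one semicolon-separated list from standard input and
# print the length and text of its longest entry. Hypothetical helper.
# Note: awk's length() may count bytes rather than characters, depending
# on the awk implementation and the active locale.
longest_field() {
    tr ';' '\n' | awk '
        length($0) > max { max = length($0); text = $0 }
        END { printf "%d %s\n", max, text }'
}

printf 'January;February;March;April;May;June;July;August;September;October;November;December\n' | longest_field
```

On systems where the locale utility prints list keywords in this shape, the same helper could be fed from it for each locale of interest, for example:

  LC_ALL=fi_FI.UTF-8 locale mon | longest_field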

GLIBC Locale Internals

The locale API and the definitions of concrete locales form a rather self-contained part of the GNU C Library; related to this is the subsystem that deals with various charsets and converts between them.

Creating and Updating Locale Data

Please keep the following in mind before starting to work on glibc locales. If in doubt, send a question to the libc-alpha or libc-locales mailing lists.

Qualifications

The glibc maintainers cannot easily judge on their own if your new version is correct.

The locale definitions are in a specific file format, which is described below and in the manual pages; please read both for a complete understanding.

Locale File Format

The locale definitions are in a specific file format; the most relevant notes can be found in the locale(5) manual page. POSIX also describes the format and some fields, but not all that are commonly used in glibc (e.g. week start definitions). All strings in the glibc locale files use Unicode entity specifications instead of plain characters; to quickly inspect such a file when working with a locale, run gcc -o show-ucs-data localedata/show-ucs-data.c (no build preparation is needed for this, not even ./configure) and then ./show-ucs-data localedata/locales/en_US.

One additional resource describing locale categories and category members is the ISO/IEC TR 14652:2002(E) Technical Report (PDF). However, some members are described below to address glibc-specific requirements and formatting issues. In case of doubt, please refer to the manual pages, look at other glibc locales for examples, or send a question to the libc-alpha or libc-locales mailing lists.

Comments

You should provide plenty of comments in the locale file, both about the individual members of each category and also any relevant references. If something is left undefined on purpose, the reason should be stated.

LC_IDENTIFICATION

This category is largely self-explanatory. You should be able to fill in this category by using other locales as examples.

LC_CTYPE

This category deals with character sets and transliteration rules, both most often based on Unicode standards. It defines, for example, which characters are considered alphabetic and how to transliterate characters and text from one encoding to another. See other locales for examples of how to implement these rules. See the Testing Locales section below for tips on how to verify the implementation. Note also bug 14094 and bug 16061.

LC_COLLATE

For most countries and languages there are official guidelines and standards on collation rules. Often these are based on the well-known ISO 14651 standard or on the Unicode collation algorithm. See other locales for examples of how to implement the required collation rules. See the Testing Locales section below for tips on how to verify that the implementation matches the relevant standards. Note also bug 14095.

LC_TIME

This is one of the most often used categories. The non-obvious members of this category are as follows:

Furthermore, there is the question of the abday and day keywords and which day of the week the lists should start with. The specifications say Sunday, but they do not mention any of the week start specifiers. Applications that are aware of these specifiers tend to interpret the abday and day lists in a more complicated way.

The tricky part is how to reconcile the information from WEEKSTARTDATE and first_weekday. PetrBaudis wrote some lengthy treatises about this on libc-locales; we present the outcome and thus our de facto current interpretation:

Thus, for example, the en_GB definition (an English locale with the week starting on Monday) is:

  week          7;19971130;4
  first_weekday 2
  first_workday 2
  day           "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
  abday         "Sun;Mon;Tue;Wed;Thu;Fri;Sat"

When your locale is compiled, you can use a simple first_weekday test tool to check the day definitions are correct.
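
The arithmetic behind the example can be sketched as follows. This is an illustrative assumption that agrees with the en_GB values above, not a statement of how any particular application works: the reference date in the week keyword names the day that the day list starts with, and first_weekday is a 1-based index into that list. The function name week_start_index is hypothetical.

```shell
# week_start_index WEEK_1STDAY FIRST_WEEKDAY
# Prints 0 for Sunday through 6 for Saturday. Only the two reference
# dates below are recognised in this sketch.
week_start_index() {
    case $1 in
        19971130) ws_base=0 ;;   # 30 November 1997 was a Sunday
        19971201) ws_base=1 ;;   # 1 December 1997 was a Monday
        *) echo "unsupported week start date: $1" >&2; return 1 ;;
    esac
    echo $(( (ws_base + $2 - 1) % 7 ))
}

week_start_index 19971130 2   # values from the en_GB example
```

For the en_GB values this prints 1, that is, Monday.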

LC_NUMERIC

This category has only three members. See the manual page, POSIX, and ISO TR 14652 references and e.g. http://h71000.www7.hp.com/doc/73final/6494/6494pro_003.html for more information on grouping.
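
To make the grouping member more concrete, the sketch below applies a grouping rule to a string of digits, following the description in locale(5): the first number is the size of the group nearest the decimal point, the last size is repeated for the remaining digits, and -1 stops further grouping. The helper name group_digits and its argument order are hypothetical.

```shell
# group_digits DIGITS RULE SEPARATOR
# RULE is a semicolon-separated list such as "3;3", "3;2" or "3;-1".
group_digits() {
    awk -v digits="$1" -v rule="$2" -v sep="$3" 'BEGIN {
        n = split(rule, size, ";")
        out = ""; i = 1; width = 0
        while (length(digits) > 0) {
            # Past the end of the rule, keep reusing the last size.
            if (i <= n) width = size[i++] + 0
            if (width <= 0 || length(digits) <= width) {
                # -1 (or an empty rule) ends grouping; emit the rest as is.
                out = digits (out == "" ? "" : sep out)
                break
            }
            out = substr(digits, length(digits) - width + 1) (out == "" ? "" : sep out)
            digits = substr(digits, 1, length(digits) - width)
        }
        print out
    }'
}

group_digits 1234567 '3;3' ','
group_digits 1234567 '3;2' ','
group_digits 1234567 '3;-1' ','
```

The three calls print 1,234,567 and 12,34,567 and 1234,567 respectively.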

LC_MONETARY

This category is well described in the manual page, POSIX, and ISO TR 14652 references.

LC_MESSAGES

This category defines regular expressions to be accepted as positive or negative response and equivalents of yes and no.
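
A script can use these expressions to interpret a typed answer. The following is a minimal sketch: the helper name is_affirmative is hypothetical, the expression is treated as an extended regular expression, and the fallback value '^[yY]' is an assumption used only when the locale utility is unavailable or prints nothing.

```shell
# Read the affirmative-response expression of the current locale, with an
# assumed fallback when the locale utility is missing or prints nothing.
yesexpr=$(locale yesexpr 2>/dev/null) || yesexpr=
[ -n "$yesexpr" ] || yesexpr='^[yY]'

# is_affirmative ANSWER [EXPR]: succeed when ANSWER matches EXPR
# (default: the expression read above). Hypothetical helper.
is_affirmative() {
    printf '%s\n' "$1" | grep -Eq -e "${2:-$yesexpr}"
}

if is_affirmative "Yes" '^[yY]'; then
    echo "accepted"
else
    echo "rejected"
fi
```

Passing the expression explicitly, as in the call above, keeps the check independent of the environment; omitting it uses whatever the active locale defines.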

In yesstr and nostr, the first letter should be capitalized or lowercase according to the canonical form of the word. This is in line with other specifications in glibc locales, where e.g. day names and month names follow the use of lower and upper case prescribed in dictionaries, which record the canonical form of a word or phrase. This is also in accordance with usage in POSIX and ISO TR 14652, and with general guidelines for definitions in ISO/IEC.

LC_PAPER

Here A4 is 297x210 (height 297 mm, width 210 mm) and US Letter is 279x216.

LC_MEASUREMENT

Here the measurement member takes the value 1 for metric or 2 for US units.

LC_NAME

This category is well explained in the manual page, POSIX, and ISO TR 14652.

name_fmt should always be defined; other members should be defined only if they are commonly used.

LC_ADDRESS

This category is explained in the manual page, POSIX, and ISO TR 14652. The following notes apply to glibc locales:

Applications should prefer lang_term over lang_lib. There are 20 specific ISO 639-2/B codes; both ISO 639-2/T and ISO 639-2/B codes are listed at http://www.loc.gov/standards/iso639-2/langhome.html.

LC_TELEPHONE

This category is well explained in the manual page, POSIX, and ISO TR 14652.

Testing Locales

After modifying a locale, make sure it compiles, and install it to a temporary directory for testing. The following example is run from the root of the glibc source tree:

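  # Compile one locale from the source tree into a private directory
  LOCALE=fi_FI
  export LOCPATH="$HOME/locale-test"
  mkdir -p "$LOCPATH"
  I18NPATH=./localedata/ localedef -f UTF-8 -i "$LOCALE" "$LOCPATH/$LOCALE.UTF-8"
  # Show a whole category, then a single keyword
  LC_ALL=$LOCALE.UTF-8 locale -ck LC_TIME
  LC_ALL=$LOCALE.UTF-8 locale -ck date_fmt
  # Run commands that exercise time formatting, transliteration and collation
  LC_ALL=$LOCALE.UTF-8 date
  LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt
  LC_ALL=$LOCALE.UTF-8 sort < sorting-test-input.txt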

These commands prepare a user directory for testing a locale, compile and install the locale, show the contents of a category and the value of a certain keyword, and run commands that use different categories.

If you have set up a glibc compilation environment, you can mass-test compilation of all locales:

(Invoking localedef manually from the bin directory under the configure --prefix will, of course, install to that prefix.)
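
Collation can also be checked mechanically instead of by eye. The sketch below assumes you have a file that is already in the order the relevant standard prescribes; sort -c then only verifies that order under the given locale. The helper name is_sorted_in and the temporary file are hypothetical, and the C locale is used here only so that the example runs without any extra locale installed.

```shell
# is_sorted_in LOCALE FILE: succeed when FILE is already in the order
# produced by LOCALE's collation rules ("sort -c" checks, it does not sort).
is_sorted_in() {
    LC_ALL=$1 sort -c "$2" 2>/dev/null
}

expected=$(mktemp)
printf 'A\nB\na\nb\n' > "$expected"    # order expected in the C locale
if is_sorted_in C "$expected"; then
    echo "collation matches the expected order"
else
    echo "collation differs from the expected order"
fi
rm -f "$expected"
```

With LOCPATH exported as in the commands above, the same check could be pointed at the locale under test, for example is_sorted_in fi_FI.UTF-8 followed by the name of your own expected-order file.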

See localedata/README and localedef(1) for more information about localedef.

Contributing

Contribute locale updates by filing bugs in the glibc Bugzilla. However, when contributing locale updates, always try to get in touch with the locale maintainer first; if this is unsuccessful, describe the changes you have made, and (this is important) provide evidence that they reflect common usage, e.g. local government or major newspaper sites, references to language norms, etc.

Please test your changes before submitting them; see above for testing instructions.

See Contribution checklist for complete contributing instructions.

Miscellaneous Information

Charsets

The iconv conversion internally always works by converting from the source charset to UCS-4 and then from UCS-4 to the target charset. This implies that the charset modules need to implement only the mapping to and from Unicode, and that characters not in Unicode are not convertible (luckily, this currently seems to be the case only for a few obscure ancient kanji characters).

Most of the charsets are simple (single-byte with a direct 1-1 Unicode mapping). The .c files for these are trivial; they depend on data provided by .h files, which are autogenerated by iconvdata/gen-8bit.sh from the localedata/charmaps/ files at build time.
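
The pivot through UCS-4 can be imitated from the command line by chaining two iconv invocations. This is only an external illustration of the idea, not a view into the library internals; it assumes the iconv and od utilities are available and that the ISO-8859-1, UCS-4BE and UTF-8 charset names are supported. The helper name hex_bytes is hypothetical.

```shell
# hex_bytes: print standard input as lowercase hex without separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \t\n'
}

# Byte 0xE4 (octal 344) is "a with diaeresis" in ISO-8859-1.
direct=$(printf '\344' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes)
pivot=$(printf '\344' | iconv -f ISO-8859-1 -t UCS-4BE | hex_bytes)
two_step=$(printf '\344' | iconv -f ISO-8859-1 -t UCS-4BE \
    | iconv -f UCS-4BE -t UTF-8 | hex_bytes)

echo "direct:   $direct"
echo "UCS-4:    $pivot"
echo "two-step: $two_step"
```

The direct and two-step results should be identical, which is the practical consequence of every charset module only needing a mapping to and from Unicode.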

None: Locales (last edited 2014-06-24 15:52:04 by MarkoMyllynen)