Locales in GLIBC
This page provides an introduction to locales in general, a more detailed description of how to use locales (both as a user and as a developer), and instructions on how to create and update actual locale definition files with the GNU C Library.
Locales, in short, are collections of language- and country-specific conventions that allow software to adapt to the user's preferences.
Each locale has the following categories, each specifying a related set of conventions that the user can select:
- LC_IDENTIFICATION - this is not a user-visible category but contains information about the locale itself and is rarely useful for users or developers (it is listed here for completeness' sake).
- LC_CTYPE - this category applies to classification and conversion of characters and is relevant when dealing with non-ASCII characters.
- LC_COLLATE - this category applies to collation of strings and is relevant when processing strings or text in a certain language.
- LC_TIME - this category applies to formatting date and time values, for example whether to present the current time in 12 or 24 hour format.
- LC_NUMERIC - this category applies to formatting numeric values that are not monetary, for example the decimal separator.
- LC_MONETARY - this category applies to formatting monetary values, for example the currency symbol.
- LC_MESSAGES - this category applies to selecting the language used in the user interface for message translations and defines expressions for affirmative and negative responses. The software providing the user interface must provide the actual message translations for the selected language for LC_MESSAGES to have the desired effect.
- LC_PAPER - this category applies to selecting the paper size (A4 or US Letter).
- LC_MEASUREMENT - this category applies to selecting the measurement system (metric or US).
- LC_NAME - this category applies to formatting a person's name and title.
- LC_ADDRESS - this category applies to postal addresses and country and language names.
- LC_TELEPHONE - this category applies to formatting telephone numbers.
The locales of the GNU C Library are named using the pattern ll_CC.SSS, where ll refers to the language, CC to the country, and SSS to the character set to use. For example, the English language as used in Canada with the UTF-8 character set is referenced as en_CA.UTF-8. (For more information on character sets, please see http://www.joelonsoftware.com/articles/Unicode.html.)
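As a rough illustration of the naming pattern, the following sketch splits a locale name into its three parts using POSIX shell parameter expansion. The function name split_locale_name is hypothetical, and the sketch assumes the full ll_CC.SSS form; it does not handle names without a character set or with a modifier.

```shell
# Hypothetical helper: split a locale name of the form ll_CC.SSS
# into language, country and character set.
# Assumes the full form is given (no missing parts, no modifier).
split_locale_name() {
    name=$1
    charset=${name#*.}        # text after the first dot: SSS
    base=${name%%.*}          # text before the first dot: ll_CC
    language=${base%%_*}      # text before the underscore: ll
    country=${base#*_}        # text after the underscore: CC
    printf 'language=%s country=%s charset=%s\n' "$language" "$country" "$charset"
}

split_locale_name en_CA.UTF-8   # prints: language=en country=CA charset=UTF-8
```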
A user who wants to use one or more of the above categories can enable them by setting the corresponding environment variables, for example in the ~/.i18n file, which many systems read automatically.
There are two additional environment variables which are considered when selecting locales - LANG and LC_ALL. LANG sets the default locale for all categories, which can then be overridden by setting individual categories. LC_ALL forces the locale for all categories, and that selection cannot be overridden. Using LC_ALL is recommended mostly in scripts (for example, to make sure characters and collation are as the developer expected regardless of the user's preferences), while LANG should in general be preferred for the flexibility it provides.
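A minimal sketch of the script use case mentioned above: forcing the C locale for a single command so that sorting is byte-wise no matter what the user has configured. The wrapper name sort_bytewise is only illustrative.

```shell
# Illustrative wrapper: sort standard input with C-locale (byte order)
# collation, ignoring the user's LANG and LC_* settings.
sort_bytewise() {
    LC_ALL=C sort
}

# In the C locale, uppercase ASCII letters sort before lowercase ones,
# so this prints B, a, b on separate lines.
printf '%s\n' b a B | sort_bytewise
```

Because LC_ALL is set only for the one command, the rest of the script and the user's session keep their own locale settings.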
The following example sets the default locale to Mexican Spanish, then sets the collation rules to those of the standard C locale, and finally sets monetary formatting to follow US conventions.
    LANG=es_MX.UTF-8
    LC_COLLATE=C
    LC_MONETARY=en_US.UTF-8
Since LANG is defined above but LC_MESSAGES is left undefined, message translations will use Mexican Spanish. The same logic applies to the other undefined categories, such as LC_TIME.
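The selection order described here (LC_ALL first, then the category's own variable, then LANG) can be sketched as a small shell function. The name effective_locale is hypothetical, and the fallback to C when nothing is set is an assumption of this sketch rather than something stated above. It only inspects variables and does not check whether the named locale is installed.

```shell
# Hypothetical helper: report which locale name would apply to a category,
# following the order LC_ALL > LC_<category> > LANG > "C" (assumed default).
effective_locale() {
    category=$1
    # Indirect lookup of the variable named by $category (e.g. LC_TIME).
    # eval is acceptable here only because the argument is a fixed name.
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

# Demonstration in a subshell so the caller's settings are left untouched.
(
    unset LC_ALL LC_MESSAGES LC_COLLATE
    LANG=es_MX.UTF-8
    LC_COLLATE=C
    effective_locale LC_MESSAGES   # prints es_MX.UTF-8 (falls back to LANG)
    effective_locale LC_COLLATE    # prints C (the category variable wins)
)
```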
Developing with Locales
The online GNU C Library manual provides a good starting point for developers creating applications supporting locales. At least the following pages are relevant:
GLIBC Locale Internals
The locale API and the definitions of concrete locales form a fairly self-contained part of the GNU C Library; related to this is the subsystem dealing with various charsets and converting between them.
locale/ directory contains the source for the locale API and support tools (localedef, locale, ...).
localedata/ directory (semi-independent, with its own README and ChangeLog) contains the localedata/locales/ definitions of the set of locales available on a default GNU system, the localedata/charmaps/ character maps for available charsets, a testsuite and a few helper files.
iconv/ directory contains the source for the iconv charset conversion API and duct-tape for gconv modules implementing concrete charsets.
iconvdata/ directory contains the modules for the concrete charsets themselves.
Creating and Updating Locale Data
Please keep the following in mind before starting to work on glibc locales. If in doubt, send a question to the libc-alpha/libc-locales mailing lists.
The glibc maintainers cannot easily judge on their own if your new version is correct.
- If your proposed contribution is a non-trivial change, you should get an approval from the locale maintainer or original creator.
- In case they are unreachable or there is a dispute, you should cite authoritative references and/or multiple widely used websites (government pages or websites of major national newspapers are a good choice).
- If the change is not clear-cut, for example if you want to switch between two common usages, it is good to gather (and link to) feedback from the local user community.
If at all possible, the proposed changes should be aligned with the data in the Unicode Common Locale Data Repository, CLDR - http://cldr.unicode.org/.
The locale definitions are in a specific file format; it is described below but the documentation is not yet complete.
Locale File Format
The locale definitions are in a specific file format; some notes on it can be found in the locale(5) manual page, but they are sketchy at best. POSIX also describes the format and some fields, but not all that are commonly used in glibc (e.g. week start definitions). All strings in the glibc locale files use Unicode entity specifications instead of plain characters. To quickly inspect a locale file in readable form, compile the helper with gcc -o show-ucs-data localedata/show-ucs-data.c (no build preparation is needed for this, not even ./configure) and then run ./show-ucs-data localedata/locales/en_US.
One additional resource describing locale categories and category members is the ISO/IEC TR 14652:2002(E) Technical Report (PDF). However, some members are described below to address some glibc-specific requirements and formatting issues. In case of doubt, please refer to other glibc locales for examples or send a question to the libc-alpha/libc-locales mailing lists.
You should provide plenty of comments in the locale file, both about the individual members of each category and about any relevant references. If something is left undefined on purpose, the reason should be stated.
LC_IDENTIFICATION
This category is largely self-explanatory. You should be able to fill it in by using other locales as examples.
LC_CTYPE
This category deals with character sets and transliteration rules, both most often based on Unicode standards. It defines, for example, which characters are considered alphabetic and how to transliterate characters and text from one encoding to another. See other locales for inspiration on how to implement these rules, and see the Testing Locales section below for tips on how to verify the implementation. Note also bug 14094 (https://sourceware.org/bugzilla/show_bug.cgi?id=14094) and bug 16061 (https://sourceware.org/bugzilla/show_bug.cgi?id=16061).
LC_COLLATE
In many countries and languages there are official guidelines and standards on collation rules. Often these are based on the well-known ISO 14651 standard (http://en.wikipedia.org/wiki/ISO_14651) or on the Unicode collation algorithm (http://www.unicode.org/reports/tr10/). See other locales for inspiration on how to implement the required collation rules, and see the Testing Locales section below for tips on how to verify that the implementation matches the relevant standards. Note also bug 14095 (https://sourceware.org/bugzilla/show_bug.cgi?id=14095).
LC_TIME
This is one of the most frequently used categories. The non-obvious members of this category are as follows:
am_pm and t_fmt_ampm - should be left undefined if using 24 hour time
week DAYSINWEEK;WEEKSTARTDATE;MINWEEKLEN - DAYSINWEEK is usually 7; MINWEEKLEN is the minimal length of the first week of the year (usually 4). WEEKSTARTDATE is the most confusing - it should be some date that corresponds to the beginning of a week. It is typically either 19971130 (Sunday) or 19971201 (Monday).
first_weekday N - number of day in the week to be shown in the first column of a calendar.
first_workday N - number of the first working day in the week.
Furthermore, there is the question of the day keyword and which day of the week its list should start with. The specifications say Sunday, but they do not mention any of the week start specifiers. Applications aware of these tend to interpret the day list in a more complicated way.
The tricky part is how to reconcile the information from WEEKSTARTDATE and first_weekday. PetrBaudis wrote some lengthy treatises about this on libc-locales; we present the outcome, and thus our current de facto interpretation:
WEEKSTARTDATE specifies the base of the day list
first_weekday specifies the offset of the first day-of-week in the day list
- For compatibility reasons, all locales should set WEEKSTARTDATE to 19971130 (Sunday) and base the day list accordingly, and set first_weekday to 1 or 2 depending on whether their week actually starts on Sunday or Monday.
Thus, for example, the en_GB definition (an English locale with the week starting on Monday) is:
    week 7;19971130;4
    first_weekday 2
    first_workday 2
    day "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
    abday "Sun;Mon;Tue;Wed;Thu;Fri;Sat"
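To make the interpretation concrete, the sketch below treats first_weekday as a 1-based offset into the semicolon-separated day list, as described above. The function name first_day_name is hypothetical, and the day list is passed in as a plain string rather than read from a compiled locale.

```shell
# Hypothetical helper: given a semicolon-separated day list (based on Sunday)
# and a first_weekday value, print the day that starts the week.
first_day_name() {
    day_list=$1
    first_weekday=$2
    # cut fields are 1-based, matching the first_weekday numbering.
    printf '%s\n' "$day_list" | cut -d ';' -f "$first_weekday"
}

# With the en_GB values shown above, the week starts on Monday.
first_day_name "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday" 2   # prints Monday
```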
When your locale is compiled, you can use a simple first_weekday test tool to check the day definitions are correct.
LC_NUMERIC
This category has only three members. See the POSIX and ISO TR 14652 references and e.g. http://h71000.www7.hp.com/doc/73final/6494/6494pro_003.html for more information on grouping.
LC_MONETARY
This category is well described in POSIX and ISO TR 14652.
LC_MESSAGES
This category defines the regular expressions to be accepted as positive or negative responses, and the equivalents of Yes and No.
In yesstr and nostr, the first letter should be in upper or lower case according to the canonical form of the word. This is in line with other specifications in glibc locales, where e.g. day names and month names reflect the use of lower and upper case as prescribed in dictionaries, which record the canonical form of the word or phrase. This is also in accordance with use in POSIX and ISO TR 14652, and with the general guidelines for definitions in ISO/IEC.
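As an illustration of how the affirmative-response expression might be used, the sketch below matches an answer against a yesexpr value. The function name is_affirmative is hypothetical. By default it asks the locale utility for the current locale's yesexpr; the explicit pattern ^[yY] used in the example is an assumed, typical English-style value, not taken from any particular locale file.

```shell
# Hypothetical helper: succeed if the answer matches the affirmative
# expression. The second argument is optional; when omitted, the value
# is taken from the current locale via "locale yesexpr".
is_affirmative() {
    answer=$1
    yesexpr=${2:-$(locale yesexpr)}
    # yesexpr is an extended regular expression, hence grep -E.
    printf '%s\n' "$answer" | grep -Eq -- "$yesexpr"
}

# Example with an explicit, assumed pattern.
if is_affirmative "Yes" '^[yY]'; then
    echo "affirmative"
fi
```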
LC_PAPER
Here A4 is 297x210 mm and US Letter is 279x216 mm (height x width).
LC_MEASUREMENT
Here 1 means metric, 2 means US.
LC_NAME
This category is well explained in POSIX and ISO TR 14652.
name_fmt should always be defined; other members only if they are commonly used.
LC_ADDRESS
This category is explained in POSIX and ISO TR 14652. The following notes apply to glibc locales:
country_name, lang_name - country, language name in this language
country_num, country_isbn - numeric codes without quotes
lang_ab - two-letter code, if available (ultimately, this should be fixed to allow three-letter codes here as well) - e.g., de for de_DE
lang_term - three letter code - e.g., deu for de_DE
lang_lib - three letter code - e.g., ger for de_DE
LC_TELEPHONE
This category is well explained in POSIX and ISO TR 14652.
Testing Locales
After modifying a locale, make sure it still compiles, and install it to a temporary directory for testing:
    unset LC_ALL
    LOCALE=fi_FI
    export I18NPATH=$HOME/locale-test/
    export LOCPATH=$HOME/locale-test/
    mkdir -p $LOCPATH
    localedef --no-archive -f localedata/charmaps/UTF-8 -i localedata/locales/$LOCALE $I18NPATH/$LOCALE.UTF-8
    LANG=$LOCALE.UTF-8 locale -ck LC_TIME
    LC_TIME=$LOCALE.UTF-8 date
    LC_CTYPE=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt
    LC_COLLATE=$LOCALE.UTF-8 sort < sorting-test-input.txt
These commands prepare a user directory for testing a locale, compile and install the locale, show contents of a category, and run commands which use different categories.
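One way to turn the collation check into a repeatable test is to compare the sort output against a hand-prepared file in the expected order. The sketch below is only a suggestion: the function name check_collation and the file names are hypothetical, and the demonstration uses the C locale because it is always available.

```shell
# Hypothetical helper: succeed if sorting INPUT under the given locale's
# collation rules yields exactly the contents of EXPECTED.
check_collation() {
    locale_name=$1
    input=$2
    expected=$3
    # LC_ALL is emptied so that LC_COLLATE is actually honoured.
    LC_ALL= LC_COLLATE=$locale_name sort "$input" | cmp -s - "$expected"
}

# Demonstration with the C locale and throwaway files.
tmpdir=$(mktemp -d)
printf '%s\n' b a B > "$tmpdir/input.txt"
printf '%s\n' B a b > "$tmpdir/expected.txt"
if check_collation C "$tmpdir/input.txt" "$tmpdir/expected.txt"; then
    echo "collation matches"
else
    echo "collation differs"
fi
rm -r "$tmpdir"
```

For a real locale under test, the expected file would list the test strings in the order required by the relevant national standard, and the first argument would be the freshly compiled locale name with LOCPATH set as shown above.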
If you have set up a glibc compilation environment, you can mass-test the compilation of all locales:
- make localedata/install-locales install_root=(configure's --prefix)
(And invoking (configure's --prefix)/bin/localedef manually will of course install to the (configure --prefix).)
Contribute locale updates in the form of bugs in the glibc bugzilla. However, when contributing locale updates, always try to get in touch with the locale maintainer first; if this is unsuccessful, describe the changes you have made and (this is important) provide some evidence that they reflect common usage - e.g. local government or major newspaper sites, references to language norms, etc.
To test your patch or new locale file, use the localedef command; please refer to Contribution checklist#Testing_Locales.
See Contribution checklist for complete contributing instructions.
The iconv conversion internally always works by converting from the source charset to UCS-4 and then from UCS-4 to the target charset. This implies that the charset modules need to implement only the mapping to and from Unicode, and that characters not in Unicode are not convertible (luckily, this currently seems to be the case only for a few obscure ancient kanji characters).
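The two-stage conversion can be imitated from the command line by chaining two iconv invocations through UCS-4. This is only an external illustration of the idea, not a view into the library's internals; the function name is hypothetical, and it assumes the installed iconv knows the charset names ISO-8859-1, UCS-4BE and UTF-8.

```shell
# Hypothetical illustration: convert Latin-1 input to UTF-8 in two explicit
# steps, source charset -> UCS-4 -> target charset.
latin1_to_utf8_via_ucs4() {
    iconv -f ISO-8859-1 -t UCS-4BE | iconv -f UCS-4BE -t UTF-8
}

# Byte 0xE4 (octal 344) is "a with diaeresis" in Latin-1;
# its UTF-8 encoding is the two bytes c3 a4.
printf '\344' | latin1_to_utf8_via_ucs4 | od -An -tx1
```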
Most of the charsets are simple (single-byte with a direct 1-1 Unicode mapping). The .c files for these are trivial; they depend on data provided by .h files, which are autogenerated by iconvdata/gen-8bit.sh from the localedata/charmaps/ files at build time.