Differences between revisions 6 and 7
Revision 6 as of 2013-05-04 08:50:24
Size: 5853
Editor: KeldSimonsen
Comment: Add yesstr/nostr
Revision 7 as of 2013-05-04 08:52:05
Size: 5854
Editor: KeldSimonsen
Comment:

Locales in GLIBC

Background

For general background on character sets and conversions, please read this first: http://www.joelonsoftware.com/articles/Unicode.html

Overview

One large and relatively independent part of glibc is the locale API and definitions of concrete locales; related to this is the subsystem dealing with various charsets and converting between them.

  • locale/ directory contains the source for the locale API and support tools (localedef, locale, ...).

  • localedata/ directory (semi-independent, with its own README and ChangeLog) contains the localedata/locales/ definitions of the set of locales available on a default GNU system, the localedata/charmaps/ character maps for the available charsets, a test suite, and a few helper files.

  • iconv/ directory contains the source for the iconv charset conversion API and duct-tape for gconv modules implementing concrete charsets.

  • iconvdata/ directory contains the modules for the concrete charsets themselves.

Locale data

The locale definitions are in a specific file format; some notes on it can be found in the locale(5) manpage, but they are sketchy at best. POSIX also describes the format and some fields, but not all of those commonly used in glibc (e.g. the week start definitions).

All strings in the file use Unicode entity specifications instead of plain characters. To quickly inspect a file, compile the helper with gcc -o show-ucs-data localedata/show-ucs-data.c (no build preparation is needed for this, not even ./configure) and then run ./show-ucs-data localedata/locales/en_US.
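To illustrate what these entity specifications look like, here is a minimal Python sketch that expands tokens of the form <UXXXX> into the characters they name. It is not the show-ucs-data helper itself; the function name decode_ucs_names and the sample line are made up for illustration, and the sketch assumes every token holds a valid hexadecimal code point.

```python
import re

# Matches symbolic code points such as <U0041> (4 to 8 hex digits).
_UCS_NAME = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode_ucs_names(text):
    """Replace every <UXXXX> token in text with the character it names.

    Hypothetical helper for illustration only; text outside the tokens
    is passed through unchanged.
    """
    return _UCS_NAME.sub(lambda m: chr(int(m.group(1), 16)), text)


# Illustrative input in the style of a locale source line.
line = 'abday "<U0053><U0075><U006E>";"<U004D><U006F><U006E>"'
print(decode_ucs_names(line))
```

Running it on a whole locale file is a matter of applying decode_ucs_names to each line read from the file.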

Externally available locale data, such as ICU (International Components for Unicode), can be useful for cross-referencing.

Contributing

Contribute locale updates in the form of bugs in the glibc bugzilla. However, when contributing locale updates, always try to get in touch with the locale maintainer first. If this is unsuccessful, describe the changes you have made and (this is important) provide some evidence that they reflect common usage - e.g. local government or major newspaper sites, references to language norms, etc.

To test your patch or new locale file, use the localedef command; please refer to the Testing Locales section of the Contribution checklist.

Week start

More and more applications rely on locale data when displaying calendar views, etc. - in some locales the first column should be Monday, in others Sunday is appropriate. Unfortunately, this data is currently not in very good shape, partly because of the confusing way the week start is declared.

There are three keywords for the LC_TIME section related to this:

  • week DAYSINWEEK;WEEKSTARTDATE;MINWEEKLEN - DAYSINWEEK is usually 7; MINWEEKLEN is the minimal length of the first week in the year (usually 4). WEEKSTARTDATE is the most confusing - it should be some date that corresponds to the beginning of a week. It is typically either 19971130 (Sunday) or 19971201 (Monday).

  • first_weekday N - number of the day of the week to be shown in the first column of a calendar.

  • first_workday N - number of the first working day in the week.

Furthermore, there is the question of the day keyword and which day of the week its list should start with. The specs say Sunday, but they do not mention any of the week start specifiers. Applications aware of these tend to interpret the day list in a more complicated way.

The tricky thing is how to reconcile the information from WEEKSTARTDATE and first_weekday. PetrBaudis wrote some lengthy treatises about this on libc-locales; we present the outcome, which is our de facto current interpretation:

  • WEEKSTARTDATE specifies the base of the day list

  • first_weekday specifies the offset of the first day-of-week in the day list

  • For compatibility reasons, all locales should set WEEKSTARTDATE to 19971130 (Sunday) and base the day list accordingly, and set first_weekday to 1 or 2 depending on whether their week actually starts on Sunday or Monday.

Thus, e.g., the en_GB definition (English locale with the week starting on Monday) is:

 week          7;19971130;4
 first_weekday 2
 first_workday 2
 day           "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
 abday         "Sun;Mon;Tue;Wed;Thu;Fri;Sat"

Once your locale is compiled, you can use a simple first_weekday test tool to check that the day definitions are correct.
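The interpretation described above can be sketched in a few lines of Python. This is only an illustration, not the test tool mentioned above: the function names are made up, and the day list is passed as a single semicolon-separated string, as in the en_GB excerpt.

```python
from datetime import date


def week_base_weekday(week):
    """Return the weekday of WEEKSTARTDATE: 0 = Sunday ... 6 = Saturday.

    week is the value of the `week` keyword, e.g. "7;19971130;4".
    """
    _days_in_week, start, _min_week_len = week.split(";")
    base = date(int(start[0:4]), int(start[4:6]), int(start[6:8]))
    # isoweekday() gives Monday = 1 ... Sunday = 7; fold Sunday to 0.
    return base.isoweekday() % 7


def first_column_day(first_weekday, day):
    """Return the day name shown in the first calendar column.

    first_weekday is the 1-based offset into the day list, whose first
    entry corresponds to the weekday of WEEKSTARTDATE.
    """
    names = day.split(";")
    return names[first_weekday - 1]


# Values taken from the en_GB excerpt above.
week = "7;19971130;4"
day = "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
print(week_base_weekday(week))    # base of the day list (0 means Sunday)
print(first_column_day(2, day))   # first calendar column
```

With WEEKSTARTDATE fixed to a Sunday, only first_weekday has to differ between locales whose week starts on Sunday and those whose week starts on Monday.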

Charsets

Internally, the iconv conversion always works by converting from the source charset to UCS-4 and then from UCS-4 to the target charset. This implies that the charset modules need to implement only the mapping to and from Unicode, and that characters not in Unicode are not convertible (luckily, this currently seems to be the case only for a few obscure ancient kanji characters).
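The two-step design can be illustrated with Python's built-in codecs. This is a sketch of the pivot idea only and does not use glibc's gconv modules; big-endian UTF-32 stands in for UCS-4 here, and the function name convert_via_ucs4 is made up for illustration.

```python
def convert_via_ucs4(data, source, target):
    """Convert bytes between two charsets using a 4-byte pivot encoding."""
    # Step 1: source charset -> pivot (4 bytes per character).
    pivot = data.decode(source).encode("utf-32-be")
    # Step 2: pivot -> target charset. A character the target cannot
    # represent raises UnicodeEncodeError.
    return pivot.decode("utf-32-be").encode(target)


latin1 = b"caf\xe9"  # "café" in ISO-8859-1
print(convert_via_ucs4(latin1, "iso-8859-1", "utf-8"))
```

Because every conversion goes through the pivot, adding a new charset requires only one pair of mappings rather than one per existing charset.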

Most of the charsets are simple (single-byte with direct 1-1 Unicode mapping). .c files for these are trivial, depending on data provided by .h files, autogenerated by iconvdata/gen-8bit.sh from localedata/charmaps/ files at build time.
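For a single-byte charset, the whole conversion reduces to a pair of lookup tables. The following Python sketch builds such tables from lines written in the style of a charmap file. It is not what iconvdata/gen-8bit.sh actually produces; the function name build_tables is made up, and the sample lines are illustrative rather than copied from a real charmap.

```python
import re

# One single-byte entry, e.g. "<U20AC>     /xa4         EURO SIGN".
_ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+/x([0-9A-Fa-f]{2})(?:\s|$)")


def build_tables(charmap_lines):
    """Return (to_ucs, from_ucs) dictionaries for a single-byte charset.

    Lines that do not look like a single-byte entry are skipped.
    """
    to_ucs = {}
    for line in charmap_lines:
        m = _ENTRY.match(line)
        if m:
            to_ucs[int(m.group(2), 16)] = int(m.group(1), 16)
    # The mapping is 1-1, so the reverse table is a simple inversion.
    from_ucs = {ucs: byte for byte, ucs in to_ucs.items()}
    return to_ucs, from_ucs


# Illustrative sample entries (assumed, not read from a real file).
sample = [
    "<U0041>     /x41         LATIN CAPITAL LETTER A",
    "<U20AC>     /xa4         EURO SIGN",
]
to_ucs, from_ucs = build_tables(sample)
print(hex(to_ucs[0xA4]), hex(from_ucs[0x20AC]))
```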

yesstr/nostr

In yesstr and nostr, the first letter should be capitalized or in lowercase according to how the word is normally written. This is in line with other specifications in glibc locales, where e.g. day names and month names reflect the use of lower and upper case as prescribed in dictionaries, which record the canonical form of the word or phrase. This is also in accordance with use in POSIX and ISO TR 14652, and with the general guidelines for definitions in ISO/IEC.

None: Locales (last edited 2013-05-04 08:52:05 by KeldSimonsen)