The UTF-8 charset definition and the i18n locale data (which holds the default LC_CTYPE definitions used by almost all locales for toupper, tolower, isalpha, etc.) are quite outdated: both still use Unicode 3.2 data, and 3,868 new characters were added between Unicode 3.2 and Unicode 5.0. Attached are a patch for the UTF-8 file and an updated i18n file (a patch for the latter would be bigger than the file itself).
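To illustrate the kind of gap described above, here is a minimal sketch that finds code points unassigned in Unicode 3.2 but assigned in a newer database. It does not touch the glibc locale files. It assumes Python's standard `unicodedata` module, whose `ucd_3_2_0` object is a frozen Unicode 3.2 snapshot. The helper name `added_since_3_2` is hypothetical, and the "newer" database is whatever version the running interpreter ships (later than 5.0), so a count over the full range would not match the 3,868 figure.

```python
import unicodedata

# Frozen snapshot of the Unicode 3.2 database from the standard library.
old = unicodedata.ucd_3_2_0


def added_since_3_2(codepoints):
    """Return code points that are unassigned (category 'Cn') in Unicode 3.2
    but assigned in the interpreter's current Unicode database.

    Hypothetical helper for illustration only.
    """
    result = []
    for cp in codepoints:
        ch = chr(cp)
        if old.category(ch) == "Cn" and unicodedata.category(ch) != "Cn":
            result.append(cp)
    return result


# U+0041 predates Unicode 3.2; U+0221 (Unicode 4.0) and U+2C00 (Unicode 4.1)
# were added afterwards.
sample = [0x0041, 0x0221, 0x2C00]
for cp in added_since_3_2(sample):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

A character data set frozen at Unicode 3.2 has no entry for such code points, so classification functions like isalpha cannot report them as letters.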
Created attachment 1506: patch for the UTF-8 file, adding the characters newly defined as of Unicode 5.0
Created attachment 1507: updated i18n file, with the definitions updated for Unicode 5.0
Created attachment 1508: well, if you prefer a diff, here it is
Applied.