There is a mismatch between strcasecmp and toupper/tolower in the tr_TR.iso88599 locale:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <ctype.h>
#include <strings.h>

int main (void)
{
  int i, j, k;
  char *infs[] = { "INF", "inf" };

  if (setlocale (LC_ALL, "") == NULL)
    {
      fprintf (stderr, "locale-test: can't set locales\n");
      exit (EXIT_FAILURE);
    }

  for (i = 0; i < 2; i++)
    for (j = 0; j < 4; j++)
      {
        char s[4];

        for (k = 0; k < 3; k++)
          {
            s[k] = infs[i][k];
            if (j > k)
              s[k] = (i ? toupper : tolower)(s[k]);
          }
        s[3] = '\0';
        printf ("%d%d %s\n", !strcasecmp (s, "INF"),
                !strcasecmp (s, "inf"), s);
      }

  return 0;
}

gives:

11 INF
00 ıNF
00 ınF
00 ınf
11 inf
00 İnf
00 İNf
00 İNF

Since the modifications of the string have been done with toupper and tolower, I would have expected 11 everywhere.

Tested on Debian/unstable (amd64). Corresponding bug report:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=716775
All that POSIX says on this subject is

  When the LC_CTYPE category of the current locale is from the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

I guess that makes it a matter of taste what the results should be. :(

I would have expected the following:

10 INF
10 ıNF
10 ınF
10 ınf
01 inf
01 İnf
01 İNf
01 İNF

That's because i and ı are different letters in Turkish, whose capitalized equivalents are İ and I.

Is there some standard that makes sense of this stuff?
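For illustration, the "converted to lowercase and then a byte comparison" wording can be sketched as a per-byte comparison through tolower. This is only a sketch: ref_strcasecmp is a hypothetical name, not glibc's implementation, and POSIX only requires this behavior in the POSIX locale. Assuming the locale's tolower maps I to ı and İ to i (as the test output suggests), applying the same rule in tr_TR.iso88599 would give the expected table above.

```c
#include <ctype.h>

/* Hypothetical helper, not glibc's code: compare two strings as if each
   byte had first been lowercased with the current locale's tolower().  */
int
ref_strcasecmp (const char *a, const char *b)
{
  /* Go through unsigned char so that bytes >= 0x80 are valid
     arguments for tolower().  */
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;

  for (;; p++, q++)
    {
      int ca = tolower (*p);
      int cb = tolower (*q);

      if (ca != cb)
        return ca - cb;   /* First differing byte decides the order.  */
      if (ca == '\0')
        return 0;         /* Both strings ended together: equal.  */
    }
}
```

Because the comparison goes through tolower, it cannot disagree with tolower about which bytes are case variants of each other in the selected locale.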
(In reply to Jonathan Nieder from comment #1)
> I would have expected the following:
>
> 10 INF
> 10 ıNF
> 10 ınF
> 10 ınf
> 01 inf
> 01 İnf
> 01 İNf
> 01 İNF
>
> That's because i and ı are different letters in Turkish, whose
> capitalized equivalents are İ and I.

Yes, you're right. I got confused by the fact that strcasecmp currently regards the ASCII letters i and I as being the same, but it shouldn't.
(In reply to Jonathan Nieder from comment #1)
> Is there some standard that makes sense of this stuff?

Unicode actually specifies several forms of case-insensitive (caseless) matching. It is more complex than POSIX's minimal requirements, as normalization should be used (but this makes sense and may be preferable IMHO). Otherwise, default caseless matching could be chosen. See:

http://www.unicode.org/reports/tr21/
I forgot that POSIX specifies the behavior only in the POSIX locale (for LC_CTYPE). Now, this is still a bug in non-POSIX locales as the glibc manual says:

  This function is like 'strcmp', except that differences in case are ignored. How uppercase and lowercase characters are related is determined by the currently selected locale. In the standard '"C"' locale the characters Ä and ä do not match but in a locale which regards these characters as parts of the alphabet they do match.

and in my example, the use of toupper/tolower shows how these characters are related in the selected locale.
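To look for this kind of disagreement beyond the letters of "INF", one could walk over every single-byte character. The following is a minimal sketch, assuming a single-byte locale has already been selected with setlocale; count_case_mismatches is a hypothetical name. It counts the bytes that toupper changes but that strcasecmp does not treat as equal to their uppercase form.

```c
#include <ctype.h>
#include <strings.h>

/* Hypothetical helper: count single-byte characters c for which
   toupper (c) differs from c, yet strcasecmp() reports the two
   one-character strings as different.  */
int
count_case_mismatches (void)
{
  int mismatches = 0;

  for (int c = 1; c < 256; c++)
    {
      int u = toupper (c);

      /* Skip bytes without a distinct single-byte uppercase form.  */
      if (u == c || u <= 0 || u > 255)
        continue;

      char lower[2] = { (char) c, '\0' };
      char upper[2] = { (char) u, '\0' };

      if (strcasecmp (lower, upper) != 0)
        mismatches++;
    }

  return mismatches;
}
```

In a locale where strcasecmp and toupper agree on how characters are related, this returns 0. The check is deliberately one-directional (it only follows toupper), so it is a quick consistency probe rather than a full specification of what strcasecmp should do.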
This bug is x86-specific. The C implementation does not suffer from this.
Fixed in 45b8acc for 2.19.