Bug 15736 - mismatch between strcasecmp and toupper/tolower in tr_TR.iso88599 locale
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: libc
Version: 2.17
Importance: P2 normal
Target Milestone: 2.19
Assignee: Not yet assigned to anyone
 
Reported: 2013-07-12 17:50 UTC by Vincent Lefèvre
Modified: 2021-08-19 15:34 UTC

Host: x86_64-*-*, i686-*-*
fweimer: security-


Description Vincent Lefèvre 2013-07-12 17:50:51 UTC
There is a mismatch between strcasecmp and toupper/tolower in the tr_TR.iso88599 locale:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <ctype.h>
#include <strings.h>

int main (void)
{
  int i, j, k;
  char *infs[] = { "INF", "inf" };

  if (setlocale (LC_ALL, "") == NULL)
    {
      fprintf (stderr, "locale-test: can't set locales\n");
      exit (EXIT_FAILURE);
    }

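  /* i selects the starting string: "INF" (i = 0) or "inf" (i = 1).
     j is the number of leading characters whose case is flipped. */
  for (i = 0; i < 2; i++)
    for (j = 0; j < 4; j++)
      {
        char s[4];
        for (k = 0; k < 3; k++)
          {
            s[k] = infs[i][k];
            /* Flip the case of the first j characters with the
               locale-aware ctype functions. */
            if (j > k)
              s[k] = (i ? toupper : tolower)(s[k]);
          }
        s[3] = '\0';
        /* Print 1 for each reference string that strcasecmp reports
           as equal to s, followed by s itself. */
        printf ("%d%d %s\n",
                !strcasecmp (s, "INF"), !strcasecmp (s, "inf"), s);
      }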

  return 0;
}

gives:

11 INF
00 ıNF
00 ınF
00 ınf
11 inf
00 İnf
00 İNf
00 İNF

Since the modifications of the string have been done with toupper and tolower, I would have expected 11 everywhere.

Tested on Debian/unstable (amd64). Corresponding bug report:
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=716775
Comment 1 Jonathan Nieder 2013-07-15 06:36:15 UTC
All that POSIX says on this subject is

        When the LC_CTYPE category of the current locale is from the
        POSIX locale, strcasecmp() and strncasecmp() shall behave as
        if the strings had been converted to lowercase and then a byte
        comparison performed. Otherwise, the results are unspecified.

I guess that makes it a matter of taste what the results should be. :(

I would have expected the following:

        10 INF
        10 ıNF
        10 ınF
        10 ınf
        01 inf
        01 İnf
        01 İNf
        01 İNF

That's because i and ı are different letters in Turkish, whose
capitalized equivalents are İ and I.
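
The pairing described above can be sketched as follows. This is only an illustration, not glibc code: the helper names tr_lower and tr_casecmp are hypothetical, and the byte values assume the ISO-8859-9 encoding (0xDD for İ, 0xFD for ı) and an ASCII-compatible execution character set.

```c
#include <stdio.h>

/* ISO-8859-9 byte values of the four Turkish "i" letters (assumed encoding). */
enum
{
  CAP_I = 0x49,           /* I, uppercase of dotless i */
  SMALL_I = 0x69,         /* i, lowercase of dotted I */
  CAP_DOTTED_I = 0xDD,    /* İ */
  SMALL_DOTLESS_I = 0xFD  /* ı */
};

/* Hypothetical helper: lowercase one byte using Turkish rules for the
   "i" letters and plain ASCII rules for A-Z; other bytes are unchanged. */
static unsigned char
tr_lower (unsigned char c)
{
  if (c == CAP_I)
    return SMALL_DOTLESS_I;
  if (c == CAP_DOTTED_I)
    return SMALL_I;
  if (c >= 'A' && c <= 'Z')
    return (unsigned char) (c - 'A' + 'a');
  return c;
}

/* Hypothetical helper: compare two strings after folding every byte
   with tr_lower.  Returns 0 when they match caselessly. */
static int
tr_casecmp (const char *a, const char *b)
{
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;

  for (;; p++, q++)
    {
      unsigned char x = tr_lower (*p), y = tr_lower (*q);
      if (x != y)
        return (int) x - (int) y;
      if (x == 0)
        return 0;
    }
}
```

With these rules, tr_casecmp ("INF", "\xFDnf") returns 0 while tr_casecmp ("INF", "inf") does not, which corresponds to the 10/01 pattern in the table above.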

Is there some standard that makes sense of this stuff?
Comment 2 Vincent Lefèvre 2013-07-15 08:17:29 UTC
(In reply to Jonathan Nieder from comment #1)
> I would have expected the following:
> 
>         10 INF
>         10 ıNF
>         10 ınF
>         10 ınf
>         01 inf
>         01 İnf
>         01 İNf
>         01 İNF
> 
> That's because i and ı are different letters in Turkish, whose
> capitalized equivalents are İ and I.

Yes, you're right. I got confused by the fact that strcasecmp currently regards the ASCII letters i and I as being the same, but it shouldn't.
Comment 3 Vincent Lefèvre 2013-07-15 09:00:10 UTC
(In reply to Jonathan Nieder from comment #1)
> Is there some standard that makes sense of this stuff?

Unicode actually specifies several forms of case-insensitive (caseless) matching. It is more complex than POSIX's minimal requirements because normalization should be used (but this makes sense and may be preferable IMHO). Otherwise, default caseless matching could be chosen. See:

  http://www.unicode.org/reports/tr21/
Comment 4 Vincent Lefèvre 2013-07-15 15:34:04 UTC
I forgot that POSIX specifies the behavior only in the POSIX locale (for LC_CTYPE).

Now, this is still a bug in non-POSIX locales as the glibc manual says:

     This function is like 'strcmp', except that differences in case are
     ignored.  How uppercase and lowercase characters are related is
     determined by the currently selected locale.  In the standard '"C"'
     locale the characters Ä and ä do not match but in a locale which
     regards these characters as parts of the alphabet they do match.

and in my example, the use of toupper/tolower shows how these characters are related in the selected locale.
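
The consistency expected here can be checked byte by byte. The following is a rough sketch, with the hypothetical helper name count_case_mismatches: it counts the single-byte characters that strcasecmp does not treat as equal to their own tolower or toupper image in the current locale. It assumes a single-byte encoding such as ISO-8859-9, and that a character should always match its case-converted form, which is the expectation in this report rather than something POSIX guarantees.

```c
#include <ctype.h>
#include <strings.h>

/* Hypothetical helper: count the byte values c (1..255) for which
   strcasecmp says that c differs from tolower (c) or from toupper (c)
   in the currently selected locale. */
static int
count_case_mismatches (void)
{
  int c, n = 0;

  for (c = 1; c < 256; c++)
    {
      /* One-character strings holding c and its two case images. */
      char orig[2] = { (char) c, '\0' };
      char low[2] = { (char) tolower (c), '\0' };
      char up[2] = { (char) toupper (c), '\0' };

      if (strcasecmp (orig, low) != 0 || strcasecmp (orig, up) != 0)
        n++;
    }
  return n;
}
```

Calling setlocale (LC_ALL, "") first and then printing the result of count_case_mismatches () would show whether the selected locale is affected; a nonzero count means strcasecmp and toupper/tolower disagree for at least one character.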
Comment 5 Andreas Schwab 2013-07-16 19:14:49 UTC
This bug is x86-specific.  The C implementation does not suffer from this.
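
For illustration, a generic comparison built on tolower might look like the sketch below. This is not glibc's actual source; the name generic_strcasecmp is hypothetical. Since every byte is folded with the locale's own tolower, two strings that differ only by an application of tolower compare equal by construction.

```c
#include <ctype.h>

/* Hypothetical sketch of a locale-aware caseless comparison: fold each
   byte with tolower from the current locale, then compare the results. */
static int
generic_strcasecmp (const char *a, const char *b)
{
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;
  int x, y;

  do
    {
      x = tolower (*p++);
      y = tolower (*q++);
    }
  while (x == y && x != 0);

  /* Zero if equal, otherwise the difference of the first folded bytes
     that do not match. */
  return x - y;
}
```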
Comment 6 Andreas Schwab 2013-08-27 10:27:18 UTC
Fixed in 45b8acc for 2.19.