This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] Add fast path for strcoll and strcasecmp


On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
> Hi OndÅej,
> 
> as far as I understood, the current strcoll implementation scans
> both strings for collation sequences and compares the weights of
> them, whereby a collation sequence can be multiple bytes long. So
> whatever strcmp_l returns as index, you would need a general way of
> finding the start of the collation sequence this index is in.
> Unfortunately I cannot tell if or how this can be done.
> 
As I wrote below you do not have to do that. Just precompute a table that
is zero for characters that are part of some collation sequence and use
old method when one of compared characters is in that table.

>From performance perspective these are not problem as they should be
infrequent enough. Ignored ones are worse as they could make otherwise
identical long prefixes different.


> BTW I have implemented a benchmark for strcoll that is
> not-yet-pushed because I didn't manage to patch the bench-tests
> Makefile to generate additionally needed locales
> (https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html).
> 
> Leonhard
> 
> Am 23.11.2014 22:47, schrieb OndÅej BÃlka:
> >Hi,
> >
> >Now when I looked to strcoll improvement I recalled that strcoll(_l) and
> >strcasecmp could be made faster by following approach. We find first
> >position where strings differ and we will likely decide how they differ
> >by looking at few bytes.
> >
> >For that we need to do two things. First is to determine where character
> >containing differing byte starts. That is easy to do for single byte
> >encodings, UTF and prefix-free encodings in general. Then we need to
> >decide if decision can be made by characters alone. For that we need to
> >autogenerate new weigth table for each locale. It is basically primary
> >weigth table but ignored characters and characters that are part of
> >sequences get zero. If safe weigths are equal or one is zero then we
> >call original function.
> >
> >A strcasecmp works in same way with appropriate collation table.
> >
> >To determine performance gain I made a simple incorrect implementation
> >where I ignored calculating safe weigths. Another optimization is to use
> >strcmp_l optimized assembly which is easy to derive from strcmp by
> >returning index instead how these characters differ.
> >
> >A sample implementation is here, comments?
> >
> >
> >#include <locale.h>
> >#include <xlocale.h>
> >#include <string.h>
> >#include <stdlib.h>
> >
> >static size_t
> >strcmp_l (const char *a, const char *b)
> >{
> >  size_t size = 0;
> >  while (a[size] != b[size] && a[size] != '\000')
> >    size++;
> >  return size;
> >}
> >
> >char safe_characters[256];
> >
> >int (*rewind_weigth_p)(const char *);
> >
> >
> >int rewind_weigth(const char *p)
> >{
> >   return safe_characters[(unsigned char) *p];
> >}
> >
> >static void __attribute__ ((constructor)) init ()
> >{
> >   int i;
> >   rewind_weigth_p = rewind_weigth;
> >   for (i=0; i < 256; i++)
> >     safe_characters[i] = i + 1;
> >}
> >
> >
> >int strcoll (const char *a, const char *b)
> >{
> >   size_t dif = strcmp_l (a, b);
> >
> >   if (a[dif] == '\000')
> >     return 0;
> >
> >   int a_weigth = rewind_weigth_p (a + dif);
> >
> >   int b_weigth = rewind_weigth_p (b + dif);
> >
> >   if (a_weigth && b_weigth && a_weigth != b_weigth)
> >     return a_weigth - b_weigth;
> >   else
> >     return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0));
> >}
> >

-- 

runaway cat on system.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]