This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: strxfrm output stability
- From: Paul Eggert <eggert at cs dot ucla dot edu>
- To: Zack Weinberg <zackw at panix dot com>, libc-alpha at sourceware dot org
- Date: Wed, 9 Sep 2015 15:25:21 -0700
- Subject: Re: strxfrm output stability
- Authentication-results: sourceware.org; auth=none
- References: <55EF4F95 dot 4020703 at redhat dot com> <20150908211805 dot 36E5E2C3A73 at topped-with-meat dot com> <55EF529E dot 7070108 at redhat dot com> <55EF5494 dot 8030506 at cs dot ucla dot edu> <55F01BDC dot 70908 at redhat dot com> <55F06230 dot 8080003 at cs dot ucla dot edu> <55F0644C dot 8050703 at panix dot com> <55F082E8 dot 30006 at cs dot ucla dot edu> <55F08C57 dot 7030100 at panix dot com>
On 09/09/2015 12:45 PM, Zack Weinberg wrote:
On 09/09/2015 03:05 PM, Paul Eggert wrote:
When I tried a decade ago to use strxfrm in GNU sort to speed it up, I
ran into two problems. First, glibc strxfrm generates lengthy output,
and since I wanted only the first 8 bytes or so I got relatively little
information out of the prefix and overall it was a performance loss on
my benchmarks.
That's interesting, because as *I* understand it, that is exactly what
PostgreSQL is doing
Again, I looked at the 9.4.4 code, but it is doing something fancier and
I imagine 9.5 will too. First, it's computing strxfrm for a range of
keys, and if the strxfrm transformations of the minimum and maximum keys
agree in the first B bytes, it discards the length-B prefix of all the
strxfrm transformations. Second, the code looks at the individual bytes
of the strxfrm output, and if they're all (say) in the range 'A' through
'Z' it compresses each of them to ~4.7bits (log base 2 of 26).
PostgreSQL can do fun stuff like that because it uses floating-point
comparators.
they apparently got burnt a time or two by some C
libraries reporting strings as strcoll-identical when they aren't
byte-identical).
As I understand it POSIX allows this behavior, i.e., strcoll is allowed
to return 0 even when its arguments are not byte-for-byte identical. I
observed this behavior in glibc with this program:
#include <locale.h>
#include <string.h>
#include <stdio.h>
int
main (void)
{
if (! setlocale (LC_ALL, "en_US.UTF-8"))
perror ("setlocale");
return strcoll("\xc9\xa2", "\xc9\xac") == 0;
}
On my Fedora host this silently exits with status 1. Although this is
arguably a bug, I don't see where it violates POSIX.
It seems like there are two distinct use cases here: 1) grab a short,
fixed-length tag (effectively a collation-preserving hash); collisions
are OK, caller promises to fall back to strcoll() (PostgreSQL,
sort(1). 2) smuggle a locale-aware comparison into an API that only
allows you to provide a key, not a comparator (Python 3). Maybe it's
time to invent a new API? zw
More generally for case (1) one may also want to discard a leading
prefix as described above.