Note on choosing string hash functions
Pedro Alves
palves@redhat.com
Wed Nov 22 02:10:00 GMT 2017
On 11/17/2017 01:42 PM, Pedro Alves wrote:
> On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
>> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>>
>> r = r * 67 + c - 113
>>
>> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>>
>> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>>
>> Everyone assumes that 'tolower' is simple, fast, and usually implemented
>> as a macro. But when >1M of mangled C++ symbols should be hashed, results
>> may be somewhat surprising ('perf report'):
>
> I've noticed tolower show up in perf too, along with isspace,
> in strncmp_iw/strncmp_iw_with_mode and friends.
>
> I think the tolower is there for source languages that
> are case insensitive, like Ada and Fortran. (Or if you do
> "set case-sensitive off", but I think that's a
> misfeature...).
>
> I haven't explicitly tried measuring perf with this patch:
>
> https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html
>
> but I think it should improve performance significantly,
> since safe-ctype.h is locale independent.
I looked at performance now, and while there's an improvement,
it isn't nearly as significant as I'd hope.
> (
> Of course, this ignores non-ASCII code points, but I'm not
> sure we really handle non-ASCII correctly everywhere else,
> i.e., whether it makes a difference today. Certainly the
> locale when the program was compiled generally has no relation
> to the current user's locale, even though they frequently
> match. If it does make a difference, one way to handle
> it that may be still cheap is to hash all non-ASCII (and utf8
> multi-byte sequences) to the same, while still doing proper
> lowercase comparisons when actually comparing symbols that end
> up in the bucket. Something like:
>
> #define SYMBOL_HASH_NEXT(hash, c) \
> ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)
>
> I'd assume that most symbol names are wholly or at least mostly
> ASCII and that that wouldn't result in too many collisions.
>
> But then again, I don't really know how compilers for case-insensitive
> languages handle non-ASCII symbol names in different cases, especially
> considering multibyte sequences.
I'm looking at merging the rest of that series to master, and
so I've been looking at this issue today, mostly to convince myself
that the tolower -> TOLOWER change can't cause a regression.
It could cause a difference for case-insensitive languages,
if e.g., Latin-1 encoded symbol names could appear in the minimal
symbol tables, _and_ GDB is running with a Latin-1 locale too.
In that case, with tolower, we'd hash "funcá" / "funcÃ" to the same,
but not with TOLOWER, which is strictly ASCII.
I'm pretty convinced that scenario is not real.
First, nowadays UTF-8 locales are pretty much the norm in most
distros (over other ASCII-superset charsets), and tolower on the
individual bytes does nothing on UTF-8.
Then, I played with making Ada/gnat and both Latin-1 and UTF-8 sources
files (the latter with "pragma Wide_Character_Encoding (UTF8)"), and
what I discovered was that Ada's encoding/mangling guarantees that only
ASCII characters end up in mangled names. From gcc/ada/namet.ads:
~~~
-- Identifiers Stored with upper case letters folded to lower case.
-- Upper half (16#80# bit set) and wide characters are
-- stored in an encoded form (Uhh for upper half char,
-- Whhhh for wide characters, WWhhhhhhhh as provided by
-- the routine Append_Encoded, where hh are hex
-- digits for the character code using lower case a-f).
-- Normally the use of U or W in other internal names is
-- avoided, but these letters may be used in internal
-- names (without this special meaning), if they appear
-- as the last character of the name, or they are
-- followed by an upper case letter (other than the WW
-- sequence), or an underscore.
~~~
Funny enough, GDB doesn't grok this Uhh/WWhhhhhhhh encoding today.
(I wrote a quick patch to teach GDB about it, to help convince myself,
though as is, it only works when gdb's charset/locale is UTF-8.)
Then I looked at Fortran, but I couldn't make gfortran use/grok
non-ASCII identifiers. Looking at
https://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html
and
https://github.com/jacobwilliams/json-fortran/wiki/Unicode-How-To
I believe that's just not supported. So I moved on.
Then there's C/C++. But even here both GCC and Clang
only allow UTF-8 in identifiers (GCC via UCNs, and Clang because
that's the only charset that it supports), and since
GCC uses UTF-8 internally, I believe that that's the encoding it
always uses for mangling. That's what I see in my experiments.
I tried --input-charset=ISO-8859-1 --exec-charset=ISO-8859-1
with a Latin-1 source file, and I made sure I had a non-ASCII
identifier as well as a Latin-1 string literal in there, and then confirmed
by debugging the result with gdb that the string literal really was
Latin-1. Still, the mangled identifier name is encoded in UTF-8
(confirmed with hexdump). AFAICT, I always see UTF-8 in strings in DWARF
too (and the DWARF spec recommends this). So tolower/TOLOWER can't make
a difference here either. That's good, it means the ABI/mangling doesn't
depend on whatever's the host charset, and neither does the DWARF. It's
possible other compilers/mangling schemes/ABIs do something else, but
at this point I'm not aware of any, but I hope that that's nothing we'd
need to support.
I think Rust is UTF-8 always. I'd assume Go too, given
its authorship/ancestry...
In the performance aspect, there's a little bit of
improvement. With this:
$ cat ff.cmd
set pagination off
define ff
attach 19018 # Firefox's PID.
thread apply all bt
detach
file
end
set $count = 10
while $count != 0
ff
set $count = $count - 1
end
$ time $g -q --batch --readnever -x ff.cmd
I observe between 3% / 10% time drop.
(I used Joel's version of --readnever from here:
https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html)
So in sum, I'm pretty convinced the patch is safe as is.
Thanks,
Pedro Alves
More information about the Gdb
mailing list