This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Note on choosing string hash functions

From: Pedro Alves <palves at redhat dot com>
To: Dmitry Antipov <dantipov at nvidia dot com>, gdb at sourceware dot org
Date: Wed, 22 Nov 2017 02:10:34 +0000
Subject: Re: Note on choosing string hash functions
Authentication-results: sourceware.org; auth=none
References: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com> <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com>

On 11/17/2017 01:42 PM, Pedro Alves wrote:
> On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
>> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>>
>> r = r * 67 + c - 113
>>
>> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>>
>> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>>
>> Everyone assumes that 'tolower' is simple, fast, and usually implemented
>> as a macro. But when >1M of mangled C++ symbols should be hashed, results
>> may be somewhat surprising ('perf report'):
> 
> I've noticed tolower show up in perf too, along with isspace,
> in strncmp_iw/strncmp_iw_with_mode and friends.
> 
> I think the tolower is there for source languages that
> are case insensitive, like Ada and Fortran.  (Or if you do
> "set case-sensitive off", but I think that's a
> misfeature...).
> 
> I haven't explicitly tried measuring perf with this patch:
> 
>  https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html
> 
> but I think it should improve performance significantly,
> since safe-ctype.h is locale independent.

I looked at performance now, and while there's an improvement,
it isn't nearly as significant as I'd hope.

> (
> Of course, this ignores non-ASCII code points, but I'm not
> sure we really handle non-ASCII correctly everywhere else,
> i.e., whether it makes a difference today.  Certainly the
> locale when the program was compiled generally has no relation
> to the current user's locale, even though they frequently
> match.  If it does make a difference, one way to handle
> it that may be still cheap is to hash all non-ASCII (and utf8
> multi-byte sequences) to the same, while still doing proper
> lowercase comparisons when actually comparing symbols that end
> up in the bucket.  Something like:
> 
> #define SYMBOL_HASH_NEXT(hash, c)			\
>   ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)
> 
> I'd assume that most symbol names are wholly or at least mostly
> ASCII and that that wouldn't result in too many collisions.
> 
> But then again, I don't really know how compilers for case-insensitive
> languages handle non-ASCII symbol names in different cases, especially
> considering multibyte sequences.

I'm looking at merging the rest of that series to master, and
so I've been looking at this issue today, mostly to convince myself
that the tolower -> TOLOWER change can't cause a regression.

It could cause a difference for case-insensitive languages,
if e.g., Latin-1 encoded symbol names could appear in the minimal
symbol tables, _and_ GDB is running with a Latin-1 locale too.
In that case, with tolower, we'd hash "funcá" / "funcÁ" to the same,
but not with TOLOWER, which is strictly ASCII.
I'm pretty convinced that scenario is not real.

First, nowadays UTF-8 locales are pretty much the norm in most
distros (over other ASCII-superset charsets), and tolower on the
individual bytes does nothing on UTF-8.

Then, I played with making Ada/gnat and both Latin-1 and UTF-8 sources
files (the latter with "pragma Wide_Character_Encoding (UTF8)"), and
what I discovered was that Ada's encoding/mangling guarantees that only
ASCII characters end up in mangled names.  From gcc/ada/namet.ads:

~~~
--    Identifiers        Stored with upper case letters folded to lower case.
--                       Upper half (16#80# bit set) and wide characters are
--                       stored in an encoded form (Uhh for upper half char,
--                       Whhhh for wide characters, WWhhhhhhhh as provided by
--                       the routine Append_Encoded, where hh are hex
--                       digits for the character code using lower case a-f).
--                       Normally the use of U or W in other internal names is
--                       avoided, but these letters may be used in internal
--                       names (without this special meaning), if they appear
--                       as the last character of the name, or they are
--                       followed by an upper case letter (other than the WW
--                       sequence), or an underscore.
~~~

Funny enough, GDB doesn't grok this Uhh/WWhhhhhhhh encoding today.
(I wrote a quick patch to teach GDB about it, to help convince myself,
though as is, it only works when gdb's charset/locale is UTF-8.)

Then I looked at Fortran, but I couldn't make gfortran use/grok
non-ASCII identifiers.  Looking at 
  https://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html
and
  https://github.com/jacobwilliams/json-fortran/wiki/Unicode-How-To

I believe that's just not supported.  So I moved on.

Then there's C/C++.  But even here both GCC and Clang
only allow UTF-8 in identifiers (GCC via UCNs, and Clang because
that's the only charset that it supports), and since
GCC uses UTF-8 internally, I believe that that's the encoding it
always uses for mangling.  That's what I see in my experiments.
I tried --input-charset=ISO-8859-1 --exec-charset=ISO-8859-1
with a Latin-1 source file, and I made sure I had a non-ASCII
identifier as well as a Latin-1 string literal in there, and then confirmed
by debugging the result with gdb that the string literal really was
Latin-1.  Still, the mangled identifier name is encoded in UTF-8
(confirmed with hexdump).  AFAICT, I always see UTF-8 in strings in DWARF
too (and the DWARF spec recommends this).  So tolower/TOLOWER can't make
a difference here either.    That's good, it means the ABI/mangling doesn't
depend on whatever's the host charset, and neither does the DWARF.  It's
possible other compilers/mangling schemes/ABIs do something else, but
at this point I'm not aware of any, but I hope that that's nothing we'd
need to support.

I think Rust is UTF-8 always.  I'd assume Go too, given
its authorship/ancestry...

In the performance aspect, there's a little bit of
improvement.  With this:

$ cat ff.cmd 
set pagination off

define ff
  attach 19018 # Firefox's PID.
  thread apply all bt
  detach
  file
  end

set $count = 10
while $count != 0
  ff
  set $count = $count - 1
  end

$ time $g -q --batch --readnever -x ff.cmd

I observe between 3% / 10% time drop.

(I used Joel's version of --readnever from here:
 https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html)

So in sum, I'm pretty convinced the patch is safe as is.

Thanks,
Pedro Alves

Follow-Ups:
- Re: Note on choosing string hash functions
  - From: Pedro Alves
- Re: Note on choosing string hash functions
  - From: Dmitry Antipov

References:
- Note on choosing string hash functions
  - From: Dmitry Antipov
- Re: Note on choosing string hash functions
  - From: Pedro Alves

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]