
Re: hash table subtleties


>  >     p(m) = (m choose k) (1/n)^m (1 - 1/n)^(k-m)
> 
> That should be k choose m, not m choose k, i.e. - 
> 
>     p(m) = (k choose m) (1/n)^m (1 - 1/n)^(k-m)
> 
> I assume it's just a typo.

Actually I coded it both ways, found that one gave numbers that looked
like what I wanted, then typed the wrong one into the report.
I have trouble remembering phone numbers too.

> These numbers are off.  They should all be <=1.  They're close enough,

The Stirling formula is only an approximation to the factorial, hence the
values creeping slightly over 1 <shrug>

> but for the hell of it I computed the above directly in guile:

Good to see; I did mine in C using doubles, so guile bignums beat
C doubles for both accuracy and usable range. I'll bet that my program
got the answer quicker though... Actually, I'm glad that you posted the
guile code because it is really short and neat and a good example of
solving math problems using scheme. Now I can check that when utilisation
is constant, the cumulative probability curve doesn't change with bucket
sizes up to 100.
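
Harvey's actual guile code isn't repeated here, but for the record
something along these lines does the same job with exact arithmetic,
so there is no overflow or roundoff to worry about (the function names
are mine, purely illustrative):

(define (choose k m)
  ;; binomial coefficient k! / (m! (k-m)!), built up with exact integers
  (let loop ((i 0) (acc 1))
    (if (= i m)
        acc
        (loop (+ i 1) (/ (* acc (- k i)) (+ i 1))))))

(define (p n k m)
  ;; probability that a given one of the n buckets holds exactly m of the
  ;; k keys, assuming the hash behaves like a uniform random variable
  (* (choose k m)
     (expt (/ 1 n) m)
     (expt (- 1 (/ 1 n)) (- k m))))

(define (cumulative n k m)
  ;; probability that a given bucket holds at most m keys
  (let loop ((i 0) (acc 0))
    (if (> i m)
        acc
        (loop (+ i 1) (+ acc (p n k i))))))

;; with constant utilisation k/n the curve hardly moves as n grows:
(exact->inexact (cumulative  10  10 3))
(exact->inexact (cumulative 100 100 3))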

>  > Given the low utilisation that such a table would have, the probability
>  > of this happening with a hash function that approximates a random
>  > variable is vanishingly small. If it happened, I would be looking for
>  > bugs elsewhere. Much the same as when I hear a loud explosion I don't
>  > think, ``what a surprise, by some amazing fluke a huge number of randomly
>  > moving air molecules all struck my eardrum at exactly the same moment''.
> 
> The two key assumptions are that the hash function approximates a
> random variable and that the input data is randomly distributed.
> Neither is really the case.

I see only one assumption here: the idea of a good hash algorithm
is that it converts non-random input data into what looks like a random
variable. The distribution of the input data really should not matter.
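
To make that concrete, here is a throwaway string hash of the usual
multiply-and-add sort (this is not guile's own hashing routine, just an
illustration of turning structured keys into a bucket index):

(define (string-hash-sketch str n)
  ;; illustrative only: every character is mixed into the accumulator,
  ;; so the final bucket index depends on the whole key, not on any
  ;; pattern in the input
  (modulo (let loop ((chars (string->list str)) (h 0))
            (if (null? chars)
                h
                (loop (cdr chars)
                      (+ (* h 31) (char->integer (car chars))))))
          n))

;; e.g.
(string-hash-sketch "guile" 64)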

> Have some pity on the poor guy whose
> hacking away & has a bug in his hashing function.  Do you want to blow
> him out of the water with seg faults because of your resizing policy
> or should he just see linear performance?  I think it's important to
> use a method which is robust in the face of all inputs.
> 
>  > I think that growing based on the maximum bucket size should be OK
>  > but attempting to keep constant utilisation is easier to implement and
>  > easier to understand the behaviour.
> 
> The fact is that they'll all be ok on average in practice because
> they'll all exhibit similar behavior on average.  Given that, it makes
> sense to choose the one which has the most robust worst case behavior,
> even if it's unlikely to occur in practice.

OK, how about this proposal (for those who panic about the worst case):
Define ``bucket vector utilisation'' as the number of non-empty buckets
divided by the total number of buckets (not a difficult number to keep
track of). Choose an upper and a lower bound for bucket vector
utilisation: grow when you rise above the upper bound, shrink when you
drop below the lower bound (thus still giving some hysteresis).
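
A sketch of that policy, with the two bounds and the doubling/halving
step pulled out of the air for illustration (the table itself only needs
to maintain a running count of its non-empty buckets):

(define lower-bound 1/4)    ; purely illustrative values
(define upper-bound 9/10)   ; must stay below 1

(define (next-bucket-count non-empty total)
  ;; NON-EMPTY is the running count of non-empty buckets, TOTAL the
  ;; current length of the bucket vector; returns the new length, or #f
  ;; to leave the table alone
  (let ((bvu (/ non-empty total)))
    (cond ((> bvu upper-bound) (* 2 total))
          ((< bvu lower-bound) (max 1 (quotient total 2)))
          (else #f))))

;; e.g. 58 of 64 buckets occupied is over the upper bound, so double:
(next-bucket-count 58 64)   ; => 128

Whatever does the actual rehashing would call this after each insert or
delete and resize only when it returns a new length; the gap between the
two bounds is what supplies the hysteresis.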

In the average case (based on the probability stuff) bucket vector
utilisation will be 1 minus the probability of an empty bucket, from
Harvey's figures:

table utilisation              bucket vector utilisation
      1.0                          0.635
      2.0                          0.867
      3.0                          0.952
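
Those figures drop out of the same distribution: a particular bucket is
empty with probability (1 - 1/n)^k, so the expected bucket vector
utilisation is 1 - (1 - 1/n)^k, which approaches 1 - e^(-u) at constant
table utilisation u = k/n. A quick check in guile (n = 64 is just a
guess at the bucket count behind the figures; any largish n gives nearly
the same numbers):

(define (expected-bvu n k)
  ;; expected fraction of non-empty buckets for k keys in n buckets
  (- 1 (expt (- 1 (/ 1 n)) k)))

(exact->inexact (expected-bvu 64  64))   ; utilisation 1.0  ->  ~0.635
(exact->inexact (expected-bvu 64 128))   ; utilisation 2.0  ->  ~0.867
(exact->inexact (expected-bvu 64 192))   ; utilisation 3.0  ->  ~0.951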

So on the average, choosing upper and lower bounds for bucket
vector utilisation is much the same as choosing upper and lower
bounds for table utilisation. You have to make sure that your
upper bound is less than 1.0! Finding really good values may take
some fiddling but you have the average case covered and you have
some theory to make you feel righteous.

On the other hand, in the worst case of all keys falling into
the same bucket, the bucket vector utilisation won't change, so the
table will never grow (i.e. linear performance, but no memory wasted).

It should be pointed out that bucket vector utilisation can never
be greater than table utilisation. They will be equal in the rather
amazing case where every key falls into a different bucket, which
results in a slightly under-utilised table with very fast lookup.

Note that it may take more than one grow to GUARANTEE a reduction
in bucket vector utilisation, but limiting the system to a single grow
per insert should still be sufficient.

	- Tel