This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: hash table subtleties


Tel <telford@eng.uts.edu.au> writes:

 > > Tcl hash tables resize when the number of items = 3 x number of
 > > buckets.  At this point it increases the number of buckets by a factor
 > > of 4.  This prevents resizing too often, and maybe even is more
 > > reasonable than resizing based on max bucket size because if
 > > everything's getting tossed into one bucket, why keep doubling the
 > > number of buckets?  This'll just suck up tons of memory for little
 > > effect.  I don't know if 3 & 4 are the best numbers, but at least this
 > > method won't blow up in the worst case - i.e. - yield a hash table of
 > > size 2^(number of elements).  Also, it's less bookkeeping.
 > 
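For concreteness, here's a minimal sketch of that growth rule in guile
(illustration only - the table layout and the names make-table,
table-set! and resize! are made up for this example, not anything in
guile's actual hash table code):

;; A "table" here is just a pair: (entry-count . bucket-vector),
;; where each bucket is an alist.  Duplicate keys aren't handled;
;; the point is only to show when the table grows.

(define (make-table size)
  (cons 0 (make-vector size '())))

(define (table-set! table key value)
  (let* ((buckets (cdr table))
         (n (vector-length buckets))
         (i (hash key n)))
    (vector-set! buckets i (cons (cons key value) (vector-ref buckets i)))
    (set-car! table (+ 1 (car table)))
    ;; Grow when the entry count reaches 3x the number of buckets,
    ;; and enlarge the bucket vector by a factor of 4.
    (if (>= (car table) (* 3 n))
        (resize! table (* 4 n)))))

(define (resize! table new-size)
  (let ((old (cdr table))
        (new (make-vector new-size '())))
    (do ((i 0 (1+ i)))
        ((= i (vector-length old)))
      (for-each (lambda (entry)
                  (let ((j (hash (car entry) new-size)))
                    (vector-set! new j (cons entry (vector-ref new j)))))
                (vector-ref old i)))
    (set-cdr! table new)))

With this rule the average chain length never exceeds 3, and the bucket
vector stays within a small constant factor of the number of entries no
matter what the hash function does.
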
 > A hash table with n buckets is storing k keys.
 > The probability of a given key falling into a given bucket
 > with a perfectly random hashing function is 1/n.
 > Using the binomial theorem, the probability of a given
 > bucket containing exactly m keys is:
 > 
 >                m       m          k-m
 >        p(m) = ( ) (1/n)  (1 - 1/n)    
 >                k

That should be k choose m, not m choose k, i.e. - 

            / k \       m         k-m
    p(m) = |     | (1/n)   (1-1/n)
            \ m /

I assume it's just a typo.
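As a quick check of the corrected formula: with n = k = 55 and m = 0 it
gives p(0) = (54/55)^55, which is roughly 0.3645 - the first entry in
both tables below.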

 > I define utilisation of the hash table as the number of stored keys
 > divided by the number of buckets (i.e. the average keys per bucket)
 > so n * u = k.

 > Using the Stirling approximation of a factorial in order to calculate
 > the number of combinations I was able to plot cumulative probabilities
 > for a constant utilisation of 3 and with n=10 buckets up to n=55 buckets
 > (after that the factorials get too big). Over this region the number of
 > buckets has NO EFFECT on the cumulative probability curve, given that the
 > utilisation is fixed. I would guess that the shape of the cumulative
 > probability curve is only dependent on the utilisation.
 > 
 > I also tried fixing the buckets at 55 and changing the utilisation:
 > 
 > # utilisation, keys in bucket, cumulative probability
 >  1 0 0.36451
 >  1 1 0.73578
 >  1 2 0.929255
 >  1 3 0.991696
 >  1 4 1.00663
 >  1 5 1.00943
 > 
 >  2 0 0.132867
 >  2 1 0.403524
 >  2 2 0.688219
 >  2 3 0.875442
 >  2 4 0.96755
 >  2 5 1.00356
 > 
 >  3 0 0.0484314
 >  3 1 0.196416
 >  3 2 0.43062
 >  3 3 0.663072
 >  3 4 0.836216
 >  3 5 0.939033

These numbers are off.  They should all be <=1.  They're close enough,
but for the hell of it I computed the above directly in guile:

;; subfact n k = n * (n-1) * ... * k, i.e. the product of the
;; integers from k up to n (1 if n < k).
(define (subfact n k)
  (define (subfactloc n k result)
    (cond ((< n k) result)
          (else (subfactloc (- n 1) k (* n result)))))
  (cond ((< n k) 1)
        (else (subfactloc n k 1))))

;; choose n k = n!/(k!(n-k)!), computed as
;; [n * (n-1) * ... * (n-k+1)] / k!.
(define (choose n k)
  (/ (subfact n (- n k -1))
     (subfact k 1)))

;; p n k m = probability that a given bucket gets exactly m of the
;; k keys with n buckets: (k choose m) (1/n)^m (1-1/n)^(k-m),
;; written here as (k choose m) (n-1)^(k-m) / n^k.
(define (p n k m)
  (/ (* (choose k m)
        (expt (- n 1) (- k m)))
     (expt n k)))

;; cum-p n k m = probability that the bucket gets at most m keys.
(define (cum-p n k m)
  (do ((i m (- i 1))
       (r 0))
      ((< i 0) r)
    (set! r (+ r (p n k i)))))

;; Print m and f(n, k, m) for m = 0 .. l.
(define (tabulate n k l f)
  (do ((m 0 (1+ m)))
      ((> m l))
    (display m)
    (display "   ")
    (display (f n k m))
    (newline)))

guile> (tabulate 55 55 5 cum-p)
0   0.364509513859422
1   0.735769203901426
2   0.921399048922428
3   0.982129800688558
4   0.996750166854479
5   0.999511791574708
guile> (tabulate 55 110 5 cum-p)
0   0.132867185694032
1   0.403522563959654
2   0.676684010542549
3   0.858791641597813
4   0.949002366240929
5   0.984418428508227
guile> (tabulate 55 165 5 cum-p)
0   0.0484313532652013
1   0.196416043797761
2   0.421133536828684
3   0.647238174878317
4   0.816816653415542
5   0.91793567209885

I can get up to around 100 or so before guile starts having trouble
dividing bignums.
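
An aside: one way around the bignum trouble, and a hint at why the
curves seem to depend only on the utilisation, is that with u = k/n held
fixed and n large, the binomial above is well approximated by a Poisson
distribution with mean u.  A rough sketch (poisson-p and poisson-cum-p
are just names I made up here), using floating point throughout:

;; Poisson approximation: p(m) is approximately e^-u * u^m / m!,
;; where u = k/n is the utilisation.
(define (poisson-p u m)
  (do ((i 1 (1+ i))
       (r (exp (- u)) (* r (/ u i))))
      ((> i m) r)))

;; Cumulative version: probability of at most m keys in a bucket.
(define (poisson-cum-p u m)
  (do ((i 0 (1+ i))
       (r 0 (+ r (poisson-p u i))))
      ((> i m) r)))

For example, (poisson-cum-p 3.0 3) comes out around 0.647, in line with
the exact value above, and it doesn't care how many buckets there are.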

 > > In fact, the more I think about it, the less it seems like growing
 > > based on max bucket size is a good idea.  If we start with a table
 > > size of 4 & double the table size when the max bucket size passes 3,
 > > and we have the bad luck that everything hashes to the same value,
 > > then after n insertions we end up with an array of size 2^(n-1)
 > 
 > Given the low utilisation that such a table would have, the probability
 > of this happening with a hash function that approximates a random
 > variable is vanishingly small. If it happened, I would be looking for
 > bugs elsewhere. Much the same as when I hear a loud explosion I don't
 > think, ``what a surprise, by some amazing fluke a huge number of randomly
 > moving air molecules all struck my eardrum at exactly the same moment''.

The two key assumptions are that the hash function approximates a
random variable and that the input data is randomly distributed.
Neither is really the case.  Have some pity on the poor guy who's
hacking away & has a bug in his hashing function.  Do you want to blow
him out of the water with seg faults because of your resizing policy
or should he just see linear performance?  I think it's important to
use a method which is robust in the face of all inputs.
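
To put rough numbers on the worst case: if every key lands in one
bucket, a count-based rule like the one above (grow when the key count
hits 3x the bucket count) still leaves the bucket vector within a small
constant factor of the number of keys, plus one long chain - i.e. linear
searches.  Growing on a maximum chain length of 3, starting from 4
buckets and doubling each time, gives 2^(n-1) buckets after n
insertions, which is memory exhaustion rather than slow lookups.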

 > I think that growing based on the maximum bucket size should be OK,
 > but attempting to keep a constant utilisation is easier to implement
 > and its behaviour is easier to understand.

The fact is that in practice they'll all be fine, because on average
they'll all exhibit similar behavior.  Given that, it makes sense to
choose the one with the most robust worst-case behavior, even if the
worst case is unlikely to occur in practice.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il