This is the mail archive of the guile@cygnus.com mailing list for the guile project.
Jay Glascoe <jglascoe@jay.giss.nasa.gov> writes:

> On Wed, 28 Oct 1998, Tel wrote:
>
> > > I'd like to also repeat what I mentioned elsewhere - namely if you
> > > shrink the tables on deletions you can easily get the bad behavior of
> > > resizing every few operations (on a mix of insertions & deletions)
> > > causing the hash tables to give O(n) behavior instead of O(1).
> > > They'll be slower than an alist, let alone a balanced tree.
> >
> > Put some hysteresis into the grow and shrink. For example, when
> > utilisation is greater than 3, grow, when it is less than 1, shrink.
> > This means that resize cannot possibly occur every few operations.

Basically, but not exactly. It depends on how much you grow the tables.
If you grow & shrink by a factor of 3, then your example wouldn't yield
any hysteresis. For a factor of 2 it'd be ok, but after shrinking the
table won't be in the same state it was in before growing, so I'd
recommend a shrink threshold of 3/4.

> my tables grow (double the number of buckets) when the mean nonempty
> bucket size exceeds 3, and shrink (halve the number of buckets) when the
> mean nonempty bucket size is less than 3 * 119/256 = 1.395 (3 * 1/2 is too
> big, 3 * 1/4 is too small, ... 3 * 951/2048 is very close)
>
> The magic ratio is chosen so that the expected mean nonempty bucket size
> after upsize is the same as it is after downsize. (and then, e.g. if a
> user inserts just enough entries for the table to grow to 4096 buckets, he
> must delete half of them before it will shrink back down to 2048).

Yes, this is the property you want, which is that
min_avg_bucket_size = max_avg_bucket_size / growth_factor^2. So why do
you say 3/4 is too small? If the avg bucket size is 3/4 and you halve
the table size, then the mean bucket size will be 3/2, which is exactly
what you have after doubling when you've hit the max_avg_bucket_size
of 3.

The thing I'm most concerned about is averaging the nonempty bucket
sizes instead of all the bucket sizes.
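[The hysteresis argument above can be checked with a small sketch. This is a hypothetical Python simulation of the policy being discussed (doubling/halving with thresholds 3 and 3/4), not Guile's actual hash table code; all names are illustrative.]

```python
# Resize policy with hysteresis: grow (double the buckets) when the
# overall average bucket size exceeds MAX_AVG; shrink (halve) when it
# drops below MIN_AVG = MAX_AVG / growth_factor**2.

GROWTH_FACTOR = 2
MAX_AVG = 3.0
MIN_AVG = MAX_AVG / GROWTH_FACTOR**2   # 3/4

def resized_buckets(n_items, n_buckets):
    """Return the bucket count after applying the resize policy once."""
    avg = n_items / n_buckets
    if avg > MAX_AVG:
        return n_buckets * GROWTH_FACTOR
    if avg < MIN_AVG and n_buckets > GROWTH_FACTOR - 1:
        return n_buckets // GROWTH_FACTOR
    return n_buckets

# Just past the grow threshold: average falls to ~3/2 after doubling.
n = resized_buckets(3 * 1024 + 1, 1024)   # grows to 2048
print((3 * 1024 + 1) / n)                 # ~1.5

# Just below the shrink threshold: average rises to ~3/2 after halving,
# the same state as after a grow -- so a mix of insertions and deletions
# can't bounce the table between resizes every few operations.
m = resized_buckets(767, 1024)            # 767/1024 < 3/4, shrinks to 512
print(767 / m)                            # ~1.5
```

Either way the table lands in the same post-resize state (average ~3/2), which is precisely the min = max/growth_factor^2 property described above.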
If you resize on average bucket size, then in the worst case (where
everything hashes to the same bucket) you'll have 1 bucket with 3*N
items and a hash table of size N, yielding O(N) lookup times. If you
resize on average *nonempty* bucket size, then your worst case is
O(2^N)!!! This is the difference between bad performance in the worst
case and crashing with out of memory in the worst case. If the input
data all lands in one bucket, then once that bucket passes
max_avg_bucket_size, you'll double the hash table for every insertion.
This will happen (slightly slower) whenever you end up with a small
bounded number of hash values.

When the hash table is well behaved it won't make a difference. When
the hash table has empty buckets (probably common), using
nonempty_bucket_size_average will resize a little sooner, but it
probably doesn't make much of a difference. When the hash function is
clustering, then using nonempty_bucket_size_average will rehash much
sooner, keeping the alists shorter, but at the expense of having a
larger vector. In the worst cases, using nonempty_bucket_size_average
causes catastrophic failure, whereas using bucket_size_average will be
slow but not catastrophic. Is the probably marginal speedup worth the
risk of catastrophic failure?

--
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il
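[Editorial note: the worst case described above can be reproduced with a short simulation. This is a hypothetical Python sketch, not Guile's implementation; it models every key hashing to a single bucket and compares the two resize criteria.]

```python
# Worst case: every key lands in one bucket.  With the overall-average
# criterion the table grows only as the item count warrants; with the
# nonempty-average criterion, every insertion past the threshold
# doubles the table, giving O(2^N) buckets for N insertions.

MAX_AVG = 3.0

def buckets_after_inserts(n_items, use_nonempty_avg):
    n_buckets = 1
    for i in range(1, n_items + 1):
        # After inserting item i, one bucket holds all i items;
        # the remaining n_buckets - 1 are empty.
        nonempty = 1
        avg = i / (nonempty if use_nonempty_avg else n_buckets)
        if avg > MAX_AVG:
            n_buckets *= 2
    return n_buckets

print(buckets_after_inserts(20, use_nonempty_avg=False))  # -> 8
print(buckets_after_inserts(20, use_nonempty_avg=True))   # -> 131072 (2**17)
```

Twenty pathological insertions already cost 2^17 buckets under the nonempty-average rule, versus 8 under the overall-average rule, which is the "slow but not catastrophic" versus "out of memory" distinction made above.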