circular random number distribution

Robert G. Brown rgb@phy.duke.edu
Tue Aug 9 12:32:00 GMT 2005


Jochen Küpper writes:

> Ok, I did some quick tests, the times are for the creation of 1e5 xy
> pairs plus some constant cruft done in all runs -- runtime is always
> 3.4s (+-0.5) for all approaches.... (This is not a good benchmark(!),
> it's only meant to give me an idea of what is going on.)

To do a "proper" benchmark, wrap the core loop in a pair of
gettimeofday() calls, making sure that you loop inside the pair order of
a second.  Extract the start/end timings, subtract, print out the delta.
Also beware WHICH rng you are using -- if you are using e.g. rndlxd2 it
is SO slow that IT becomes the rate limiting factor instead of the
trancendentals, which might explain the relatively uniform times.
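
Something like this minimal sketch, say (REPS is an assumption, tuned
so the loop runs about a second; summing the results keeps the compiler
from optimizing the loop away):

#include <stdio.h>
#include <sys/time.h>
#include <gsl/gsl_rng.h>

int main(void)
{
    const long REPS = 10000000;   /* assumption: tune for ~1 second */
    gsl_rng *rng = gsl_rng_alloc(gsl_rng_mt19937);
    struct timeval start, end;
    double sum = 0.0, delta;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < REPS; i++)
        sum += gsl_rng_uniform(rng);   /* the operation under test */
    gettimeofday(&end, NULL);

    /* delta in seconds; divide into REPS for the rate */
    delta = (end.tv_sec - start.tv_sec)
          + (end.tv_usec - start.tv_usec) * 1.0e-6;
    printf("%.1f nsec/rand (sum=%g keeps the loop honest)\n",
           1.0e9 * delta / REPS, sum);

    gsl_rng_free(rng);
    return 0;
}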

You'll only see differentiation if the time required to generate a rand
is <= the time for a transcendental call, which is the case for many
high quality rngs in the GSL but (as you can see below) NOT ALL!  I
would generally recommend the mt19937 generators as being BOTH fast AND
high quality -- they manage to get through the five-bit random test in
my nascent random number tester, which is the best that any of them do.
None of the rngs (I've tested) manage to generate rands with all
possible six-bit patterns uniformly distributed with a decent p-value,
and of course that means that 7+ bits are also screwed.  Or else I'm
not testing right, which is still very much a possibility...;-)

> I tried:
>   rejection,
>   trigonometric without sqrt,
>   trigonometric + sqrt,
>   gsl_ran_dir_2d + sqrt,
>   gsl_ran_dir_2d_trig_method + sqrt.
>
> "Robert G. Brown" <rgb@phy.duke.edu> writes:
>
>>> | phi = gsl_ran_flat(rng, 0, 2.0*PI);
>>> | r = gsl_ran_flat(rng, 0, 1);
>>> | x = sqrt(r)*cos(phi);
>>> | y = sqrt(r)*sin(phi);
>
>> I'm probably late with this (sorry, busy weekend) but: Your
>> accept-reject method is almost certainly much faster than using
>> ANYTHING with transcendental calls in it.
>
> Well, I am told that some machines (i.e. PCs) have these functions in
> hardware.

They are, sort of (in the modern descendants of the venerable 8087
instruction set).  However, "in hardware" does not necessarily translate
into "as fast as floating point multiplication and addition", which is
what most random number generators need to make a new rand.  According
to my direct benchmark of transcendentals (the savage benchmark, if
anybody remembers it: tan(atan(exp(log(sqrt(xtest*xtest)))))), my
current laptop kicks out 2.44x10^6 of these per second.  A direct
benchmark of the mt19937_1999 GSL rng yields something like 15x10^6 per
second on the same laptop.  One "wastes" a fraction (4 - pi)/4 of the
rng pairs with accept/reject -- call it 22% -- so you have to generate
roughly one extra/wasted rand pair for every four you keep.  This costs
you 22% of 132 nanoseconds (two rands at ~66 nanoseconds each) -- call
it 30 nanoseconds -- per successful accept.
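
For the curious, a self-contained sketch of that savage-style
measurement (REPS is an assumption, sized so the loop runs about a
second; compile with -lm):

#include <math.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long REPS = 1000000;   /* assumption: tune for ~1 second */
    double xtest = 1.0, delta;
    struct timeval start, end;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < REPS; i++)
        /* the classic savage kernel: should return xtest unchanged */
        xtest = tan(atan(exp(log(sqrt(xtest * xtest)))));
    gettimeofday(&end, NULL);

    delta = (end.tv_sec - start.tv_sec)
          + (end.tv_usec - start.tv_usec) * 1.0e-6;
    printf("%.2e savage iterations/sec (xtest=%g)\n",
           REPS / delta, xtest);
    return 0;
}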

OTOH, each transcendental call costs perhaps 82 nanoseconds.  You need
three of them per pair if you rewrite your code to avoid the extra sqrt
call, since it is much faster to do:

{ double sqrtr;
  sqrtr = sqrt(r);      /* one sqrt call... */
  x = sqrtr*cos(phi);   /* ...reused for both coordinates */
  y = sqrtr*sin(phi);
}

than to call sqrt twice.  In fact it is probably faster to generate y
from x and r and a sqrt call since sqrt is probably a couple or three
times faster than sin or cos, but this costs you precision.
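
A sketch of that variant, with a hypothetical helper disk_point(); the
sign of sin(phi) is recovered from phi itself, and fmax() guards the
subtraction against rounding below zero:

#include <math.h>

/* Hypothetical helper: given r in [0,1) and phi in [0, 2*pi), produce
   (x, y) on the unit disk with one cos() and two sqrt() calls instead
   of cos() + sin() + sqrt(). */
void disk_point(double r, double phi, double *x, double *y)
{
    double sqrtr = sqrt(r);
    *x = sqrtr * cos(phi);
    /* x^2 + y^2 = r, so |y| = sqrt(r - x^2); sin(phi) is non-negative
       exactly when phi <= pi.  fmax() absorbs tiny negative arguments
       from rounding, but the cancellation still costs precision. */
    *y = (phi <= M_PI ? 1.0 : -1.0) * sqrt(fmax(0.0, r - *x * *x));
}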

In summary, one EXPECTS it to cost approximately (2*66 + 30 + 8 =) 170
nanoseconds per xy pair with accept/reject (on my Centrino laptop at a
throttled 800 MHz -- scale by clock and architecture and YMMV),
allowing a few nanoseconds for the bookkeeping and conditional.  One
EXPECTS it to cost (2*66 + 3*82 + 4 =) 382 nanoseconds per xy pair with
trig and sqrts, a bit more than twice as much.

>
>> In fact, your sqrt call is redundant
>
> Well, that's the deal: It isn't!

No, I meant the sqrt in the rejection conditional.  Replace it with e.g.

  if ((x*x + y*y) > 1.0) reject else accept

because it is much faster to check whether r^2 is greater than 1.0 than
it is to take the sqrt to see if r is greater than 1.0.  Note that in
accept/reject the Jacobian doesn't matter, and that r^2 > 1.0 if and
only if +sqrt(r^2) > 1.0.  Note that you save ~100 nsec PER PAIR by
making this one change -- significantly speeding up your loop.
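
Put together, a minimal sketch of the whole accept/reject loop;
gsl_ran_flat() and the rng calls are real GSL, the scaffolding around
them is mine:

#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(void)
{
    const long N = 1000;        /* points to keep */
    gsl_rng *rng = gsl_rng_alloc(gsl_rng_mt19937);
    double x, y;
    long i, tries = 0;

    for (i = 0; i < N; i++) {
        /* draw uniformly on [-1,1]^2 until the point lands in the
           unit disk; no sqrt in the test, since x*x + y*y > 1.0
           if and only if r > 1.0 */
        do {
            x = gsl_ran_flat(rng, -1.0, 1.0);
            y = gsl_ran_flat(rng, -1.0, 1.0);
            tries++;
        } while (x * x + y * y > 1.0);
        printf("%g %g\n", x, y);
    }
    fprintf(stderr, "acceptance %g (expect pi/4 ~ 0.785)\n",
            (double) N / tries);
    gsl_rng_free(rng);
    return 0;
}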

It really helps when optimizing code to have some idea what the
time-cost is of each commonly used operation in your code on your
architecture.  "Arithmetic" operations typically take order of 0.5-2
clock cycles -- order of a nanosecond or less -- except for (double
precision) division, which is over an order of magnitude slower on a 32
bit CPU.  On my laptop, for example, a raw multiply (or addition) costs
around 2.3 nsec, but a raw divide costs 35.8 nsec.  As we have seen,
transcendental calls cost order of 100 nsec.  The cost of GSL rand
calls varies considerably with algorithm -- I have written an entire
GSL rand timer just to be able to determine what that cost is.  E.g.
(again on my laptop) ranlxd2 is 875 nsec/rand vs 63 nsec/rand for
mt19937_1999 vs 62 nsec for the (much lower quality) ran3, vs 44 nsec
for random256-glibc2.
This variation is so profound that it is worthwhile to select the
fastest rng that meets your quality requirements.  As noted above, some
of the GSL generators are slow enough that they will dominate timings.
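
One convenient way to compare generators without recompiling is GSL's
environment hook; a minimal sketch:

#include <stdio.h>
#include <gsl/gsl_rng.h>

int main(void)
{
    /* gsl_rng_env_setup() reads GSL_RNG_TYPE (and GSL_RNG_SEED), so
       the same timing binary can be run as e.g.
         GSL_RNG_TYPE=ranlxd2 ./a.out
         GSL_RNG_TYPE=mt19937_1999 ./a.out */
    gsl_rng_env_setup();
    gsl_rng *rng = gsl_rng_alloc(gsl_rng_default);
    printf("timing generator: %s\n", gsl_rng_name(rng));
    /* ...wrap the gettimeofday() loop from above around
       gsl_rng_uniform(rng) here... */
    gsl_rng_free(rng);
    return 0;
}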

Note also that a lot of these rngs aren't reaching what one might
EXPECT in terms of optimized performance.  Given the
number and kind of operations involved in generating a rand (typically
3-5 arithmetical operations), one would really expect many of the rng's
to cost only ~10-20 nsec each, even on my relatively sluggish 800 MHz
CPU.

Part of this may be due to the fact that the rngs aren't (AFAIK)
internally vectorized and hence are incurring significant overhead just
going on and off of the CPU in the middle of your code. For optimal
performance I >>think<< that one would want to allocate a (say) 8K or
16K buffer per generic type (int or double) when initializing an rng,
fill the entire buffer with rands in a single loop, serve rands out of
this buffer until it empties, and then refill it "all at once" with
vectorized/optimized code.  The return for nearly all calls is thus just
a memory reference.  This minimizes all sorts of branching and keeps the
core code and variables on the CPU ALUs where a lot of it can be done in
a pipelined loop, possibly with SSE instructions, at less than a clock
cycle per arithmetical operation.  I've actually thought of trying to do
a performance hack of at least one of the GSL rngs in just this way to
see what the differential benefit is -- if it saved even 10 nsec each,
that would be a great boon to simulationists whose code is often
rate-limited by the generation of rands, and it might save a lot more
than that.
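
A sketch of what such a buffered wrapper might look like -- the struct
and refill logic are my own invention, NOT GSL API; the refill loop is
the part one would hand-vectorize:

#include <gsl/gsl_rng.h>

#define RNG_BUF_SIZE 8192   /* assumption: ~8K doubles per refill */

typedef struct {
    gsl_rng *rng;
    double buf[RNG_BUF_SIZE];
    int next;               /* index of the next unserved rand */
} buffered_rng;

static void buffered_rng_refill(buffered_rng *b)
{
    int i;
    /* this tight loop is the candidate for vectorized/optimized code */
    for (i = 0; i < RNG_BUF_SIZE; i++)
        b->buf[i] = gsl_rng_uniform(b->rng);
    b->next = 0;
}

static void buffered_rng_init(buffered_rng *b, const gsl_rng_type *T)
{
    b->rng = gsl_rng_alloc(T);
    buffered_rng_refill(b);
}

/* the common case is just a memory reference plus an increment */
static double buffered_rng_get(buffered_rng *b)
{
    if (b->next == RNG_BUF_SIZE)
        buffered_rng_refill(b);
    return b->buf[b->next++];
}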

   rgb

>
> Without the sqrt, the functions are peaked at the center. With the
> sqrt they are uniform.
>
> I guess I am going with the reject anyway... Thanks for everybody's
> suggestions and help.
>
> Greetings,
> Jochen
> --
> Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
>     Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
>         (Part 3 you find in my messages before fall 2003.)