CDF diff

Jason Hooper Stover jason@sakla.net
Thu Jan 16 14:05:00 GMT 2003


On Wed, Jan 15, 2003 at 09:34:55PM -0500, Martin Jansche wrote:
> On Tue, 14 Jan 2003, Jason Hooper Stover wrote:
> 
> > I hadn't planned to compute Pr(a<Z<b) directly since such
> > a function isn't a "cumulative distribution function"
> > according to the common definition. Also, users can
> > compute Pr(a<Z<b) as Pr(Z<b)-Pr(Z<a).
> 
> True, but what if I need Pr(a<Z<b) for a discrete distribution where
> Pr(Z<b) has to be computed via an explicit summation (or does that
> never happen?) and a>>0?  OK, I guess this could be pointed out in
> documentation and one could then do the summation in user code.

Okay, now I see what you mean. I'd add code to handle this situation,
but only after all the usual cdf's (computed as Pr(Z<t) or Pr(Z>t))
have been added to the cdf module.

I read about your design below. I agree with Brian Gough and
would like to stick with basic types for the module. Your design
looks good to me for some purposes, but I can think of routines
someone might write using cdf that your design might hinder.
Since I don't know exactly how people will use cdf, I can't
favor one user's design if it might hinder another's use
of the module. The cost is the use of basic types everywhere,
which might not be too elegant, but it does keep the library
general and easier to learn to use.

The cdf routines aren't in the release version
yet anyway, and after they are, if users make it clear that
such a design beats one which uses only basic types, I (or whoever
maintains it) can change the code. Also, your design does
require users to learn a different type for each distribution, and
I have a hunch they won't want to learn any more than they have to
in order to use the module.

-Jason

> > effort for a library like GSL. E.g., such an approach might make
> > sense for a higher-level program that runs statistical routines
> > for the exponential family.  But I could be wrong; maybe it would
> > make sense in GSL.  If you have some coded examples of this kind
> > of thing, send them along.
> 
> For example, you could define a type
> 
> typedef struct {
>   double shape;
>   double scale;
>   double pdf_lognorm;
>   double pdf_x0;
> } gsl_ran_gamma_t;
> 
> with a corresponding constructor
> 
> gsl_ran_gamma_t *
> gsl_ran_gamma_alloc (const double shape,
>                      const double inverse_scale)
> {
>   gsl_ran_gamma_t *p;
>   double scale;
> 
>   if (shape <= 0 || inverse_scale <= 0)
>     GSL_ERROR_VAL ("requires shape > 0, inverse_scale > 0",
>                    GSL_EDOM, NULL);
> 
>   p = malloc (sizeof (gsl_ran_gamma_t));
>   if (p == NULL)
>     GSL_ERROR_VAL ("failed to allocate space for ran_gamma struct",
>                    GSL_ENOMEM, NULL);
> 
>   scale = 1 / inverse_scale;
>   p->shape = shape;
>   p->scale = scale;
> 
>   if (shape == 1)
>     {
>       p->pdf_lognorm = log (scale);
>       p->pdf_x0 = scale;
>     }
>   else
>     {
>       p->pdf_lognorm = shape * log (scale) - gsl_sf_lngamma (shape);
>       p->pdf_x0 = 0;
>     }
>   return p;
> }
> 
> Then you can define gamma_pdf and gamma_cdf like this:
> 
> double
> gsl_ran_gamma_pdf (const double x, gsl_ran_gamma_t *params)
> {
>   if (x < 0)
>     {
>       return 0;
>     }
>   else if (x == 0)
>     {
>       return params->pdf_x0;
>     }
>   else
>     {
>       double a = params->shape - 1;
>       double logp = params->pdf_lognorm;
>       logp -= params->scale * x;
>       if (a != 0)
>         logp += a * log (x);
>       return exp (logp);
>     }
> }
> 
> double
> gsl_ran_gamma_cdf (const double x, gsl_ran_gamma_t *params)
> {
>   if (x <= 0)
>     {
>       return 0;
>     }
>   else
>     {
>       double y = x * params->scale;
>       if (params->shape == 1)
>         return fabs (gsl_expm1 (-y));
>       else
>         return gsl_sf_gamma_inc_P (params->shape, y);
>     }
> }
> 
> This has the disadvantage of introducing yet another type and
> allocator for each family, and so of course it breaks compatibility.
> But it has several advantages *if you compute many values for the same
> distribution with fixed parameters* (or if you repeatedly sample from
> the same distribution), which is not too uncommon IMO:
> 
>   First, domain checks on the parameters have to be done exactly once
>   in the _alloc() function.  This is an easy way to avoid code
>   duplication, since the same checks would have to be carried out for
>   the pdf, the cdf, and any other functionality that may be added in
>   the future (inverse cdf, mode, etc.).
> 
>   Second, computing the normalization term is typically an expensive
>   operation (e.g. binomial likelihood is easy to compute, but the
>   corresponding Beta density is more costly), but does not depend on
>   the value of x; in general, all values that are independent of x
>   can be precomputed in the _alloc() function and stored in the
>   parameter struct.  In the code above, lngamma(shape) is evaluated
>   exactly once.
> 
>   Third, the functions now have a form in which they could be used
>   directly in conjunction with gsl_function(_fdf), e.g. if a user
>   wanted to use a one-dimensional root finding algorithm to compute
>   inverse cdfs.
> 
>   Fourth, this format would make it easy to incorporate distributions
>   that are simple reparametrizations of existing distributions.  For
>   example, one could now have a function
> 
>     gsl_ran_gamma_t *gsl_ran_chisq_alloc(const double df);
> 
>   and simply not implement gsl_ran_chisq_pdf separately from
>   gsl_ran_gamma_pdf.  (Or if this should be hidden from the user, it'd
>   be easy to generate the illusion of separate functionality with
>   #defines).
> 
> It would even be possible to do this in a somewhat object-oriented way
> and include function pointers in the struct so that one would get a
> gamma density via:
> 
>   gsl_ran_gamma_t *g = gsl_ran_gamma_alloc (1, 1);
>   double p = g->pdf (2.0, g);  /* presumably hidden behind a macro */
> 
> Notice that gsl_ran_gamma_alloc could have quietly initialized the pdf
> field with a pointer to a function that computes an exponential
> density (because shape == 1).
> 
> One potential counterargument against this design might be that it
> makes it more difficult for the user to implement likelihood based
> computations, since the parameters of the distribution would
> presumably change frequently.  I wouldn't consider this a valid
> objection though, since for efficiently computing likelihood for many
> samples the code has to be structured very differently anyway.
> 
> - martin


