CDF diff

Martin Jansche jansche@ling.ohio-state.edu
Thu Jan 16 02:35:00 GMT 2003


On Tue, 14 Jan 2003, Jason Hooper Stover wrote:

> I hadn't planned to compute Pr(a<Z<b) directly since such
> a function isn't a "cumulative distribution function"
> according to the common definition. Also, users can
> compute Pr(a<Z<b) as Pr(Z<b)-Pr(Z<a).

True, but what if I need Pr(a<Z<b) for a discrete distribution where
Pr(Z<b) has to be computed via an explicit summation (or does that
never happen?) and a>>0?  OK, I guess this could be pointed out in
documentation and one could then do the summation in user code.

> effort for a library like GSL. E.g., such an approach might make
> sense for a higher-level program that runs statistical routines
> for the exponential family.  But I could be wrong; maybe it would
> make sense in GSL.  If you have some coded examples of this kind
> of thing, send them along.

For example, you could define a type

typedef struct {
  double shape;
  double scale;
  double pdf_lognorm;
  double pdf_x0;
} gsl_ran_gamma_t;

with a corresponding constructor

gsl_ran_gamma_t *
gsl_ran_gamma_alloc (const double shape,
                     const double inverse_scale)
{
  gsl_ran_gamma_t *p;
  double scale;

  if (shape <= 0 || inverse_scale <= 0)
    GSL_ERROR_VAL ("requires shape > 0, inverse_scale > 0",
                   GSL_EDOM, NULL);

  p = malloc (sizeof (gsl_ran_gamma_t));
  if (p == NULL)
    GSL_ERROR_VAL ("failed to allocate space for ran_gamma struct",
                   GSL_ENOMEM, NULL);

  scale = 1 / inverse_scale;
  p->shape = shape;
  p->scale = scale;

  if (shape == 1)
    {
      p->pdf_lognorm = log (scale);
      p->pdf_x0 = scale;
    }
  else
    {
      p->pdf_lognorm = shape * log (scale) - gsl_sf_lngamma (shape);
      p->pdf_x0 = 0;
    }
  return p;
}

Then you can define gamma_pdf and gamma_cdf like this:

double
gsl_ran_gamma_pdf (const double x, gsl_ran_gamma_t *params)
{
  if (x < 0)
    {
      return 0;
    }
  else if (x == 0)
    {
      return params->pdf_x0;
    }
  else
    {
      double a = params->shape - 1;
      double logp = params->pdf_lognorm;
      logp -= params->scale * x;
      if (a != 0)
        logp += a * log (x);
      return exp (logp);
    }
}

double
gsl_ran_gamma_cdf (const double x, gsl_ran_gamma_t *params)
{
  if (x <= 0)
    {
      return 0;
    }
  else
    {
      double y = x * params->scale;
      if (params->shape == 1)
        return fabs (gsl_expm1 (-y));
      else
        return gsl_sf_gamma_inc_P (params->shape, y);
    }
}

This has the disadvantage of introducing yet another type and
allocator for each family, and so of course it breaks compatibility.
But it has several advantages *if you compute many values for the same
distribution with fixed parameters* (or if you repeatedly sample from
the same distribution), which is not too uncommon IMO:

  First, domain checks on the parameters have to be done exactly once
  in the _alloc() function.  This is an easy way to avoid code
  duplication, since the same checks would have to be carried out for
  the pdf, the cdf, and any other functionality that may be added in
  the future (inverse cdf, mode, etc.).

  Second, computing the normalization term is typically an expensive
  operation (e.g. binomial likelihood is easy to compute, but the
  corresponding Beta density is more costly), but does not depend on
  the value of x; in general, all values that are independent of x
  can be precomputed in the _alloc() function and stored in the
  parameter struct.  In the code above, lngamma(shape) is evaluated
  exactly once.

  Third, the functions now have a form in which they could be used
  directly in conjunction with gsl_function(_fdf), e.g. if a user
  wanted to use a one-dimensional root finding algorithm to compute
  inverse cdfs.

  Fourth, this format would make it easy to incorporate distributions
  that are simple reparametrizations of existing distributions.  For
  example, one could now have a function

    gsl_ran_gamma_t *gsl_ran_chisq_alloc(const double df);

  and simply not implement gsl_ran_chisq_pdf separately from
  gsl_ran_gamma_pdf.  (Or if this should be hidden from the user, it'd
  be easy to generate the illusion of separate functionality with
  #defines).

It would even be possible to do this in a somewhat object-oriented way
and include function pointers into the struct so that one would get a
a gamma density via:

  gsl_ran_gamma_t *g = gsl_ran_gamma_alloc (1, 1);
  double x = g->pdf (x, g);  /* presumably hidden behind a macro */

Notice that gsl_ran_gamma_alloc could have quietly initialized the pdf
field with a pointer to a function that computes an exponential
density (because shape == 1).

One potentail counterargument against this design might be that it
makes it more difficult for the user to implement likelihood based
computations, since the parameters of the distribution would
presumably change frequently.  I wouldn't consider this a valid
objection though, since for efficiently computing likelihood for many
samples the code has to be structured very differently anyway.

- martin



More information about the Gsl-discuss mailing list