This is the mail archive of the gsl-discuss@sources.redhat.com mailing list for the GSL project.



Re: Multivariate minimization


On Tue, 26 Aug 2003, Fabrice Rossi wrote:

> Hi Robert.
> 
> In many applications, a local minimum is good enough.  This is the
> case for instance with neural networks.  It is not uncommon to use
> neural networks with more than 1000 numerical parameters and still
> use vector BFGS.  You're perfectly right to say that GLOBAL
> optimization is a major problem (and in fact no free lunch theorems
> show that even GA cannot do anything about that), but this is not
> always what we want to do.

I'd have to disagree, if by that you mean "just" use a gradient search
of any sort.  Or rather, whether you can or not depends strictly on your
prior knowledge of the complexity properties of the multivariate
objective function and its information-theoretic content.  In the
general case where the objective function is unknown and impossible to
functionally analyze but can still be evaluated by some procedure (e.g.
from training data and a set of trial weights in the case of a neural
network), without doing SOME sort of global search one cannot make any
statement at all about the RELATIVE quality of a minimum obtained from a
single gradient descent.

This is easily illustrated with a toy problem.  Consider the "divide-by"
problem.  Take a binary representation of integers with (say) N = 16
bits and create a training set from a fraction of the integers that are
and aren't divisible by (say) m = 7 -- for instance, half the integers
divisible by 7 and an equal number of numbers that aren't.  It is
possible to train a classification network on such a set that correctly
classifies 99% of the integers divisible by 7 (getting a mere handful
wrong) in the entire set of 16 bit ints, which is very cool in and of
itself.  However, you aren't going to find the nets that "solve" this
highly nonlinear function with a gradient descent from a single random
set of starting weights.  It requires quite a lot of intelligent
exploration to find starting networks that, when improved by gradient
descent, achieve the nearly perfect resolution of the problem.

This is interesting because in this case, you know precisely the
information content of the inputs: all sixteen bits.  One cannot
generally tell if a binary number is divisible by seven until all of its
digits are known; all the intermediate weights from the 16 inputs have
to work together in a single, fully correlated pattern to produce the
correct result.  There is no obvious decomposition of the problem into
simpler units where the network can do less global work and trivially
combine the results at the end.  

This solution lives in and is smoothly connected to only a tiny fraction
of the available phase volume.  Even starting off with a GA I've found
that it STILL takes multiple runs and a better-than-average GA to have a
good chance of finding one of the "solution" networks.  So forget 1000's
of variables -- gradient descent alone won't generally work even for an
easily understandable toy problem with only 16 inputs and a few tens of
hidden layer weight variables when the problem has no decompositions of
lower dimensionality.

Compare this to the divide by >>m = 2<< problem in binary (with an
obvious single-bit decomposition) where a gradient descent from nearly
anywhere will find a really perfect solution. This illustrates fairly
directly how decompositions that in general are not a priori known can
have a huge influence on whether a given methodology works at all and is
robust even for a simple "divide by" problem.

A third example from this toy problem illustrates how the methodology
used can easily de facto projectively reduce the problem's
dimensionality -- if one trains on division by 14, there exists a
trivial decomposition that immediately ELIMINATES all odd numbers
(improving classification by close to a factor of 2 compared to random
chance without trying hard at all).  Gradient descent will find this
part of the solution from nearly anywhere, and will likely obtain
something that is nearly noise on the issue of whether the remaining
even numbers are ALSO divisible by seven (it is still solving the
division by seven problem, but only on the 15 remaining bits).  One thus
sees a significant improvement in classification ability relative to
random chance, but one is (actually) rather far away from solving the
problem, although it might take you thousands of runs to learn that.

This last example shows very clearly what a neural network trained on a
general problem using only a single gradient descent is almost certainly
doing.  It is finding a problem decomposition of much lower
dimensionality that yields an "easy benefit" by ignoring all but a few
degrees of freedom -- going quickly downhill where the going is easy,
then twisting around in the volume of phase space doing a local
optimization of the nearly irrelevant remaining variables.

To put it another way, only when you already know a great deal about the
information content, complexity and decompositions of the PARTICULAR
nonlinear problem from prior explorations do you get enough information
about the distribution of "distinct" valley minima in the nonlinear
function itself to say whether the minimum obtained from a single
trial is likely to be any good.  You cannot get this information in the
most general case without "wasting" computational energy exploring
global solutions before expending the rest to locally optimize.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu




