Robert G. Brown
Wed Mar 29 01:20:00 GMT 2006

On Tue, 28 Mar 2006, C J Kenneth Tan -- OptimaNumerics wrote:

> As I have mentioned above, the BLAS layer is kept the same in each of the
> benchmarks, therefore the benchmarks show the differences in LAPACK
> performance.

Well, clearly that isn't NECESSARILY true.  Especially not if the BLAS
layer is doing 80% or 90% of the actual CPU-driven work for a given
call, as it is for many calls -- which is how one obtains the
aforementioned factor of 2-3 speedup that many people have reported for
ATLAS-tuned BLAS based LAPACK.  If BLAS variation alone can drop timings
to 30% of their worst-case values, then reducing the non-BLAS fraction
of the timing to ZERO still buys you at best the remaining 10-20% of
runtime spent outside the BLAS, if you understand what I'm saying.
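
The bound being invoked here is just Amdahl's law; a minimal sketch of
the arithmetic (the 80-90% BLAS fractions are the illustrative figures
above, not measurements):

```c
/* If the BLAS layer accounts for blas_frac of the runtime and is held
 * fixed, then even reducing all non-BLAS time to ZERO caps the overall
 * speedup at 1/blas_frac (Amdahl's law). */
static double max_speedup(double blas_frac) {
    return 1.0 / blas_frac;
}
```

With BLAS at 80% of the runtime this caps a perfect LAPACK rewrite at
1.25x; at 90%, about 1.11x -- nowhere near the factor of 2-3 available
from changing the BLAS itself.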

If you disagree with this, well it only makes the point more strongly --
it means that yes, LAPACK structure and BLAS structure ARE strongly
coupled so that testing LAPACK variations "holding BLAS constant" is
pointless, if you're moving the LAPACK in question into tune with the
BLAS in question and both have plenty of room for improvement.

Not that I'm disagreeing with the idea that different LAPACKs have
different performance or that yours is probably better -- only with the
idea that you can separate out LAPACK-specific performance issues from
the compiler, BLAS layer, linker and application/benchmark program by
"just" keeping some or all of them the same while varying the LAPACK
code.  Programming is way too nonlinear and coupled for that to be a
clean separation.

I'd also be very suspicious of 1000% effects ANYWHERE in real-world
programming (and yes, very especially in LAPACK with a common BLAS
layer:-) -- that smacks of superlinear speedup or something that is true
but useless for 90% of all programmers -- or some truly terrible code,
ALGORITHMICALLY terrible and not just terrible from the point of view of
hand tuning some loops.

Not that I don't wish it were true and possible...I just never have
found it to be so except when fixing really boneheaded code... nothing
commercial grade, nothing competent programmer grade.  Although sure,
some LAPACKs may be very poor indeed here and effectively flog the cache
with their strides, just as non-ATLAS BLAS can easily do.

>> Or if you like, a longer term question that is quite worth bringing up
>> is to what extent it is desirable to introduce architecture specific
>> tuning into actual GSL code, period.
> Is this considered scalable from a development perspective?

I'd guess not, mostly.  But it's really up to the developers, right?

>> On the one hand, it makes the code ugly and extremely difficult to
>> maintain and offers small marginal performance gains in most places BUT
>> linear algebra.  It tends to depend heavily on compiler as well -- if
>> you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
>> your compiler has to support them, not just your CPU.  It makes the code
>> in a sense less portable.  It is "expensive" in human time, and developer
>> time in the GSL is probably largely donated and a precious resource.
>> I'd rather give up 20-30% in "potential" performance and use the GSL for
>> free, personally, blessing the developers and their contribution to
>> world civilization as I do so. (Thanks, guys!:-)
>> On the other hand, for some applications tuning CAN be a big, big
>> performance booster -- 2-3x -- and hence very valuable.  I'm working
>> (depending on whether and how much support I get) on a portable/open way
>> of writing at least moderately autotuning HPC code.  Something that will
>> get "most" of the benefit of a full ATLAS-like autotuning build without
>> architecture-specific instrumentation.  Don't hold your breath, but in a
>> year or three maybe this problem can be at least attacked without
>> playing the library juggling game.
> Is this approach scalable when it comes to highly complex scientific
> code, as opposed to the comparatively more straightforward BLAS level
> code?

You're addressing research-level computer science questions, really.
From what I've seen, the GSL is already pretty decent in its algorithm
implementations as far as efficiency is concerned.  It isn't necessarily
"optimized", but it is good, competent code that won't run badly on
pretty much anything.  The ALGORITHMS are good, and are well
implemented.

To do better, an algorithm needs to be architecture aware to an extent
that few programs are, or really can be.  It needs portable "easy
access" to certain critical performance-determining metrics at runtime
initialization, so it can set e.g. stride sizes and perhaps change
algorithms depending on the measured details of the CPU/memory/IO bus
collective -- viewed as a set of sizes, latencies, bandwidths and so on.
ATLAS does what amounts to an empirical search of a multiparameter space
at build time, but in principle that space could be pre-searched --
mapped out in general terms -- and a smaller set of predictors
determined for setting parameters "close to" their optima.  You might
not get 100% of the possible speedup, but I'm guessing that you can get
80-90%, and from a conversation I had with Jack Dongarra it sounds like
this is what at least one researcher addressing the question is finding

Note well that ATLAS/BLAS is close to the limit of algorithmical
complexity here -- most algorithms one might wish to tune are likely to
be simpler, many of them setting a single parameter for stride or the
blocking of a vector based on e.g. a measured or automatically
determined knowledge of the L1/L2 cache sizes and speeds relative to
main memory.  An example of one way GSL might be retuned for
optimization using information of this sort is in random number
generation.

Many RNGs use a very simple algorithm to actually generate each
successive number -- some additions, multiplications, moduli, done.
Many RNG users use LOTS AND LOTS of RNGs in a computation.  However, the
RNG interface in the GSL tends to generate them one at a time with a
subroutine/library call layer in between.  It might be possible, for
example, for the library to at least have a parameter that one could set
at initialization time that tells it to create a VECTOR BUFFER of rands
of some suitable length -- say a page, or several pages -- and to go
ahead and fill it on the initial call using SSE-optimized vector code.
Then simply return each successive number on demand and increment a
static pointer in the vector until it runs out and then do it again.
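
A hedged sketch of that buffered scheme -- the generator here is a
plain 64-bit LCG stand-in (Knuth's MMIX constants), NOT any particular
GSL rng, and the buffer length and names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define RNG_BUF_LEN 4096                /* a few pages of 64-bit values */

static uint64_t rng_state = 88172645463325252ULL;
static uint64_t rng_buf[RNG_BUF_LEN];
static size_t   rng_pos = RNG_BUF_LEN;  /* "empty": refill on first call */

/* One step of a 64-bit linear congruential generator. */
static uint64_t lcg_step(void) {
    rng_state = rng_state * 6364136223846793005ULL
              + 1442695040888963407ULL;
    return rng_state;
}

/* Fill the whole buffer in one tight pass; this loop is what an
 * SSE-style vectorized fill would replace. */
static void rng_refill(void) {
    for (size_t i = 0; i < RNG_BUF_LEN; i++)
        rng_buf[i] = lcg_step();
    rng_pos = 0;
}

/* One number per call; the per-call work is a bounds check plus a
 * cursor increment, with the fill cost amortized over RNG_BUF_LEN. */
static uint64_t rng_next(void) {
    if (rng_pos == RNG_BUF_LEN)
        rng_refill();
    return rng_buf[rng_pos++];
}
```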

This would likely never run more SLOWLY than single-number-per-call
code, per RNG returned, although it makes the timing of those calls
"odd".  For a MC person like myself, I couldn't care less as long as
that average time per call goes down -- I use numbers like 10^18 of them
in a long-running computation and it's my life that's a-wasting in the
meantime.

What's needed to support this sort of thing?  First, some evidence that
there is marginal benefit worth the effort in coding.  Second, you
probably wouldn't do it anywhere but the "bread and butter" RNGs -- the
ones that are the best, fastest, pass diehard, are the default or
equally good alternatives.  Why speed up any of the RNGs that are
obviously flawed and there only so people can compute and compare the
flaws?  Third, some idea that the optimization you can do is PORTABLE and
will actually improve performance on e.g. SPARC, i386, X86_64, etc with
at most small parametric tunings that can be done at runtime based on
the aforementioned metrics.

That might not be so bad or impossible to maintain, and if there were
good software support for obtaining the right metrics, they might prove
useful to ordinary numerical programmers in normal code instances --
LOTS of people do loops filled with arithmetic, few people have the
tools or energy or knowledge to be able to determine how to block out
those loops to run fast.
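
As a concrete instance of the kind of loop blocking meant here, a sketch
(the tile size is a hard-coded placeholder where it would ideally come
from those runtime metrics):

```c
#include <stddef.h>

#define TILE 32  /* placeholder; ideally derived from measured cache size */

/* Transpose an n x n row-major matrix in TILE x TILE tiles so that both
 * the read and the write streams stay cache-resident, instead of one of
 * them striding through memory a full row apart on every access. */
static void transpose_blocked(const double *a, double *b, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```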

>> In the meantime, hey, the GSL is open source, full GPL, viral.  That
>> means that as long as the LAPACK API is preserved -- or a wrapper of the
>> standard lapack calls provided -- you can ALWAYS hack and link your own
>> LAPACK in.  The effort in any event is WAY less than the effort required
>> to actually build an architecture-dependent LAPACK in the first place,
>> and you can freely market a "tuned" version of the GSL as long as you
>> provide its sources, and it seems to me that in the case of lapack this
>> is at MOST developing a wrapper.  I've done similar things already for
>> adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
>> something?
> Yes, we agree on this.  This is what I think is a good option.  It is
> crucial that the LAPACK API is preserved.

Interesting discussion:-)

A lot of it is similar to parallel performance discussions on the
beowulf list.  There are easy ways to parallelize some things, but if
you really want to get serious you may need to e.g. get Ian Foster's
book (or one of the other good parallel programming books, e.g. Amalfi)
and learn about parallel algorithms.  Then you find that yeah, your
"obvious" algorithm gave you terrible speedup for reasons that in
retrospect are equally obvious.

Performance tuning in general is SO nonlinear and SO tightly coupled to
architecture, compiler choice, and library set.  Then there is algorithmic
robustness to consider (don't want to get fast, incorrect answers as
legend has it one can easily do with Cell processors if you aren't
careful).  Most real numerical code I personally have written has been a
compromise, with efficiency subjugated to portability, long term
maintainability, and not getting wrong answers to the extent I can
control it with numerical error to contend with.  I think that these are
sensible design criteria for the GSL as well, where of course one would
hope to use efficient algorithms wherever possible within these
constraints.


Robert G. Brown	             
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525
