This is the mail archive of the gsl-discuss@sourceware.org mailing list for the GSL project.


Re: array math.h (SSE2?)


> James Bergstra wrote:
> > Does anyone know of a library, or source file somewhere in which functions from
> > math.h such as exp() and log() are implemented using SSE2?  AMD's libacml_mv
> > demonstrates that such an implementation can be faster than gcc's, and
> > furthermore that multiple(2) functions can be computed in parallel on a single
> > processor. (Making 2-3 fold speed improvements on array calculations).
> > 
> > Or, taking a step back, is there a project analogous to ATLAS or FFTW that
> > provides fast array computations of these functions on various architectures?
> > 
On Sat, Mar 25, 2006 at 02:41:56PM +0000, John D Lamb wrote:
> It wouldn't be too hard to write some of these. As an example,
> #define SQRT( x, result ) ({ \
>       __asm__( "fsqrt\n\t" \
> 	       : "=t"( result ) \
> 	       : "0"( x ) ); })
> is about 2-3 times as fast as std::sqrt and (I think) gives identical
> results for doubles other than std::numeric_limits<double>::infinity() (which is
> easily fixed).
Thank you for your suggestion.  The catch I see when extending this to exp()
(which is the function I want most!) is that the scalar x87 instructions handle
one number at a time, and that becomes the bottleneck.
Do you know if anyone has implemented anything like, for example, the method
described in:

"Evaluation of Elementary Functions using Multimedia Features", Parallel and
Distributed Processing Symposium, 2004?

> However, two issues:
> 1. GSL is pretty well optimised if you use the appropriate flags for
> pentium 4 SSE2: -march=pentium4 -malign-double -mfpmath=sse -msse -msse2
Good point; I'm not sure I did that.  If not, it was a big oversight!

> 2. Since gsl allows vector and matrix views, the doubles (assuming
> you're using doubles) you might want to add are not necessarily stored
> in contiguous memory, which limits the use of movapd and so limits the
> value of addsd, subsd and mulsd to speed up calculations.
True, but at the same time, the matrix format *does* guarantee contiguous rows,
and I think matrices and vectors are contiguous often enough to warrant
attention. (And I found movntpd the most helpful instruction of all... does GCC
ever generate it?)

> Of course, there's nothing to stop you writing your own functions that
> operate on gsl_blocks and give you much faster arithmetic.
That's basically what I'm doing... although I thought putting gsl_block* in the
function declaration would be inconvenient, so I just use
fn (size_t dim, double *data)

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

