This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Split mantissa calculation loop and add branch prediction to mp multiplication


On 01/03/2013 05:54 PM, Steven Munroe wrote:
On Thu, 2013-01-03 at 17:19 +0100, Andreas Jaeger wrote:
On 01/03/2013 05:18 PM, Steven Munroe wrote:
On Thu, 2013-01-03 at 09:08 +0530, Siddhesh Poyarekar wrote:
On Wed, Jan 02, 2013 at 02:20:13PM -0600, Steven Munroe wrote:
I do not understand what you are doing here. If the intent is to replace
the X[], Y[], Z[] doubles with ints, you will get overflows in Z[]. If
you are changing X[], Y[], Z[] to uint64_t, then you avoid the
overflows, but (Z[k] + CUTTER) - CUTTER has no effect and you have not
saved any space. Also, u is still a double, so you are adding some
expensive int->float->int conversions to the inner loop.
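
To illustrate the point about (Z[k] + CUTTER) - CUTTER, here is a minimal sketch. It assumes digits in base 2^24 and CUTTER = 2^76; neither value is stated in this thread, and the function names and the integer constant are hypothetical. In double arithmetic the add/subtract rounds the value to a multiple of 2^24, because doubles near 2^76 are spaced 2^24 apart. In unsigned 64-bit arithmetic the same expression simply returns its input.

```c
#include <stdint.h>

/* Assumption: CUTTER is 2^76, so one ulp of a double near CUTTER is 2^24.  */
#define CUTTER 0x1p76

/* Double version: adding CUTTER rounds away everything below 2^24, and
   subtracting it again leaves the nearest multiple of 2^24.  */
double
cut_double (double z)
{
  /* volatile forces the sum to be rounded to double before the subtract.  */
  volatile double t = z + CUTTER;
  return t - CUTTER;
}

/* Integer version: unsigned arithmetic is modular, so adding and then
   subtracting any constant gives back z unchanged.  The constant here is
   an arbitrary stand-in.  */
uint64_t
cut_uint64 (uint64_t z)
{
  const uint64_t cutter = UINT64_C (1) << 60;
  return (z + cutter) - cutter;
}
```

With these definitions, cut_double (16777221.0) gives 16777216.0 (2^24 + 5 rounded down to 2^24), while cut_uint64 (16777221) gives 16777221 back.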

In this patch I don't convert the mantissa to int; I leave everything as is. I posted the patch that does the conversion earlier. It has not been commented on yet, and it is the one you should be looking at. This patch has a different purpose:

http://sourceware.org/ml/libc-alpha/2012-12/msg00354.html

None of the problems you're claiming will exist because:

(1) The product is computed and stored in 64-bit

(2) u does not exist since it is replaced by a much simpler operation,
      which results in that snippet looking like this:

      int64_t tmp = Z[k];
      for (i=i1,j=i2-1; i<i2; i++,j--)
        tmp += (int64_t) X[i]*Y[j];

      Z[k]  = (int) (tmp % (1 << 24));
      Z[--k] = (int) (tmp / (1 << 24));
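
For readers who want to try the snippet above in isolation, the following is a self-contained sketch of the same step. The function name mul_column, the RADIX macro, and the argument layout are hypothetical and not taken from the patch; the sketch also assumes non-negative digits below 2^24.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mantissa digits are stored in base 2^24.  */
#define RADIX ((int64_t) 1 << 24)

/* Add the partial products x[i] * y[j] (i rising from i1 to i2 - 1,
   j falling from i2 - 1) to the value already in z[k], then split the
   64-bit sum into a low digit z[k] and a carry stored in z[k - 1].  */
void
mul_column (const int *x, const int *y, int *z, int k, int i1, int i2)
{
  assert (k >= 1 && i1 <= i2);

  int64_t tmp = z[k];
  for (int i = i1, j = i2 - 1; i < i2; i++, j--)
    tmp += (int64_t) x[i] * y[j];

  /* With non-negative digits the sum is non-negative, so % and / behave
     as a plain digit/carry split.  */
  assert (tmp >= 0);
  z[k] = (int) (tmp % RADIX);
  z[k - 1] = (int) (tmp / RADIX);
}
```

With x = y = {0, 0xFFFFFF, 0xFFFFFF}, z zeroed, k = 2, i1 = 1 and i2 = 3, the sum is 2 * (2^24 - 1)^2, which splits into z[2] = 2 and z[1] = 33554428.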

This is very bad for POWER. PowerPC has (multiple) independent fixed
point and floating point pipelines. This allows super-scalar out-of-order
execution, UNTIL you force a transfer (through memory) between the
FPRs/GPRs. PowerPC has lots of registers (32+32+32), we expect the
compiler to keep lots of data in the registers, and so we don't optimize
the hardware for dependent load after store, we optimize for memory
bandwidth.

Your proposed code forces an (unnecessary) double->long conversion and
FPR to GPR transfer into the inner loop, disabling any super-scalar
parallel execution. It also prevents loop unrolling and does not allow
GCC to make good use of all those registers we provide in the
architecture.

So your code is optimized for (register poor, in-order-execution) X86 at
the expense of PowerPC.


Steve, could you run the test program that Siddhesh has mentioned and show
the numbers with and without the patch, please? I'd like to see the
actual numbers.


Actually I think it is up to Siddhesh to prove that his code does not negatively impact other platforms.

But if somebody does not have access to PowerPC, he should be able to ask the PowerPC maintainers for testing ;)


Andreas
--
 Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
  SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
   GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
    GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126

