This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication

From: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
To: Siddhesh Poyarekar <siddhesh at redhat dot com>
Cc: munroesj at us dot ibm dot com, libc-alpha at sourceware dot org
Date: Thu, 03 Jan 2013 11:09:25 -0600
Subject: Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication
References: <20121231092850.GA21621@spoyarek.pnq.redhat.com> <1357158013.19573.64.camel@spokane1.rchland.ibm.com> <20130103033814.GA5345@spoyarek.pnq.redhat.com> <1357229888.19573.84.camel@spokane1.rchland.ibm.com> <20130103164455.GA24416@spoyarek.pnq.redhat.com>
Reply-to: munroesj at us dot ibm dot com

On Thu, 2013-01-03 at 22:14 +0530, Siddhesh Poyarekar wrote:
> On Thu, Jan 03, 2013 at 10:18:08AM -0600, Steven Munroe wrote:
> > This is very bad for POWER. PowerPC has (multiple) independent fixed
> > point and floating point pipelines. This allows super-scalar out-of-order
> > execution, UNTIL you force a transfer (through memory) between the
> > FPRs/GPRs. PowerPC has lots of registers (32+32+32), we expect the
> > compiler to keep lots of data in the registers, and so we don't optimize
> > the hardware for dependent load after store, we optimize for memory
> > bandwidth.
> > 
> > Your proposed code forces an (unnecessary) double->long conversion and
> > FPR to GPR transfer into the inner loop, disabling any super-scalar
> > parallel execution. It also prevents loop unrolling and does not allow
> > GCC to make good use of all those registers we provide in the
> > architecture.
> > 
> > So your code is optimized for (register poor, in-order-execution) X86 at
> > the expense of PowerPC.
> > 
> 
> I'm confused, which patch are you talking about, the current loop
> split patch or the conversion of mantissa to int or some other patch?
> I'll summarize the patches that are currently under review:
> 
I was referring to this code from your note:

> (2) u does not exist since it is replaced by a much simpler operation,
>     which results in that snippet looking like this:
> 
>     int64_t tmp = Z[k];
>     for (i=i1,j=i2-1; i<i2; i++,j--)
>       tmp += (int64_t) X[i]*Y[j];
> 
>     Z[k]  = (int) (tmp % (1 << 24));
>     Z[--k] = (int) (tmp / (1 << 24));

I do performance analysis and tuning for a living, and this is an obvious
problem. oprofile will show this as a hot spot.
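
For reference, the quoted snippet can be written as a self-contained
function. This is only a sketch: the name mul_column, the RADIX macros,
and the choice of int arrays holding base-2^24 digits are illustrative
assumptions, since the excerpt does not show how X, Y and Z are declared.
If the digits were instead stored as double, the (int64_t) cast would
become a floating-point to integer conversion on every iteration, which
is the cost objected to above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed digit radix, taken from the (1 << 24) in the quoted snippet. */
#define RADIX_BITS 24
#define RADIX (INT64_C(1) << RADIX_BITS)

/* Hypothetical helper, not glibc code: accumulate one column of the
   product into a 64-bit integer, then split the sum into a digit
   (stored in z[k]) and a carry (stored in z[k - 1]).
   Assumes x, y and z hold non-negative base-2^24 digits as int.  */
static void
mul_column (const int *x, const int *y, int *z, int k, int i1, int i2)
{
  assert (k >= 1 && i1 <= i2);

  int64_t tmp = z[k];

  /* i walks up from i1 while j walks down from i2 - 1.  */
  for (int i = i1, j = i2 - 1; i < i2; i++, j--)
    tmp += (int64_t) x[i] * y[j];

  z[k] = (int) (tmp % RADIX);      /* low 24 bits: the digit */
  z[k - 1] = (int) (tmp / RADIX);  /* remaining bits: the carry */
}
```

With x = {0, 4096, 1}, y = {0, 2, 4096}, z[2] = 5, k = 2, i1 = 1 and
i2 = 3, the sum is 5 + 4096*4096 + 1*2 = 2^24 + 7, so z[2] becomes 7
and z[1] becomes 1.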

Follow-Ups:
- Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication
  - From: Siddhesh Poyarekar

References:
- Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication
  - From: Steven Munroe
- Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication
  - From: Siddhesh Poyarekar
- Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication
  - From: Steven Munroe
- Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication
  - From: Siddhesh Poyarekar
