This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v2] Add math-inline benchmark


On Tue, Jul 21, 2015 at 08:09:35PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Tue, Jul 21, 2015 at 05:14:51PM +0100, Wilco Dijkstra wrote:
> > > > Ondřej Bílka wrote:
> > > > On Mon, Jul 20, 2015 at 12:01:50PM +0100, Wilco Dijkstra wrote:
> > > > > > Ondřej Bílka wrote:
> > > > > > On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > > > > But you claimed the following in your original mail, which is wrong:
> > > > > >
> > > > > > "
> > > > > > Results shows that using the GCC built-ins in math.h gives huge speedups due to
> > avoiding
> > > > > > explict
> > > > > > calls, PLT indirection to execute a function with 3-4 instructions - around 7x on
> > AArch64
> > > > and
> > > > > > 2.8x
> > > > > > on x64. The GCC builtins have better performance than the existing math_private
> > inlines
> > > > for
> > > > > > __isnan,
> > > > > > __finite and __isinf_ns, so these should be removed.
> > > > > > "
> > > > >
> > > > > No that statement is 100% correct.
> > > > >
> > > > As for __isinf_ns, on some architectures the current isinf inline is
> > > > better, so it should be replaced by that instead.
> > >
> > > The current inline (__isinf_ns) is not better than the builtin.
> > >
> > That's correct, as I also said. But it doesn't change my point:
> > __isinf_ns is slower than the builtin, and on x64 the builtin is slower
> > than the current inline.
> >
> > So we should replace the internal __isinf_ns with the current inline to
> > get better performance.
> 
> Sure it is certainly possible to improve upon the built-ins. But that
> was never the goal of my patch, so it is not relevant to this discussion.
>
As the current math_private inlines in libc are better, there is no reason
to use the builtins internally. Until you convince us that the builtins
result in faster code, they shouldn't be used internally.
 
> > > > Your claim about the __finite builtin is definitely false on x64; it's
> > > > slower:
> > >
> > > No that's not what I am getting on x64. With the movq instruction
> > > and inlining as my original patch:
> > >
> > That's only your patch. It isn't clear at all whether that happens in
> > reality. My benchmark shows that it doesn't.
> > 
> > So, Carlos, we have two benchmarks, each saying that one alternative is
> > better than the other. How do you resolve that?
> 
> You'd solve it by showing a particular inline is faster than the builtin on
> some GLIBC math functions and propose a patch.
> 
I told you at the start of the thread that you should convince us by
testing, for example, the ctan function. Remember that it's your
responsibility to show that the change leads to a better result.

> > > > > It's obvious the huge speedup applies to all other architectures as well -
> > > > > it's hard to imagine that avoiding a call, a return, a PLT indirection and
> > > > > additional optimization of 3-4 instructions could ever cause a slowdown...
> > > > >
> > > > But I didn't ask about that. I asked because you made the fix
> > > > x64-specific, but it's not clear at all whether other architectures
> > > > are affected or not. So you must test them.
> > >
> > > All architectures are affected as they will get the speedup from the inlining.
> > >
> > Again, this is irrelevant. The current inlines could be faster than the
> > builtins on most architectures, so you should test that and open GCC bugs.
> 
> Since the patch has been out for review, people have been able to test it for 
> their architecture. Anyone can suggest further improvements if they want even more 
> performance. There is already a GCC bug to improve the built-ins.
>
So do it. I ask you again to run my benchmark on ARM to see whether the
builtins can be improved or not.
 
> > > > > > So I ask you again to run my benchmark with a changed EXTRACT_WORDS64 to
> > > > > > see if this is a problem on ARM as well.
> > > > >
> > > > > Here are the results for x64 with inlining disabled (__always_inline changed
> > > > > into noinline) and the movq instruction like you suggested:
> > > > >
> > > > I asked you to run my benchmark on ARM, not your benchmark on x64. With
> > > > the modifications you described, it measures just the performance of a
> > > > noninline function, which we don't want. There is a difference in how
> > > > gcc optimizes that, so you must surround the entire expression with
> > > > noinline. For example, gcc optimizes
> > > >
> > > > # define isinf(x) (noninf (x) ? 0 : (x == 1.0 / 0.0 ? 1 : -1))
> > > > if (isinf (x))
> > > >   foo ();
> > > >
> > > > into
> > > >
> > > > if (!noninf (x))
> > > >   foo ();
> > >
> > > Which is exactly what we want to happen (and why the non-inlined isinf
> > > results are not interesting, as 99.9% of the time we do the optimization).
> > >
> > That isn't completely correct. While this happens, most callers do the is*
> > check only once at the start, so loading the constants costs you. So you
> > need to be careful how you write the benchmark to get a correct result,
> > as I did in mine.
> 
> Actually almost all uses within GLIBC use the macros multiple times within the 
> same function, so the immediates can be shared (sometimes even between 2 different
> macros, and any fabs or FP->int moves can be shared too).
>
There is a difference between using a macro several times, where the
overhead is smaller, and using it in a loop, where there is no overhead.

As for sharing, it's complex. I don't know a better selection strategy than
choosing the float or integer implementation on a case-by-case basis, as
the conditions for which is faster are complex.

A basic example where the integer implementation could be faster, even if
the floating-point one is faster in the general case, is

if (isinf(x))
  foo();
else if (isnan(x))
  bar();
else
  baz();

For isinf you use the check (2 * to_int (x) == c) and for isnan the check
(2 * to_int (x) > c); gcc then combines these into one comparison
instruction, 2 * to_int (x) >= c.


 
> > > > >    "isnan_t": {
> > > > >     "normal": {
> > > > >      "duration": 1.50514e+06,
> > > > >      "iterations": 500,
> > > > >      "mean": 3010
> > > > >     }
> > > > >    },
> > > >
> > > > Why is isnan faster than the builtin?
> > >
> > > You asked for results with inlining turned off, so basically this shows the difference
> > > between inlining the isnan builtin into a loop and calling a function which has the
> > > isnan builtin inlined.
> > >
> > I never asked you to do that. I said that you should measure an expression
> > using isinf, surrounded by a noinline function, to avoid reusing constants
> > in the loop.
> >
> > That will give you different results than the ones you got.
> 
> No, that wouldn't make any difference - all that matters for that experiment
> is encapsulating the immediate in a separate function; you'll get one call
> every iteration either way.
> 
That's completely false: my benchmark shows that a different implementation
is faster than yours when I use noinline around specific expressions. Your
results, for example, could be invalid if gcc decides to turn all the
implementations into branchless code, which completely changes the
performance characteristics.
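
As a concrete illustration of the methodology I mean (a sketch with my own
naming, not the actual benchmark code): wrap the whole expression that uses
the classifier in a noinline function, so gcc can neither hoist the
comparison constants out of the timing loop nor fold the branch into the
caller.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical benchmark kernel: the entire expression using isinf is
   kept out of line, so its immediates cannot be reused across loop
   iterations and the branch cannot be folded into the caller.  */
__attribute__ ((noinline)) static int
count_inf_step (double x, int acc)
{
  return acc + (isinf (x) != 0);
}

/* Driver loop: one real call per iteration, as a benchmark would do.  */
static int
count_infs (const double *v, size_t n)
{
  int acc = 0;
  for (size_t i = 0; i < n; i++)
    acc = count_inf_step (v[i], acc);
  return acc;
}
```

Marking only the classifier itself noinline instead would measure a bare
function call, which is the disagreement above.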

