This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH v2] Add math-inline benchmark
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: 'GNU C Library' <libc-alpha at sourceware dot org>
- Date: Thu, 23 Jul 2015 17:39:41 +0200
- Subject: Re: [PATCH v2] Add math-inline benchmark
- References: <002001d0bfb8$b36fa330$1a4ee990$ at com> <20150716225056 dot GA24479 at domone> <002501d0c094$3ea04cd0$bbe0e670$ at com> <20150718113423 dot GC30356 at domone> <002a01d0c2db$7ad80c30$70882490$ at com> <20150720192216 dot GA2019 at domone> <003001d0c3d0$5fb0a390$1f11eab0$ at com> <20150721174732 dot GB22893 at domone> <003301d0c3e8$c8276b80$58764280$ at com>
On Tue, Jul 21, 2015 at 08:09:35PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Tue, Jul 21, 2015 at 05:14:51PM +0100, Wilco Dijkstra wrote:
> > > > Ondřej Bílka wrote:
> > > > On Mon, Jul 20, 2015 at 12:01:50PM +0100, Wilco Dijkstra wrote:
> > > > > > Ondřej Bílka wrote:
> > > > > > On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > > > > But you claimed the following in the original mail, which is wrong:
> > > > > >
> > > > > > "
> > > > > > Results show that using the GCC built-ins in math.h gives huge speedups due to
> > > > > > avoiding explicit calls and PLT indirection to execute a function with 3-4
> > > > > > instructions - around 7x on AArch64 and 2.8x on x64. The GCC builtins have
> > > > > > better performance than the existing math_private inlines for __isnan,
> > > > > > __finite and __isinf_ns, so these should be removed.
> > > > > > "
> > > > >
> > > > > No that statement is 100% correct.
> > > > >
> > > > As for isinf_ns, on some architectures the current isinf inline is better,
> > > > so it should be replaced by that instead.
> > >
> > > The current inline (__isinf_ns) is not better than the builtin.
> > >
> > That's correct, as I also said. But it doesn't change my point:
> > isinf_ns is slower than the builtin, and on x64 the builtin is slower
> > than the current inline.
> >
> > So we should replace the internal isinf_ns with the current inline to
> > get better performance.
>
> Sure it is certainly possible to improve upon the built-ins. But that
> was never the goal of my patch, so it is not relevant to this discussion.
>
As the current math_private inlines in libc are better, there is no reason to
use the builtins internally. Until you convince us that the builtins result in
faster code, they shouldn't be used internally.
> > > > Your claim about the __finite builtin is definitely false on x64; it's
> > > > slower:
> > >
> > > No, that's not what I am getting on x64. With the movq instruction
> > > and inlining as in my original patch:
> > >
> > That's only in your patch. It isn't clear at all that this happens in
> > reality; my benchmark shows that it doesn't.
> >
> > So Carlos, we have two benchmarks, each saying that one alternative is
> > better than the other. How do you resolve that?
>
> You'd solve it by showing a particular inline is faster than the builtin on
> some GLIBC math functions and propose a patch.
>
I told you at the start of the thread that you should convince us by testing,
for example the ctan function. Remember that it's your responsibility to show
that the change leads to a better result.
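For example, a hypothetical harness along these lines (not a benchtests
harness, just a sketch with arbitrary iteration count and inputs) could be run
against a libc built with and without the change:

#include <complex.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
  double complex acc = 0;
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < 10000000; i++)
    /* ctan exercises the is* checks on its input internally.  */
    acc += ctan (0.5 + 0.25 * I + i * 1e-9);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("%.3f s (acc = %g)\n",
          (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9,
          creal (acc));
  return 0;
}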
> > > > > It's obvious the huge speedup applies to all other architectures as well -
> > > > > it's hard to imagine that avoiding a call, a return, a PLT indirection and
> > > > > additional optimization of 3-4 instructions could ever cause a slowdown...
> > > > >
> > > > But I didn't ask about that. I pointed out that you made the bug
> > > > x64-specific, but it's not clear at all whether other architectures are
> > > > affected or not. So you must test them.
> > >
> > > All architectures are affected as they will get the speedup from the inlining.
> > >
> > Again this is irrelevant. The current inlines could be faster than the builtins
> > on most architectures, so you should test that and open gcc bugs.
>
> Since the patch has been out for review, people have been able to test it for
> their architecture. Anyone can suggest further improvements if they want even more
> performance. There is already a GCC bug to improve the built-ins.
>
So do it. I ask you again to run my benchmark on arm to see whether the
builtins could be improved or not.
> > > > > > So I ask you again to run my benchmark with the changed EXTRACT_WORDS64 to
> > > > > > see whether this is a problem on arm as well.
> > > > >
> > > > > Here are the results for x64 with inlining disabled (__always_inline changed
> > > > > into noinline) and the movq instruction like you suggested:
> > > > >
> > > > I asked you to run my benchmark on arm, not your benchmark on x64. With the
> > > > modifications you described, it measures just the performance of a
> > > > non-inlined function, which we don't want. There is a difference in how gcc
> > > > optimizes that, so you must surround the entire expression with noinline.
> > > > For example gcc optimizes
> > > >
> > > > # define isinf(x) (noninf(x) ? 0 : (x == 1.0 / 0.0 ? 1 : -1))
> > > > if (isinf(x))
> > > > foo()
> > > >
> > > > into
> > > >
> > > > if (!noninf(x))
> > > > foo()
> > >
> > > Which is exactly what we want to happen (and why the non-inlined isinf results
> > > are not interesting as 99.9% of times we do the optimization).
> > >
> > That isn't completely correct. While this happens, most callers do the is*
> > check only once at the start, so loading the constants costs you. So you need
> > to be careful how you write the benchmark to get a correct result, as I did
> > in mine.
>
> Actually almost all uses within GLIBC use the macros multiple times within the
> same function, so the immediates can be shared (sometimes even between 2 different
> macros, and any fabs or FP->int moves can be shared too).
>
There is a difference between using a macro several times, where the overhead
is smaller, and using it in a loop, where there is no overhead.

As for sharing, it's complex; I don't know a better selection than choosing the
float/int implementation on a case-by-case basis, as the conditions for which
is faster are complex.

A basic example where the integer implementation could be faster, even if the
floating-point one is faster in the general case, is:

if (isinf (x))
  foo ();
else if (isnan (x))
  bar ();
else
  baz ();

For isinf the check is (2 * to_int (x) == c) and for isnan it is
(2 * to_int (x) > c); gcc then merges these into one comparison instruction,
2 * to_int (x) >= c.
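A rough sketch of that integer trick, with illustrative names (to_int and C are
not glibc identifiers), assuming IEEE 754 binary64 doubles:

#include <stdint.h>
#include <string.h>

static inline uint64_t
to_int (double x)
{
  uint64_t i;
  memcpy (&i, &x, sizeof i);
  return i;
}

/* C is twice the bit pattern of +Inf; doubling drops the sign bit.  */
#define C 0xffe0000000000000ULL

static inline int my_isinf (double x) { return 2 * to_int (x) == C; }
static inline int my_isnan (double x) { return 2 * to_int (x) >  C; }

/* In
     if (my_isinf (x)) foo (); else if (my_isnan (x)) bar (); else baz ();
   gcc can combine both tests: a single 2 * to_int (x) >= C comparison
   decides the else branch, and only one extraction of x is needed.  */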
> > > > > "isnan_t": {
> > > > > "normal": {
> > > > > "duration": 1.50514e+06,
> > > > > "iterations": 500,
> > > > > "mean": 3010
> > > > > }
> > > > > },
> > > >
> > > > why is isnan faster than the builtin?
> > >
> > > You asked for results with inlining turned off, so basically this shows the difference
> > > between inlining the isnan builtin into a loop and calling a function which has the
> > > isnan builtin inlined.
> > >
> > I never asked you to do that. I said that you should measure an expression
> > using isinf surrounded by a noinline function, to avoid reusing the constants
> > in the loop.
> >
> > That will give you different results from what you got.
>
> No, that wouldn't make any difference - all that matters for that experiment is
> encapsulating the immediate in a separate function; you'll get one call every
> iteration either way.
>
That's completely false, as my benchmark shows that a different implementation
is faster than yours when I use noinline with specific expressions. For
example, such results could be invalid because gcc decides to turn all
implementations into branchless code, which completely changes the performance
characteristics.
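To illustrate the methodological point (names are illustrative, not the
benchmark's actual code): if only the word extraction is noinline, the
comparison against the immediate can still be optimised and its constant
hoisted out of the measurement loop; with noinline around the whole
expression, every call pays for loading the immediate, as a cold-path caller
would.

#include <stdint.h>
#include <string.h>

/* Variant A: only the extraction is opaque to the compiler.  */
__attribute__ ((noinline)) static uint64_t
extract_noinline (double x)
{
  uint64_t i;
  memcpy (&i, &x, sizeof i);
  return i;
}

static inline int
isinf_a (double x)
{
  /* The comparison and its constant are still visible in the caller.  */
  return 2 * extract_noinline (x) == 0xffe0000000000000ULL;
}

/* Variant B: the whole expression is opaque, so the immediate load and
   the comparison are measured together with the extraction.  */
__attribute__ ((noinline)) static int
isinf_b (double x)
{
  uint64_t i;
  memcpy (&i, &x, sizeof i);
  return 2 * i == 0xffe0000000000000ULL;
}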