This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] Add math-inline benchmark


On Fri, Jul 10, 2015 at 05:09:16PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Mon, Jul 06, 2015 at 03:50:11PM +0100, Wilco Dijkstra wrote:
> > >
> > >
> > > > Ondřej Bílka wrote:
> > > > But with latency hiding by using the argument first, suddenly even isnan and
> > > > isnormal become a regression.
> > > >
> 
> That doesn't look correct - it looks like this didn't use the built-ins at all,
> did you forget to apply that patch?
>
No, from what you wrote I expected that the patch already tested the built-ins,
which it doesn't. After applying the patch I got different results; with the
patch added, the results are now similar.
 
> Anyway I received a new machine, so now GLIBC finally builds for x64. Since
> there appear to be large variations from run to run, I repeat the same tests 4 times
> by copying the FOR_EACH_IMPL loop. The first 1 or 2 are bad; the last 2
> converge to usable results. So I suspect frequency scaling is an issue here.
> 
> Without the sin(tmp) part I get:
> 
>    remainder_test2_t:   40786.9 192862
>    remainder_test1_t:   43008.2 196311
> __fpclassify_test2_t:   2856.56 3020.12
> __fpclassify_test1_t:   3043.53 3135.89
>       __fpclassify_t:   12500.6 10152.5
>         fpclassify_t:   2972.54 3047.65
>    __isnormal_inl2_t:   4619.55 14491.1
>     __isnormal_inl_t:   12896.3 10306.7
>           isnormal_t:   4254.42 3667.87
>       __finite_inl_t:   3979.58 3991.6
>           __finite_t:   7039.61 7039.37
>           isfinite_t:   2992.65 2969.25
>        __isinf_inl_t:   2852.1  3239.23
>            __isinf_t:   8991.81 8813.44
>              isinf_t:   3241.75 3241.54
>        __isnan_inl_t:   4003.51 3977.73
>            __isnan_t:   7054.54 7054.5
>              isnan_t:   2819.66 2801.94
> 
> And with the sin() addition:
> 
>    remainder_test2_t:   105093  214635
>    remainder_test1_t:   106635  218012
> __fpclassify_test2_t:   64290.9 32116.6
> __fpclassify_test1_t:   64365.1 32310.2
>       __fpclassify_t:   72006.1 41607
>         fpclassify_t:   64190.3 33450.1
>    __isnormal_inl2_t:   65959.1 33672
>     __isnormal_inl_t:   71875.7 41727.3
>           isnormal_t:   65676.1 32826.1
>       __finite_inl_t:   69600.6 35293.3
>           __finite_t:   67653.8 38627.2
>           isfinite_t:   64435.9 34904.9
>        __isinf_inl_t:   68556.6 33176
>            __isinf_t:   69066.4 39562.7
>              isinf_t:   64755.5 34244.6
>        __isnan_inl_t:   69577.3 34776.2
>            __isnan_t:   67538.8 38321.3
>              isnan_t:   63963   33276.6
> 
> The remainder test is basically math/w_remainder.c adapted to use __isinf_inl
> and __isnan_inl (test1) or the isinf/isnan built-ins (test2).
> 
Which still doesn't have to mean anything; only if you test an application that
frequently uses these will you get a result beyond doubt.
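
To make concrete what is being timed in that remainder test, the check in such a
wrapper is roughly of the following shape (illustrative only; the real test is
math/w_remainder.c adapted to the inlines or built-ins, and remainder () here
just stands in for __ieee754_remainder):

#include <math.h>

double
remainder_checked (double x, double y)
{
  /* The part whose cost the benchmark compares: an isnan/isinf
     classification of the inputs.  With the GCC built-ins this is a few
     inline instructions; with the out-of-line functions it is a call.  */
  if (__builtin_expect (isnan (x) || isnan (y) || isinf (x) || y == 0.0, 0))
    return (x * y) / (x * y);  /* Domain error: raise invalid, return NaN.  */

  return remainder (x, y);     /* Stand-in for __ieee754_remainder.  */
}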

Here a simple modification produces different results. One of many objections
is that when you just add the result, gcc will try to generate branchless code,
for example converting it to res += 5 * (isnan(tmp)). So I use a branch that is
harder to eliminate, and with the following two variants I get __builtin_isinf
a lot slower (a standalone sketch of the branchless-versus-branchy distinction
follows the second set of results below).

    { double tmp = p[i] * 2.0;                                          \
      res += 3 * sin (tmp); if (func (tmp)) res += 3 * sin (2 * tmp); } \

    { double tmp = p[i] * 2.0;                                          \
      if (func (tmp)) res += 3 * sin (2 * tmp); }                       \


__fpclassify_test2_t: 	4645.32	34549.8
__fpclassify_test1_t: 	4858.35	35682
      __fpclassify_t: 	12959.6	9402.96
        fpclassify_t: 	6163.18	3573.92
   __isnormal_inl2_t: 	72463.8	14535.8
    __isnormal_inl_t: 	78154.4	10665.7
          isnormal_t: 	72172.7	3978.68
      __finite_inl_t: 	69525.5	2048.91
          __finite_t: 	75383.9	10260.5
          isfinite_t: 	71913.3	3449.38
       __isinf_inl_t: 	2279.96	19166.1
           __isinf_t: 	10462.3	32308.5
             isinf_t: 	3706.35	22804.2
       __isnan_inl_t: 	2376.56	20769.1
           __isnan_t: 	9558.7	23683.9
             isnan_t: 	2025.91	19824.9

and

__fpclassify_test2_t: 	69720.7	70690
__fpclassify_test1_t: 	69575.2	57032
      __fpclassify_t: 	12839.1	9460.82
        fpclassify_t: 	6136.18	3355.79
   __isnormal_inl2_t: 	133754	34673.9
    __isnormal_inl_t: 	137851	41468.9
          isnormal_t: 	133421	37423.4
      __finite_inl_t: 	131153	34229.9
          __finite_t: 	135990	39609.5
          isfinite_t: 	132679	34604.8
       __isinf_inl_t: 	66901.3	44636.5
           __isinf_t: 	74330.2	53375.7
             isinf_t: 	68432.6	45350.5
       __isnan_inl_t: 	67125.3	43663.6
           __isnan_t: 	71864.2	49792.7
             isnan_t: 	67107.9	42478.3
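
For reference, here is a minimal standalone sketch of the branchless-versus-branchy
distinction above (a hypothetical driver, not the attached ft.c and not the
benchtests harness; the array size and constants are made up):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4096

int
main (void)
{
  static double p[N];
  double res = 0.0;

  for (int i = 0; i < N; i++)
    p[i] = (double) rand () / RAND_MAX;

  /* Branchless-friendly shape: gcc can if-convert this into something
     like res += 5 * isnan (tmp).  */
  for (int i = 0; i < N; i++)
    {
      double tmp = p[i] * 2.0;
      if (isnan (tmp))
        res += 5;
    }

  /* Branchy shape: the taken path does real work (sin), so the compiler
     is much less likely to remove the branch.  */
  for (int i = 0; i < N; i++)
    {
      double tmp = p[i] * 2.0;
      if (isnan (tmp))
        res += 3 * sin (2 * tmp);
    }

  printf ("%g\n", res);
  return 0;
}

(Compile with -lm; none of the numbers above were produced with this sketch.)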

> From this it seems that __isinf_inl is slightly better than the builtin, but
> it does not show up as a regression when combined with sin or in the remainder
> test.
> 
That doesn't hold generally; in the remainder test it could just be caused by
isnan being slower than isinf.

> So I don't see any potential regression here on x64 - in fact it looks like
> inlining using the built-ins gives quite good speedups across the board. And 
> besides inlining applications using GLIBC it also inlines a lot of callsites
> within GLIBC that weren't previously inlined.
> 
That doesn't mean anything, as this is still a poor benchmark. It only uses a
tight loop, which doesn't measure the cost of loading constants, for example.
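
As a concrete example of what a tight loop hides: the inlines and built-ins boil
down to bit tests against masks such as 0x7ff0000000000000, roughly like the
sketch below (illustrative, not the actual glibc inlines). Inside a benchmark
loop the mask is loaded into a register once and stays there; at scattered call
sites in real code it has to be materialized again each time.

#include <stdint.h>
#include <string.h>

static inline int
my_isnan (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);  /* Bit-cast without aliasing trouble.  */
  /* NaN: exponent all ones and a non-zero mantissa.  */
  return (bits & UINT64_C (0x7fffffffffffffff)) > UINT64_C (0x7ff0000000000000);
}

static inline int
my_isinf (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);
  /* Infinity: exponent all ones and a zero mantissa.  Unlike the real
     __isinf, this returns 0/1 instead of the sign of the infinity.  */
  return (bits & UINT64_C (0x7fffffffffffffff)) == UINT64_C (0x7ff0000000000000);
}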

> > > Basically GCC does the array read and multiply twice just like you told it
> > > to (remember this is not using -ffast-math). Also avoid adding unnecessary
> > > FP operations and conversions (which may interact badly with timing the
> > > code we're trying to test).
> > >
> > And how do you know that most users don't use fp conversions in their
> > code just before isinf? These interactions make benchtests worthless, as
> > in practice a different variant would be faster than the one you
> > measure.
> 
> You always get such interactions; it's unavoidable. That's why I added some
> actual math code that uses isinf/isnan to see how it performs in real life.
> 
> > > For me the fixed version still shows the expected answer: the built-ins are
> > > either faster or as fast as the inlines. So I don't think there is any
> > > regression here (remember also that previously there were no inlines at all
> > > except for a few inside GLIBC, so the real speedup is much larger).
> > 
> > That's arm only. So it looks like we need platform-specific headers and testing.
> 
> Well I just confirmed the same gains apply to x64.
> 
No, that doesn't confirm anything yet. You need to do more extensive testing
to get a somewhat reliable answer, and even then you won't be sure.

I asked you to run my benchmark on arm to measure the results of inlining.
I attached the version again. You should run it to see how the results differ.


Attachment: tf
Description: Text document

Attachment: ft.c
Description: Text document

