This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



RE: [PATCH] Inline C99 math functions


> Ondřej Bílka wrote:
> On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > Even with the inlined fpclassify (inl2 below), isnormal is slower:
> >
> > __isnormal_inl2_t:	1.25	3.67
> > __isnormal_inl_t:	4.59	2.89
> > isnormal_t:	1	1
> >
> > So using a dedicated builtin for isnormal is important.
> >
> That makes the result identical to that of isnan. That it's slower is a bug in
> fpclassify, which should first check for normal, then do the unlikely checks.

It's about twice as slow as isnan because the isnormal check isn't done efficiently.
fpclassify is slower still, as it does 3 comparisons before setting FP_NORMAL.
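
To make the difference concrete, here is a minimal sketch of the two shapes. Both function names are hypothetical and neither is the glibc or GCC implementation; the order of tests in classify_sketch is only an assumption that matches "3 comparisons before setting FP_NORMAL". It also assumes IEEE-754 binary64 doubles.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical classify-style check: three comparisons run before
   FP_NORMAL can be returned, so the common case pays for the rare ones. */
static int classify_sketch(double x)
{
    double ax = x < 0 ? -x : x;               /* absolute value */
    if (x != x)         return FP_NAN;        /* comparison 1 */
    if (ax == INFINITY) return FP_INFINITE;   /* comparison 2 */
    if (ax >= DBL_MIN)  return FP_NORMAL;     /* comparison 3 */
    return ax == 0.0 ? FP_ZERO : FP_SUBNORMAL;
}

/* Hypothetical dedicated check: a double is normal when its biased
   exponent field is in 1..0x7fe, which one unsigned range test covers. */
static int isnormal_sketch(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);           /* reinterpret, no conversion */
    uint64_t exponent = (bits >> 52) & 0x7ff;
    return exponent - 1 < 0x7fe;              /* 0 wraps around; 0x7ff gives 0x7fe */
}
```

With these definitions isnormal_sketch(1.0) is 1, while zero, subnormals, infinities and NaNs all give 0, using a single comparison on the exponent field.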

> > > > It's certainly correct, but obviously different microarchitectures will show
> > > > different results. Note the GLIBC private inlines are not particularly good.
> > > >
> > > No, the problem is that different benchmarks show different results on the
> > > same architecture. To speed things up, run the following to test all cases
> > > of environment. Run the attached tf script to get results on arm.
> >
> > I tried, but I don't think this is a good benchmark - you're not measuring
> > the FP->int move for the branched version, and you're comparing the signed
> > version of isinf vs the builtin which does isinf_ns.
> >
> Wilco, that isinf is signed is again completely irrelevant. gcc is smart.
> It gets expanded into
> if (foo ? (bar ? 1 : -1) : 0)
> that gcc simplifies to
> if (foo)
> so it doesn't matter that checking the sign would take 100 cycles, as it's
> deleted code.

It matters for the -DT3 test.
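
A minimal sketch of why the sign can matter in one kind of test and not in another. The function names are hypothetical, and ft.c is not reproduced in this message, so whether -DT3 resembles the second loop is an assumption. In the first loop only the truth value is used and the sign test is dead code; in the second the returned value itself is used, so the sign has to be computed.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical signed classifier: +1 for +inf, -1 for -inf, 0 otherwise. */
static inline int isinf_signed_sketch(double x)
{
    int is_inf = (x == INFINITY) || (x == -INFINITY);
    return is_inf ? (x < 0 ? -1 : 1) : 0;
}

/* Truth-value use: both arms of the inner conditional are nonzero, so a
   compiler may reduce the condition to the outer test alone. */
static int count_inf_truth(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (isinf_signed_sketch(v[i]))
            count++;
    return count;
}

/* Value use: the result is accumulated, so the sign cannot be dropped. */
static int sum_inf_signs(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += isinf_signed_sketch(v[i]);
    return sum;
}
```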

> Also, as it could only make the branched version slower, when it beats the
> builtin then the nonsigned one would beat the builtin too.
> 
> And do you have assembly to show it doesn't measure the move, or is it just
> your guess? On x64 it definitely measures the move, and I could add that gcc
> messes that bit up by moving several times. objdump -d
> on gcc -DT2 -DBRANCHED -DI1="__attribute((always_inline))" -DI2="__attribute__((always_inline))" ft.c -c
> clearly shows that conversion is done several times.

Look at what the inner loop generates for T1 (T2/T3 do the same):

.L27:
        ldr     x2, [x1]
        cmp     x4, x2, lsl 1
        bne     .L29
        fadd    d0, d0, d1
.L29:
        add     x1, x1, 8
        cmp     x1, x3
        bne     .L27
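
For reference, a hedged C sketch of a loop that can compile to the shape above. The names are hypothetical and this is not the code from ft.c; the 1.0 increment is also an assumption, since the listing does not show what d1 holds, and x4 is presumed to hold the shifted infinity pattern. Because each value comes from memory, it can be loaded straight into an integer register (the ldr x2), so no FP->int move appears in the loop. The shift by one discards the sign bit, matching "cmp x4, x2, lsl 1".

```c
#include <assert.h>
#include <math.h>    /* INFINITY, for callers */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical branched, sign-agnostic isinf on the raw bit pattern
   (assumes IEEE-754 binary64 doubles). */
static inline int isinf_bits_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret, no conversion */
    /* Shifting left by one drops the sign bit before the comparison. */
    return (bits << 1) == (UINT64_C(0x7ff0000000000000) << 1);
}

/* Accumulates 1.0 per infinite element; the addition plays the role of
   the fadd that the bne skips when the comparison fails. */
static double count_inf_sketch(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        if (isinf_bits_sketch(v[i]))
            acc += 1.0;
    return acc;
}
```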

> So will you publish the results or not, as they would show your builtins in
> an unfavorable light?

I don't see the point until your benchmark measures the right thing. 
Note my benchmark carefully avoids this gotcha.

Wilco



