[PATCH] Inline C99 math functions

Wed Jun 17 17:21:00 GMT 2015

> Joseph Myers wrote:
> On Tue, 16 Jun 2015, Wilco Dijkstra wrote:
> 
> > > Well, the benchmark should come first....
> >
> > I added a new math-inlines benchmark based on the string benchmark
> > infrastructure.
> 
> Thanks.  I await the patch submission.

See https://sourceware.org/ml/libc-alpha/2015-06/msg00569.html

> > So this clearly shows the GCC built-ins win by a huge margin, including the
> > inline versions. It also shows that multiple isinf/isnan calls would be faster
> 
> That's interesting information - suggesting that changes in GCC to use
> integer arithmetic should be conditional on -fsignaling-nans, if doing the
> operations by integer arithmetic is slower (at least on this processor).
> 
> (It also suggests it's safe to remove the existing glibc-internal inlines
> as part of moving to using the built-in functions when possible.)

Indeed. To check which sequence is better we'd need a more realistic benchmark,
perhaps one based on a GLIBC function that uses these macros in its hot path.
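
For reference, the two kinds of sequence being compared look roughly like the
sketch below. The names isinf_int and isinf_builtin are made up for this
illustration and are not the ones used in the benchmark patch; the integer
version assumes IEEE 754 binary64 doubles.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: isinf done purely with integer arithmetic, so no
   floating-point comparison is executed on the argument.  Assumes
   IEEE 754 binary64.  */
static inline int
isinf_int (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  /* Shift out the sign bit, then compare with the shifted Inf pattern
     (all exponent bits set, mantissa zero).  */
  return (u << 1) == (UINT64_C (0x7ff0000000000000) << 1);
}

/* Hypothetical name: the same test left to the GCC built-in.  */
static inline int
isinf_builtin (double x)
{
  return __builtin_isinf (x) != 0;
}
```

Which of the two is faster depends on the target and on how the compiler
expands the built-in, which is what a benchmark would have to measure.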

> > > > Codesize of what? Few applications use these functions... GLIBC mathlib is
> > >
> > > Size of any code calling these macros (for nonconstant arguments).
> >
> > Well the size of the __isinf_t function is 160 bytes vs isinf_t 84 bytes
> > due to the callee-save overhead of the function call. The builtin isinf uses
> > 3 instructions inside the loop plus 3 lifted before it, while the call to
> > __isinf needs 3 plus a lot of code to save/restore the callee-saves.
> 
> One might suppose that most functions using these macros contain other
> function calls as well, and so that the callee-save overhead should not be
> included in the comparison.

That may be true in some cases, but if the remaining call can be a tail call
(which might be possible in several math veneers) then the callee-save savings
would apply.
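
A minimal sketch of the kind of veneer meant here. The names core_half and
veneer_half are hypothetical and the computation is a placeholder. Whether the
compiler really emits a tail call depends on the target and optimization level.

```c
#include <math.h>

/* Hypothetical stand-in for the real computation behind a veneer.  */
static double __attribute__ ((noinline))
core_half (double x)
{
  return x * 0.5;
}

/* Hypothetical veneer: the special-case check is expanded inline, so
   nothing has to be kept live across a call and the only call left is
   in tail position.  */
double
veneer_half (double x)
{
  if (__builtin_expect (isinf (x) || isnan (x), 0))
    return x + x;		/* Inf stays Inf, NaN stays NaN.  */
  return core_half (x);
}
```

If the check were a real call to an out-of-line function instead, x would have
to survive that call, which is where the callee-save overhead comes from.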

> When you exclude callee-save overhead, how do things compare for
> fpclassify (the main case where inlining may be questionable when
> optimizing for size)?

Well, in the worst-case scenario where you need all 5 tests of fpclassify, it
effectively changes a single-instruction call into 16 instructions plus 2 double
immediates. So it is best to check OPTIMIZE_SIZE for fpclassify for now (keeping
the call when optimizing for size) and revisit when the GCC implementation has
been improved. I also wonder what the difference would be once I've optimized
the __fpclassify implementation - I can do it in about 8-9 instructions.
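
A sketch of how such a guard could look, assuming OPTIMIZE_SIZE refers to GCC's
predefined __OPTIMIZE_SIZE__ macro. The name my_fpclassify is hypothetical and
this is not the definition from the patch.

```c
#include <math.h>

/* Hypothetical macro name.  Expand inline via the GCC built-in unless
   optimizing for size (-Os defines __OPTIMIZE_SIZE__).  */
#if defined __GNUC__ && !defined __OPTIMIZE_SIZE__
# define my_fpclassify(x) \
  __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, \
			FP_SUBNORMAL, FP_ZERO, (x))
#else
/* Optimizing for size: fall back to the standard macro from <math.h>.  */
# define my_fpclassify(x) fpclassify (x)
#endif
```

Code that only distinguishes one or two classes would not need all 5 tests, so
the worst case above is an upper bound on the inline expansion.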

Wilco