This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: [PATCH] Add math-inline benchmark
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: 'Ondřej Bílka' <neleai at seznam dot cz>
- Cc: "GNU C Library" <libc-alpha at sourceware dot org>
- Date: Mon, 6 Jul 2015 15:50:11 +0100
- Subject: RE: [PATCH] Add math-inline benchmark
- References: <001c01d0a912$42357710$c6a06530$ at com> <20150622083657 dot GA3684 at domone>
> Ondřej Bílka wrote:
> On Wed, Jun 17, 2015 at 04:28:27PM +0100, Wilco Dijkstra wrote:
> > Hi,
> >
> > Due to popular demand, here is a new benchmark that tests isinf, isnan,
> > isnormal, isfinite and fpclassify. It uses 2 arrays with 1024 doubles,
> > one with 99% finite FP numbers (10% zeroes, 10% negative) and 1% inf/NaN,
> > the other with 50% inf and 50% NaN.
> >
> > The results show that using the GCC built-ins in math.h will give huge
> > speedups, because it avoids an explicit call and PLT indirection just to
> > execute a function of 3-4 instructions. The GCC built-ins have similar
> > performance to the existing math_private inlines for __isnan, __finite
> > and __isinf_ns.
> >
> > OK for commit?
> >
> I ran these; on x64, using the builtins is a regression even with your
> benchmark.
>
> The main problem here is what exactly you measure. I don't know how much
> of your results was caused by measuring the latency of the
> load/multiply/move-to-int-register chain. With OoO execution that latency
> shouldn't be a problem.
>
> The original results follow, with isfinite also inlined:
>
> __fpclassify_test2_t: 3660.24 3733.22
> __fpclassify_test1_t: 3696.33 3691.3
> __fpclassify_t: 14365.8 11116.5
> fpclassify_t: 6045.69 3128.76
> __isnormal_inl2_t: 5275.85 14562.6
> __isnormal_inl_t: 14753.3 11143.5
> isnormal_t: 4418.84 4411.59
> __finite_inl_t: 3038.75 3038.4
> __finite_t: 7712.42 7697.24
> isfinite_t: 3108.91 3107.85
> __isinf_inl_t: 2109.05 2817.19
> __isinf_t: 8555.51 8559.36
> isinf_t: 3472.62 3408.8
> __isnan_inl_t: 2682.12 2691.39
> __isnan_t: 7698.14 7735.29
> isnan_t: 2592.58 2572.82
>
>
> But when the latency is hidden by using the argument first, suddenly even
> isnan and isnormal become regressions.
>
> for (i = 0; i < n; i++){ res += 3*sin(p[i] * 2.0); \
> if (func (p[i] * 2.0)) res += 5;} \
>
>
> __fpclassify_test2_t: 92929.4 37256.8
> __fpclassify_test1_t: 94020.1 35512.1
> __fpclassify_t: 17321.2 13325.1
> fpclassify_t: 8021.29 4376.89
> __isnormal_inl2_t: 93896.9 38941.8
> __isnormal_inl_t: 98069.2 46140.4
> isnormal_t: 94775.6 36941.8
> __finite_inl_t: 84059.9 38304
> __finite_t: 96052.4 45998.2
> isfinite_t: 93371.5 36659.1
> __isinf_inl_t: 92532.9 36050.1
> __isinf_t: 95929.4 46585.2
> isinf_t: 93290.1 36445.6
> __isnan_inl_t: 82760.7 37452.2
> __isnan_t: 98064.6 45338.8
> isnan_t: 93386.7 37786.4
Can you try this with:
for (i = 0; i < n; i++) \
{ double tmp = p[i] * 2.0; \
if (sin(tmp) < 1.0) res++; if (func (tmp)) res += 5;} \
Basically GCC does the array read and multiply twice just like you told it
to (remember this is not using -ffast-math). Also avoid adding unnecessary
FP operations and conversions (which may interact badly with timing the
code we're trying to test).
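The difference between the two loop shapes can be sketched outside the benchmark harness. This is only an illustration under stated assumptions: slow_fp_op is a hypothetical stand-in for sin (so the sketch builds without libm), __builtin_isnan stands in for the benchmark's func macro parameter, and the names count_recomputed and count_hoisted are not part of the posted patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for sin(): any FP operation whose result feeds the
   first test. Chosen so the sketch does not need libm. */
static double slow_fp_op(double x)
{
  return x * x - x;
}

/* Shape of the loop from the quoted message: p[i] * 2.0 is written twice,
   once as the argument of the FP operation and once for the classification. */
int count_recomputed(const double *p, size_t n)
{
  int res = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (slow_fp_op(p[i] * 2.0) < 1.0)
        res++;
      if (__builtin_isnan(p[i] * 2.0))
        res += 5;
    }
  return res;
}

/* Shape of the suggested loop: the load and multiply are written once and
   the same temporary feeds both the FP operation and the classification. */
int count_hoisted(const double *p, size_t n)
{
  int res = 0;
  for (size_t i = 0; i < n; i++)
    {
      double tmp = p[i] * 2.0;
      if (slow_fp_op(tmp) < 1.0)
        res++;
      if (__builtin_isnan(tmp))
        res += 5;
    }
  return res;
}
```

Both functions return the same count for the same input; they differ only in how much work is spelled out per iteration, which is what the timing comparison is sensitive to.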
For me the fixed version still shows the expected answer: the built-ins are
either faster or as fast as the inlines. So I don't think there is any
regression here (remember also that previously there were no inlines at all
except for a few inside GLIBC, so the real speedup is much larger).
__fpclassify_test2_t: 1.07
__fpclassify_test1_t: 1.07
__fpclassify_t: 1.24
fpclassify_t: 1
__isnormal_inl2_t: 1.11
__isnormal_inl_t: 1.24
isnormal_t: 1.04
__finite_inl_t: 1.04
__finite_t: 1.19
isfinite_t: 1
__isinf_inl_t: 1.07
__isinf_t: 1.22
isinf_t: 1
__isnan_inl_t: 1.04
__isnan_t: 1.14
isnan_t: 1
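For readers unfamiliar with the two kinds of check being compared, here is a rough sketch of a bit-pattern inline next to the corresponding GCC built-in. The helpers isnan_bits and isinf_bits are hypothetical illustrations, not the actual math_private.h inlines, and they assume IEEE 754 binary64 doubles; the built-ins __builtin_isnan and __builtin_isinf are the real GCC ones.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical integer-based check: a double is NaN when its exponent field
   is all ones and its mantissa is non-zero. Assumes IEEE 754 binary64. */
static inline int isnan_bits(double x)
{
  uint64_t u;
  memcpy(&u, &x, sizeof u);               /* well-defined type punning */
  return (u & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}

/* Hypothetical integer-based check: infinity has an all-ones exponent and a
   zero mantissa; the sign bit is masked off first. */
static inline int isinf_bits(double x)
{
  uint64_t u;
  memcpy(&u, &x, sizeof u);
  return (u & 0x7fffffffffffffffULL) == 0x7ff0000000000000ULL;
}

/* The same predicates via the GCC built-ins, normalised to 0 or 1. */
static inline int isnan_builtin(double x)
{
  return __builtin_isnan(x) != 0;
}

static inline int isinf_builtin(double x)
{
  return __builtin_isinf(x) != 0;
}
```

The two families agree on every input when the code is not built with -ffast-math; which one is faster depends on the target and on how the value is already held (FP or integer register), which is the question the benchmark tries to answer.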
Wilco