This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: [PATCH] Add math-inline benchmark
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: 'Ondřej Bílka' <neleai at seznam dot cz>
- Cc: "GNU C Library" <libc-alpha at sourceware dot org>
- Date: Mon, 13 Jul 2015 12:02:51 +0100
- Subject: RE: [PATCH] Add math-inline benchmark
- References: <001c01d0a912$42357710$c6a06530$ at com> <20150622083657 dot GA3684 at domone> <000701d0b7fb$0f27b840$2d7728c0$ at com> <20150709124454 dot GA29625 at domone> <001a01d0bb2a$c4f893b0$4ee9bb10$ at com> <20150710181111 dot GA27786 at domone>
> Ondřej Bílka wrote:
> On Fri, Jul 10, 2015 at 05:09:16PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Mon, Jul 06, 2015 at 03:50:11PM +0100, Wilco Dijkstra wrote:
> > > >
> > > >
> > > > > Ondřej Bílka wrote:
> > > > > But with latency hiding by using the argument first, suddenly even isnan and
> > > > > isnormal become regressions.
> > > > >
> >
> > That doesn't look correct - it looks like this didn't use the built-ins at all,
> > did you forget to apply that patch?
> >
> No, from what you wrote I expected that the patch already tests the builtins,
> which it doesn't. I applied the patch and got different results. With the
> patch added, the results are similar.
OK, I extended the benchmark to add the built-ins explicitly so that
you don't need to apply the math.h inline patch first.
> Which still doesn't have to mean anything; only if you test an
> application that frequently uses these will you get a result beyond
> doubt.
We don't have applications that use these, but we can say without any
doubt that they will show huge speedups if they do use these functions
frequently or any math functions that use them a lot. Remainder() for
example shows ~7% gain with the new inlines.
> Here a simple modification produces different results. One of many
> objections is that with a simple addition gcc will try to make branchless
> code, like converting that to res += 5 * (isnan(tmp)). So use a more difficult
> branch, and with the following two I get __builtin_isinf a lot slower.
>
> { double tmp = p[i] * 2.0; \
> res += 3 * sin (tmp); if (func (tmp)) res += 3* sin (2 * tmp) ;} \
>
> { double tmp = p[i] * 2.0; \
> if (func (tmp)) res += 3 * sin (2 * tmp) ;} \
So here are the results again for the original test and your 2 tests above:
remainder_test2_t: 40966.3 192314
remainder_test1_t: 43697.4 196474
__fpclassify_t: 12665.2 9951.16
fpclassify_t: 2979.56 2974.35
__fpclassify_test2_t: 2889.92 2984.95
__fpclassify_test1_t: 3269.67 3199.05
isnormal_t: 4381.54 4041.78
__isnormal_builtin_t: 4586.15 4318.18
__isnormal_inl2_t: 4371.76 10737.4
__isnormal_inl_t: 12635.5 10418.4
isfinite_t: 2992.79 2979.5
__isfinite_builtin_t: 2982.96 2982.92
__finite_inl_t: 4090.2 4064.52
__finite_t: 7058.1 7039.74
isinf_t: 3274.14 3299.75
__isinf_builtin_t: 3195.79 3196.05
__isinf_ns_builtin_t: 3241.91 3241.96
__isinf_ns_t: 3500.85 3493.8
__isinf_inl_t: 2834.83 3433.89
__isinf_t: 8794.62 8812.5
isnan_t: 2801.83 2801.67
__isnan_builtin_t: 2794.7 2891.37
__isnan_inl_t: 4216.83 3980.52
__isnan_t: 7070.36 7088.15

remainder_test2_t: 105654 239008
remainder_test1_t: 107533 239310
__fpclassify_t: 12523.5 10080.2
fpclassify_t: 2974.47 2983.21
__fpclassify_test2_t: 64227.5 55564.1
__fpclassify_test1_t: 64036.1 55424
isnormal_t: 122300 34529.5
__isnormal_builtin_t: 122592 34616
__isnormal_inl2_t: 123425 35056
__isnormal_inl_t: 129589 41615.3
isfinite_t: 123254 34041.5
__isfinite_builtin_t: 123302 34093
__finite_inl_t: 123455 34631.8
__finite_t: 127298 39587.5
isinf_t: 63744 45997.6
__isinf_builtin_t: 63545.2 46100.2
__isinf_ns_builtin_t: 63570.9 46087.5
__isinf_ns_t: 63890.9 45754.5
__isinf_inl_t: 64008.5 46505.2
__isinf_t: 68915.7 51833.8
isnan_t: 62866.8 45023.5
__isnan_builtin_t: 62951.9 44956.8
__isnan_inl_t: 63855.1 45294.6
__isnan_t: 67156.5 49505.3

remainder_test2_t: 41147.4 216349
remainder_test1_t: 43860.8 220614
__fpclassify_t: 12569.1 10124.3
fpclassify_t: 3068.91 2974.31
__fpclassify_test2_t: 4048.88 32446.5
__fpclassify_test1_t: 4005.86 32783.1
isnormal_t: 63707.9 14550.8
__isnormal_builtin_t: 63672 14383
__isnormal_inl2_t: 65730.4 15059.1
__isnormal_inl_t: 73352.2 10570.7
isfinite_t: 64756.7 2719.12
__isfinite_builtin_t: 64748.8 2664.12
__finite_inl_t: 65331.4 2740.1
__finite_t: 70374.5 7944.69
isinf_t: 2927.67 20684.2
__isinf_builtin_t: 2848.58 20050.9
__isinf_ns_builtin_t: 2932.22 21809.9
__isinf_ns_t: 2908.41 18973.9
__isinf_inl_t: 2971.63 18025.5
__isinf_t: 9010.58 28392.3
isnan_t: 2841.28 16457.3
__isnan_builtin_t: 2841.34 15017.8
__isnan_inl_t: 2846.25 19736.8
__isnan_t: 8171.32 23874.3
> > From this it seems that __isinf_inl is slightly better than the builtin, but
> > it does not show up as a regression when combined with sin or in the remainder
> > test.
> >
> That doesn't hold generally; in the remainder test it could just be caused by
> isnan being slower than isinf.
No, the new isinf/isnan are both faster than the previous versions (some isinf
calls were inlined as __isinf_ns, but even that one is clearly slower than the
builtin in all the results). Remember once again this patch creates new inlines
that didn't exist before as well as replacing existing inlines in GLIBC with
even faster ones. The combination of these means it is simply an impossibility
that anything could become slower.
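To make the two kinds of check concrete, here is a minimal sketch that puts an integer-based test next to the GCC built-ins. The *_bits functions are hypothetical stand-ins written for this illustration, assuming IEEE 754 binary64 doubles; they are not the actual GLIBC inlines. The *_builtin wrappers only forward to __builtin_isnan and __builtin_isinf.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a double into an integer.  Assumes double is
   IEEE 754 binary64, i.e. the same size as uint64_t.  */
static inline uint64_t
bits_of (double x)
{
  uint64_t u;
  assert (sizeof u == sizeof x);
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Hypothetical integer-based checks (not glibc's code): drop the sign bit
   and compare against the all-ones exponent with a zero mantissa.  */
static inline int
isnan_bits (double x)
{
  return (bits_of (x) & UINT64_C (0x7fffffffffffffff))
         > UINT64_C (0x7ff0000000000000);
}

static inline int
isinf_bits (double x)
{
  return (bits_of (x) & UINT64_C (0x7fffffffffffffff))
         == UINT64_C (0x7ff0000000000000);
}

/* Thin wrappers around the GCC built-ins, normalised to 0 or 1.  */
static inline int
isnan_builtin (double x)
{
  return __builtin_isnan (x) != 0;
}

static inline int
isinf_builtin (double x)
{
  return __builtin_isinf (x) != 0;
}
```

Both forms return the same answers; the benchmark is concerned only with which of them the compiler turns into faster code on a given target.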
> > Well I just confirmed the same gains apply to x64.
> >
> No, that doesn't confirm anything yet. You need to do more extensive
> testing to get a somewhat reliable answer, and still you won't be sure.
No, this benchmark does give a very clear and reliable answer: everything
speeds up by a huge factor.
> I asked you to run my benchmark on ARM to measure the results of inlining.
> I have attached the version again. You should run it to see how the results differ.
I did run it but I don't understand what it's supposed to mean, and I can't share
the results. So do you have something simpler that shows what point you're trying
to make? Or maybe you could add your own benchmark to GLIBC?
Wilco