This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: [PATCH v2] Add math-inline benchmark
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: 'Ondřej Bílka' <neleai at seznam dot cz>
- Cc: "'GNU C Library'" <libc-alpha at sourceware dot org>
- Date: Mon, 20 Jul 2015 12:01:50 +0100
- Subject: RE: [PATCH v2] Add math-inline benchmark
- Authentication-results: sourceware.org; auth=none
- References: <002001d0bfb8$b36fa330$1a4ee990$ at com> <20150716225056 dot GA24479 at domone> <002501d0c094$3ea04cd0$bbe0e670$ at com> <20150718113423 dot GC30356 at domone>
> Ondřej Bílka wrote:
> On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Thu, Jul 16, 2015 at 12:15:19PM +0100, Wilco Dijkstra wrote:
> > > > Add a benchmark for isinf/isnan/isnormal/isfinite/fpclassify. This new
> > > > version adds explicit tests for the GCC built-ins, uses JSON format as
> > > > suggested, and no longer includes any string headers.
> > > > The test uses 2 arrays with 1024 doubles: one with 99% finite FP numbers
> > > > (10% zeroes, 10% negative) and 1% inf/NaN, the other with 50% inf and
> > > > 50% NaN.
> > > >
> > > > Results show that using the GCC built-ins in math.h gives huge speedups
> > > > by avoiding explicit calls and PLT indirection to execute a function of
> > > > 3-4 instructions - around 7x on AArch64 and 2.8x on x64. The GCC
> > > > built-ins have better performance than the existing math_private inlines
> > > > for __isnan, __finite and __isinf_ns, so these should be removed.
> > > >
> > > No, this benchmark is invalid for the following two reasons.
> > >
> > > 1) It doesn't measure a real workload at all. Constructing a large
> > > constant could be costly, and by inlining, this benchmark ignores that cost.
> >
> > It was never meant to measure a real workload. If you'd like to add your own
> > workload, that would be great, but my micro benchmark is more than sufficient
> > in proving that the new inlines give a huge performance gain.
> >
> But you claimed the following in your original mail, which is wrong:
>
> "
> Results show that using the GCC built-ins in math.h gives huge speedups by
> avoiding explicit calls and PLT indirection to execute a function of 3-4
> instructions - around 7x on AArch64 and 2.8x on x64. The GCC built-ins have
> better performance than the existing math_private inlines for __isnan,
> __finite and __isinf_ns, so these should be removed.
> "
No, that statement is 100% correct.
> Also, when inlines give a speedup, you should add math inlines for the
> signaling NaN case as well. That gives a similar speedup. And it would be
> natural to ask whether you should use these inlines everywhere, if they are
> already faster than the builtins.
I'm not sure what you mean here - I enable the new inlines in exactly the
right case. Improvements to support signalling NaNs or to speed up the
built-ins further will be done in GCC.
> > > So at least on x64 we should publish the math_private inlines instead of
> > > using the slow builtins.
> >
> > Well, it was agreed that we are going to use the GCC built-ins and then
> > improve those. If you want to propose additional patches with special
> > inlines for x64 then please go ahead, but my plan is to improve the builtins.
> >
> And how are you sure that it's just an isolated x64 case? It may also happen
> on powerpc, arm, sparc and other architectures, and you need to test that.
It's obvious the huge speedup applies to all other architectures as well -
it's hard to imagine that avoiding a call, a return, a PLT indirection and
the additional optimization of 3-4 instructions could ever cause a slowdown...
> So I ask you again to run my benchmark with the changed EXTRACT_WORDS64 to
> see if this is a problem on arm as well.
Here are the results for x64 with inlining disabled (__always_inline changed
to noinline) and using the movq instruction as you suggested:
"__isnan_t": {
"normal": {
"duration": 3.52048e+06,
"iterations": 500,
"mean": 7040
}
},
"__isnan_inl_t": {
"normal": {
"duration": 3.09247e+06,
"iterations": 500,
"mean": 6184
}
},
"__isnan_builtin_t": {
"normal": {
"duration": 2.20378e+06,
"iterations": 500,
"mean": 4407
}
},
"isnan_t": {
"normal": {
"duration": 1.50514e+06,
"iterations": 500,
"mean": 3010
}
},
"__isinf_t": {
"normal": {
"duration": 4.54681e+06,
"iterations": 500,
"mean": 9093
}
},
"__isinf_inl_t": {
"normal": {
"duration": 3.09981e+06,
"iterations": 500,
"mean": 6199
}
},
"__isinf_ns_t": {
"normal": {
"duration": 3.08074e+06,
"iterations": 500,
"mean": 6161
}
},
"__isinf_ns_builtin_t": {
"normal": {
"duration": 2.64185e+06,
"iterations": 500,
"mean": 5283
}
},
"__isinf_builtin_t": {
"normal": {
"duration": 3.13833e+06,
"iterations": 500,
"mean": 6276
}
},
"isinf_t": {
"normal": {
"duration": 1.60055e+06,
"iterations": 500,
"mean": 3201
}
},
"__finite_t": {
"normal": {
"duration": 3.54966e+06,
"iterations": 500,
"mean": 7099
}
},
"__finite_inl_t": {
"normal": {
"duration": 3.08112e+06,
"iterations": 500,
"mean": 6162
}
},
"__isfinite_builtin_t": {
"normal": {
"duration": 2.6426e+06,
"iterations": 500,
"mean": 5285
}
},
"isfinite_t": {
"normal": {
"duration": 1.49071e+06,
"iterations": 500,
"mean": 2981
}
},
"__isnormal_inl_t": {
"normal": {
"duration": 6.31925e+06,
"iterations": 500,
"mean": 12638
}
},
"__isnormal_inl2_t": {
"normal": {
"duration": 2.18113e+06,
"iterations": 500,
"mean": 4362
}
},
"__isnormal_builtin_t": {
"normal": {
"duration": 3.08183e+06,
"iterations": 500,
"mean": 6163
}
},
"isnormal_t": {
"normal": {
"duration": 2.19867e+06,
"iterations": 500,
"mean": 4397
}
},
"__fpclassify_test1_t": {
"normal": {
"duration": 2.71552e+06,
"iterations": 500,
"mean": 5431
}
},
"__fpclassify_test2_t": {
"normal": {
"duration": 2.69459e+06,
"iterations": 500,
"mean": 5389
}
},
"__fpclassify_t": {
"normal": {
"duration": 6.42558e+06,
"iterations": 500,
"mean": 12851
}
},
"fpclassify_t": {
"normal": {
"duration": 1.4907e+06,
"iterations": 500,
"mean": 2981
}
},
"remainder_test1_t": {
"normal": {
"duration": 2.20506e+07,
"iterations": 500,
"mean": 44101
}
},
"remainder_test2_t": {
"normal": {
"duration": 2.12782e+07,
"iterations": 500,
"mean": 42556
}
}