This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] v11 Improves __ieee754_exp() performance by greater than 5x on sparc/x86.
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: Patrick McGehearty <patrick dot mcgehearty at oracle dot com>, libc-alpha at sourceware dot org
- Cc: nd at arm dot com
- Date: Thu, 22 Feb 2018 19:22:03 +0000
- Subject: Re: [PATCH] v11 Improves __ieee754_exp() performance by greater than 5x on sparc/x86.
- References: <DB6PR0801MB2053765A646D2DCBE02B666883F30@DB6PR0801MB2053.eurprd08.prod.outlook.com> <email@example.com>
On 14/02/18 01:18, Patrick McGehearty wrote:
I note Szabolcs is proposing to modify ieee754_exp() to
remove the Slow path. Since my proposed patch contains
substantial changes to ieee754_exp(), it makes sense to
only make one of these patches. I've done some data
collection comparing the patches for your consideration.
I've labeled the current code "Slow path", Szabolcs version "No Slow path"
and my version "Patrick's exp()".
Comparisons between Slow path, No Slow path, and Patrick's exp()
Existing code is assumed accurate with 0 ulp diffs. Removing the slow
path gets 1 error on the current "make check" test suite. Running ten
million numbers in each rounding mode shows that removing the slow path
yields, on average, only 4-5 results per ten million that differ by 1 ulp.
That is still extremely accurate.
I also measured how often the slow path was taken for those same ten
million values. It was taken approximately 135 times per ten million tests,
but it usually returns the same value as the fast path. The counts are
slightly different for different rounding modes.
Patrick's exp() also only gets 1 error on the current "make check" test suite,
the same test value as the "no slow path" code. It gets approximately
16000 results per ten million tests that differ by 1 ulp, which is somewhat
higher than the "no slow path" code but still relatively rare.
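A minimal sketch of how 1 ulp differences like those above could be counted, assuming the reference results (existing code) and the candidate results have already been collected into arrays. The names ordered_bits, ulp_distance and count_diffs are hypothetical helpers for illustration, not part of glibc or of either patch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a double to an integer that increases monotonically with the value,
   so that adjacent finite doubles differ by exactly 1.  */
static int64_t ordered_bits(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);
    /* Negative doubles are sign-magnitude, so they sort in reverse as
       integers; mirror them below zero.  -0.0 maps to 0, same as +0.0.  */
    return i < 0 ? INT64_MIN - i : i;
}

/* Distance between two finite doubles, in units in the last place.  */
static uint64_t ulp_distance(double a, double b)
{
    assert(isfinite(a) && isfinite(b));
    int64_t ia = ordered_bits(a);
    int64_t ib = ordered_bits(b);
    return ia > ib ? (uint64_t)ia - (uint64_t)ib
                   : (uint64_t)ib - (uint64_t)ia;
}

/* Count how many of the n candidate results differ from the reference.  */
static unsigned long count_diffs(const double *ref, const double *got, size_t n)
{
    unsigned long diffs = 0;
    for (size_t i = 0; i < n; i++)
        if (ulp_distance(ref[i], got[i]) != 0)
            diffs++;
    return diffs;
}
```

To reproduce a comparison of this kind, ref[] would be filled by the existing exp() and got[] by the candidate, once per rounding mode. This measures differences relative to the existing code, not error against the exact result.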
            sparc (nsec)                 x86 (nsec)
          slow  no slow  patrick     slow  no slow  patrick
max      17584      710      873     5158      299      275
min        399      398       96       15       15       15
mean      5497      538      419     1333       28       24
Repeated runs show about 2% variance for identical tests.
Notes: Removing the slow path is a huge performance win
on this set of values.
Patrick's version of exp() is 28% faster on Sparc and 14% faster on x86.
In addition, the existing code (both the "slow" and "no slow" versions) uses
data tables with 13808 bytes for interpolation. Patrick's version
uses data tables with 3168 bytes for interpolation. It is hard
to predict what impact the extra 10K bytes might have on
real applications' use of L1 and L2 cache on various architectures.
Patrick's version could be modified to use larger data tables
to improve accuracy with no loss of performance in the glibc tests,
but it would not approach the "no slow" accuracy levels.
i did some more work on exp.
the 'patrick' version uses different methods for small values
(< 3/2 ln2) and larger ones.
previously i benchmarked with large values, on those the current
glibc code (no slow) is actually faster than patrick on aarch64.
when i benchmark with small values (i suspect that's more common
in practice) then the patrick version is reasonably fast.
i use a single method (nsz exp): on larger inputs it's about 30%
latency improvement compared to noslow and patrick, on small
values i get a tiny bit better latency than patrick (2-3%).
however that relies on having single instruction, rounding mode
independent toint (aarch64), when i change the code to be portable
then it is slower on small values compared to patrick (almost 10%),
on large values it's still about 25% faster.
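a sketch of one common portable substitute for a single-instruction toint: add a large constant so the fraction bits fall off the significand, then read the integer from the low bits of the sum. this is an illustration under stated assumptions, not necessarily what the nsz code does; the name toint_shift is hypothetical. it assumes IEEE binary64 doubles evaluated in double precision, and unlike the aarch64 instruction it only rounds to nearest when the current rounding mode is to-nearest.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round z to an integer without a float-to-int conversion instruction.
   After adding 1.5 * 2^52 the sum has an ulp of 1, so the addition itself
   rounds z to an integer and the low 32 bits of the representation hold
   that integer in two's complement.  Assumes |z| < 2^31.  */
static int32_t toint_shift(double z)
{
    const double shift = 0x1.8p52;
    assert(z > -0x1p31 && z < 0x1p31);

    /* volatile: force the sum to be rounded and stored as a double.  */
    volatile double sum = z + shift;
    double kd = sum;

    uint64_t bits;
    memcpy(&bits, &kd, sizeof bits);
    return (int32_t)(uint32_t)bits;
}
```

in round-to-nearest mode, ties go to even, so toint_shift(2.5) gives 2 and toint_shift(3.5) gives 4. in other rounding modes the addition rounds in the direction of that mode, which is the portability cost compared with a rounding mode independent instruction.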
so i think i have something that's good for aarch64 and i think
it may be an improvement on all targets compared to noslow,
but it's not better than patrick version for small values on
(i removed rounding mode settings from patrick, noslow and nsz
that should be valid for nsz exp and i think for patrick too,
i don't remember why the rounding mode changes were needed there)
it needs a bit more work still before i can post something.
Both the "no slow path" and "Patrick's exp()" show major performance
gains with relatively rare 1 ulp differences in results. The "no slow
path" has the advantage of errors being extremely rare while
"Patrick's exp()" has the advantage of being 14-28% faster.
Any thoughts on general principles on how to decide which patch
to accept, given both seem much better than the existing code?
- Patrick McGehearty