This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] v11 Improves __ieee754_exp() performance by greater than 5x on sparc/x86.
I note Szabolcs is proposing to modify ieee754_exp() to
remove the Slow path. Since my proposed patch contains
substantial changes to ieee754_exp(), it makes sense to
only make one of these patches. I've done some data
collection comparing the patches for your consideration.
I've labeled the current code "Slow path", Szabolcs's version "No Slow path",
and my version "Patrick's exp()".
Comparisons between Slow path, No Slow path, and Patrick's exp()
The existing code is assumed accurate, with 0 ulp diffs. Removing the slow
path gets 1 error on the current "make check" test suite. Running ten
million numbers with each rounding mode shows that removing the slow path
gives an average of only 4-5 diffs of 1 ulp per ten million tests. That is
still extremely accurate.
I also measured how often the slow path was taken for those same ten
million values. It was taken approximately 135 times per ten million tests,
but it usually returns the same value as the fast path. The counts are
slightly different for different rounding modes.
Patrick's exp() also gets only 1 error on the current "make check" test
suite, on the same test value as the "no slow path" code. It gets approximately
16000 diffs of 1 ulp per ten million tests, which is somewhat higher
than the "no slow path" code but still relatively rare.
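For reference, here is a minimal sketch of one way such ulp diffs could be counted. It is not the harness used for the numbers above; the names ordered_bits, ulp_distance and count_ulp_diffs are hypothetical, and it assumes IEEE 754 binary64 doubles with no NaN values in the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a double's bit pattern onto a scale that is monotonic in the
   value, so that adjacent doubles differ by exactly 1.
   Assumes IEEE 754 binary64.  */
static int64_t
ordered_bits (double x)
{
  int64_t i;
  memcpy (&i, &x, sizeof i);
  /* Negative doubles sort in reverse bit order; flip them so that
     -0.0 and +0.0 both map to 0.  */
  return i < 0 ? INT64_MIN - i : i;
}

/* Distance between A and B measured in representable doubles (ulps).  */
static uint64_t
ulp_distance (double a, double b)
{
  assert (a == a && b == b);	/* NaN is not handled in this sketch.  */
  int64_t ia = ordered_bits (a);
  int64_t ib = ordered_bits (b);
  /* Subtract as unsigned to avoid signed overflow for distant values.  */
  return ia >= ib ? (uint64_t) ia - (uint64_t) ib
		  : (uint64_t) ib - (uint64_t) ia;
}

/* Count the entries of GOT that differ from REF and store the largest
   difference, in ulps, in *MAX_ULP.  */
static size_t
count_ulp_diffs (const double *ref, const double *got, size_t n,
		 uint64_t *max_ulp)
{
  size_t count = 0;
  *max_ulp = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t d = ulp_distance (ref[i], got[i]);
      if (d != 0)
	count++;
      if (d > *max_ulp)
	*max_ulp = d;
    }
  return count;
}
```

Filling ref[] from the existing code and got[] from a patched exp() over the same inputs, once per rounding mode, would give counts comparable in kind to the ones quoted here.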
              sparc (nsec)                  x86 (nsec)
        slow  no slow  patrick      slow  no slow  patrick
max    17584      710      873      5158      299      275
min      399      398       96        15       15       15
mean    5497      538      419      1333       28       24
Repeated runs show about 2% variance for identical tests.
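As an illustration of how max/min/mean figures of this kind can be gathered, here is a rough sketch. It is not the benchmark that produced the table; time_calls, timing_stats and stand_in_fn are hypothetical names, and it assumes a POSIX system with clock_gettime(). Timing each call separately includes the timer overhead, so a real harness would more likely time batches of calls.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Per-call timing summary in nanoseconds.  */
struct timing_stats
{
  uint64_t min_ns;
  uint64_t max_ns;
  double mean_ns;
};

/* Hypothetical stand-in for the function under test, so that the
   sketch does not depend on libm.  Pass exp here in a real run.  */
static double
stand_in_fn (double x)
{
  return 1.0 + x * (1.0 + x * 0.5);
}

/* Current monotonic time in nanoseconds.  */
static uint64_t
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (uint64_t) ts.tv_sec * UINT64_C (1000000000) + (uint64_t) ts.tv_nsec;
}

/* Call FN once per input and record min, max and mean time per call.
   The measured time includes the overhead of the two timer reads.  */
static struct timing_stats
time_calls (double (*fn) (double), const double *inputs, size_t n)
{
  assert (n > 0);
  struct timing_stats s = { UINT64_MAX, 0, 0.0 };
  uint64_t total = 0;
  volatile double sink;		/* Keeps the call from being optimized away.  */
  for (size_t i = 0; i < n; i++)
    {
      uint64_t t0 = now_ns ();
      sink = fn (inputs[i]);
      uint64_t dt = now_ns () - t0;
      if (dt < s.min_ns)
	s.min_ns = dt;
      if (dt > s.max_ns)
	s.max_ns = dt;
      total += dt;
    }
  (void) sink;
  s.mean_ns = (double) total / (double) n;
  return s;
}
```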
Notes: Removing the slow path is a huge performance win
on this set of values.
Compared with the "no slow path" code, Patrick's version of exp() is
28% faster on Sparc and 14% faster on x86.
In addition, the existing code (both the "slow" and "no slow" versions) uses
data tables with 13808 bytes for interpolation. Patrick's version
uses data tables with 3168 bytes for interpolation. It is hard
to predict what impact the extra 10K bytes might have on
real applications' usage of L1 and L2 cache on various architectures.
Patrick's version could be modified to use larger data tables
to improve accuracy with no loss of performance in the glibc tests,
but it would not approach the "no slow" accuracy levels.
Both the "no slow path" and "Patrick's exp()" show major performance
gains with relatively rare 1 ulp differences in results. The "no slow
path" has the advantage of errors being extremely rare while
"Patrick's exp()" has the advantage of being 14-28% faster.
Any thoughts on general principles for deciding which patch
to accept, given that both seem much better than the existing code?
- Patrick McGehearty
On 2/8/2018 5:40 AM, Wilco Dijkstra wrote:
>> Has there been a serious discussion in the past of what degree
>> of accuracy glibc/libm should provide in rounding modes other than
>> round-to-nearest? If a consensus decision were made that
>> other rounding modes were allowed slightly greater ulp diffs,
>> we could remove all the rounding mode checks and get
>> faster code. Failing that consensus, I don't see how we
>> can bypass the rounding mode checks for the generic code.

> There have been various discussions, but nothing conclusive. I believe the
> rounding mode changes can be removed from all the key math functions if we
> accept 1 extra ULP in non-nearest rounding modes. As Szabolcs mentioned
> there are some round-to-int idioms used by math functions which rely on a
> specific rounding mode, but we can fix those.
>
> If rounding errors in the more complex functions go up (some are very
> sensitive to ULP), we could consider adding the rounding mode changes there -
> that means you only do it where absolutely necessary, and also in cases where
> the relative overhead is much lower.
>
> Or alternatively we could agree that we don't have a requirement to optimize
> math functions for absolute best possible ULP with different rounding modes,
> and accept larger ULP errors.

>> I'll look into comparing removing the slow path on Sparc and
>> x86, including running my own "10 million values" test to
>> get a sense of how frequently the slow path is triggered
>> and what the largest relative error that test observes is.
>> I'll also run timing tests.

> Yes I noticed that even when the slow path doesn't trigger, it has a significant
> overhead (log is 18% faster without the slow paths). Note we'll likely post patches
> for removing slow paths in exp and pow as well.