This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] v11 Improves __ieee754_exp() performance by greater than 5x on sparc/x86.

On 14/02/18 01:18, Patrick McGehearty wrote:
I note Szabolcs is proposing to modify ieee754_exp() to
remove the Slow path. Since my proposed patch contains
substantial changes to ieee754_exp(), it makes sense to
only make one of these patches. I've done some data
collection comparing the patches for your consideration.

I've labeled the current code "Slow path", Szabolcs's version "No Slow path",
and my version "Patrick's exp()".

Comparisons between Slow path, No Slow path, and Patrick's exp()

The existing code is assumed to be accurate, with 0 ulp diffs.  Removing the
slow path gets 1 error on the current "make check" test suite.  Running ten
million numbers with each rounding mode shows that removing the slow path
gives, on average, only 4-5 results that differ by 1 ulp per ten million
tests. That is still extremely accurate.

I also measured how often the slow path was taken for those same ten
million values. It was taken approximately 135 times per ten million tests,
but it usually returns the same value as the fast path.  The counts are
slightly different for different rounding modes.

Patrick's exp() also gets only 1 error on the current "make check" test suite,
on the same test value as the "no slow path" code. It gets approximately
16000 diffs of 1 ulp per 10 million tests, which is somewhat higher
than the "no slow path" code but still relatively rare.
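For readers who want to reproduce this kind of count, here is a minimal sketch of tallying ulp differences. It is not the harness used for the numbers above. The names ordered_key, ulp_diff, reference_exp and count_diffs are hypothetical, the reference value comes from Python's decimal module at 50 digits (assumed to be enough for a correctly rounded double in nearly all cases), and only round-to-nearest is covered, whereas the tests above ran each rounding mode.

```python
import math
import struct
from decimal import Decimal, getcontext


def ordered_key(x):
    """Map a double to an integer so that adjacent doubles differ by 1."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    magnitude = bits & ((1 << 63) - 1)
    # Negative doubles get negative keys; +0.0 and -0.0 both map to 0.
    return -magnitude if bits >> 63 else magnitude


def ulp_diff(a, b):
    """Number of representable doubles between a and b (0 if identical)."""
    return abs(ordered_key(a) - ordered_key(b))


def reference_exp(x):
    """High-precision exp(x), rounded to the nearest double.

    Assumption: 50 significant digits is precise enough here.
    """
    getcontext().prec = 50
    return float(Decimal(x).exp())


def count_diffs(candidate, inputs):
    """Histogram of ulp differences between candidate(x) and the reference."""
    counts = {}
    for x in inputs:
        d = ulp_diff(candidate(x), reference_exp(x))
        counts[d] = counts.get(d, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [i / 1000.0 for i in range(-5000, 5001)]
    print(count_diffs(math.exp, sample))
```

Passing a different candidate function to count_diffs allows the same comparison for another implementation.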


         sparc (nsec)                 x86 (nsec)
        slow  no slow  patrick     slow  no slow  patrick
max    17584      710      873     5158      299      275
min      399      398       96       15       15       15
mean    5497      538      419     1333       28       24

Repeated runs show about 2% variance for identical tests.

Notes: Removing the slow path is a huge performance win
on this set of values.
Patrick's version of exp() is 28% faster on Sparc and 14% faster on x86.
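The percentages can be recomputed from the "mean" row of the table. The sketch below shows two common conventions, throughput speedup (old/new - 1) and time saved (1 - new/old). The function names are hypothetical. The table means are rounded to whole nanoseconds, so the recomputed values need not match the quoted figures exactly.

```python
# Mean times in nanoseconds, copied from the table above.
MEANS_NS = {
    "sparc": {"slow": 5497, "no slow": 538, "patrick": 419},
    "x86": {"slow": 1333, "no slow": 28, "patrick": 24},
}


def speedup_percent(baseline_ns, new_ns):
    """How much faster the new code is, as a percentage of its own speed."""
    return (baseline_ns / new_ns - 1.0) * 100.0


def time_saved_percent(baseline_ns, new_ns):
    """How much less time the new code takes, as a percentage of the baseline."""
    return (1.0 - new_ns / baseline_ns) * 100.0


if __name__ == "__main__":
    for target, row in MEANS_NS.items():
        base, new = row["no slow"], row["patrick"]
        print(
            f"{target}: patrick vs no slow: "
            f"{speedup_percent(base, new):.1f}% faster, "
            f"{time_saved_percent(base, new):.1f}% less time"
        )
```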

In addition, the existing code (both the "slow" and "no slow" versions) uses
data tables with 13808 bytes for interpolation. Patrick's version
uses data tables with 3168 bytes for interpolation. It is hard
to predict what impact the extra 10K bytes might have on
real applications' usage of L1 and L2 cache on various architectures.
Patrick's version could be modified to use larger data tables
to improve accuracy with no loss of performance in the glibc tests,
but it would not approach the "no slow" accuracy levels.
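To put the table sizes in cache terms, the arithmetic can be sketched as follows. The 64-byte cache line and the 32 KiB L1 data cache are assumptions chosen for illustration, not properties of any particular sparc or x86 part, and the function names are hypothetical.

```python
CURRENT_TABLE_BYTES = 13808   # "slow" and "no slow" versions (from the text)
PATRICK_TABLE_BYTES = 3168    # Patrick's version (from the text)

CACHE_LINE_BYTES = 64         # assumption: typical cache line size
L1D_BYTES = 32 * 1024         # assumption: hypothetical 32 KiB L1 data cache


def cache_lines(nbytes, line_bytes=CACHE_LINE_BYTES):
    """Minimum number of cache lines needed to hold nbytes (ceiling division)."""
    return -(-nbytes // line_bytes)


def l1_fraction(nbytes, l1_bytes=L1D_BYTES):
    """Fraction of the assumed L1 data cache the table would occupy."""
    return nbytes / l1_bytes


if __name__ == "__main__":
    extra = CURRENT_TABLE_BYTES - PATRICK_TABLE_BYTES
    print(f"extra table bytes: {extra}")
    for name, size in (("current", CURRENT_TABLE_BYTES),
                       ("patrick", PATRICK_TABLE_BYTES)):
        print(f"{name}: {cache_lines(size)} lines, "
              f"{l1_fraction(size):.1%} of assumed L1d")
```

Only the table entries a workload actually touches compete for cache, so these are upper bounds on the footprint.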

did some more work on exp.

the 'patrick' version uses different methods for small values
(< 3/2 ln2) and larger ones.

previously i benchmarked with large values, on those the current
glibc code (no slow) is actually faster than patrick on aarch64.

when i benchmark with small values (i suspect that's more common
in practice) then the patrick version is reasonably fast.

i use a single method (nsz exp): on larger inputs it's about 30%
latency improvement compared to noslow and patrick, on small
values i get a tiny bit better latency than patrick (2-3%).

however that relies on having single instruction, rounding mode
independent toint (aarch64), when i change the code to be portable
then it is slower on small values compared to patrick (almost 10%),
on large values it's still about 25% faster.
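For context on why the conversion to integer matters, here is a sketch of one widely used portable technique: add and then subtract a large constant so the floating-point addition itself does the rounding. Whether the portable variant benchmarked above works exactly this way is an assumption. The name toint_shift is hypothetical, and the sketch is only valid for |z| well below 2^51.

```python
import struct

# 1.5 * 2**52: doubles near this value are spaced exactly 1 apart.
SHIFT = 1.5 * 2.0 ** 52


def toint_shift(z):
    """Round z to an integer using an add/subtract of a large constant.

    Returns (k, kd): k is the integer read from the low mantissa bits,
    kd is the same value as a double. The rounding happens in the addition,
    so the result follows the current floating-point rounding mode
    (round-to-nearest-even in a default Python process).
    """
    shifted = z + SHIFT
    bits = struct.unpack("<q", struct.pack("<d", shifted))[0]
    # The 52-bit mantissa field holds 2**51 + k for values in [2**52, 2**53).
    k = (bits & ((1 << 52) - 1)) - (1 << 51)
    kd = shifted - SHIFT
    return k, kd


if __name__ == "__main__":
    for z in (0.4, 1.5, 2.5, -3.2, 123.75):
        print(z, toint_shift(z))
```

Because the result follows whatever rounding mode is active, code built on this technique has to either tolerate a different k under directed rounding or set the rounding mode around the computation.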

so i think i have something that's good for aarch64 and i think
it may be an improvement on all targets compared to noslow,
but it's not better than patrick version for small values on
most targets.

(i removed rounding mode settings from patrick, noslow and nsz.
that should be valid for nsz exp and i think for patrick too;
i don't remember why the rounding mode changes were needed there)

it needs a bit more work still before i can post something.

Both the "no slow path" and "Patrick's exp()" show major performance
gains with relatively rare 1 ulp differences in results. The "no slow
path" has the advantage of errors being extremely rare while
"Patrick's exp()" has the advantage of being 14-28% faster.

Any thoughts on general principles for deciding which patch
to accept, given that both seem much better than the existing code?

- Patrick McGehearty
