Ondrej Bilka [Mon, 20 May 2013 06:26:00 +0000 (08:26 +0200)]
Faster memset on x64
This implementation speeds up memset in several ways. The first is
avoiding an expensive computed jump. The second is exploiting the fact
that the arguments of memset are aligned to 8 bytes most of the time.
Benchmark results are available at:
kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_result27_04_13.tar.bz2
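The two ideas can be illustrated with a minimal C sketch. It is not the x86-64 assembly from this commit; the name sketch_memset and the 8-byte threshold are assumptions made for illustration. Instead of jumping through a table indexed by size, it writes the first and last 8 bytes with stores that may overlap, then fills the middle with 8-byte-aligned stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration, not the glibc implementation. */
void *sketch_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Replicate the fill byte into all 8 bytes of a word. */
    uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;

    if (n < 8) {
        /* Short case: a plain byte loop keeps the sketch simple. */
        for (size_t i = 0; i < n; i++)
            p[i] = (unsigned char)c;
        return dst;
    }

    /* Head and tail: two 8-byte stores that may overlap each other.
       A fixed-size memcpy is the portable way to spell an unaligned store. */
    memcpy(p, &pattern, 8);
    memcpy(p + n - 8, &pattern, 8);

    /* Middle: 8-byte stores starting at the first aligned address after p.
       When p is already aligned, that address is simply p + 8. */
    unsigned char *q = (unsigned char *)(((uintptr_t)p + 8) & ~(uintptr_t)7);
    unsigned char *end = p + n - 8;
    for (; q < end; q += 8)
        memcpy(q, &pattern, 8);

    return dst;
}
```

Only two size classes need a branch here, so no computed jump is required.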
Ondrej Bilka [Mon, 20 May 2013 06:20:00 +0000 (08:20 +0200)]
Faster memcpy on x64.
We add a new memcpy version that uses unaligned loads, which are fast
on modern processors. This allows a second improvement: avoiding the
computed jump, which is a relatively expensive operation.
Tests are available here:
http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
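A minimal C sketch of the approach, assuming nothing about the actual assembly in this commit: the name sketch_memcpy and the size classes are hypothetical. Each size class is handled by loads and stores anchored at the start and at the end of the buffer, which may overlap, so a handful of range checks replaces a jump table with one entry per size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration, not the glibc implementation.
   Like memcpy, it assumes that the two buffers do not overlap. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 16) {
        /* Copy whole 16-byte blocks, then one final block that ends
           exactly at the end of the buffer and may overlap the last
           block written by the loop. */
        for (size_t i = 0; i + 16 <= n; i += 16)
            memcpy(d + i, s + i, 16);
        memcpy(d + n - 16, s + n - 16, 16);
    } else if (n >= 8) {
        uint64_t head, tail;
        memcpy(&head, s, 8);          /* unaligned load of the first 8 bytes */
        memcpy(&tail, s + n - 8, 8);  /* unaligned load of the last 8 bytes */
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n >= 2) {
        uint16_t head, tail;
        memcpy(&head, s, 2);
        memcpy(&tail, s + n - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + n - 2, &tail, 2);
    } else if (n == 1) {
        d[0] = s[0];
    }
    return dst;
}
```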
The compiler would optimize the benchmark function call out of the
loop and call it only once, resulting in blazingly fast times for some
benchmarks (notably atan, sin and cos). Mark the inputs as volatile
so that the code is forced to read again from the input for each
iteration.
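The effect of the volatile qualifier can be sketched as follows. The names sketch_func, sketch_inputs and sketch_bench are hypothetical, and sketch_func merely stands in for a libm function such as sin; this is not the benchmark skeleton itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for the libm function under test. */
static double sketch_func(double x)
{
    return x * x + 1.0;
}

/* volatile forces a fresh load of the input on every iteration, so the
   compiler cannot treat the call as loop-invariant and hoist it. */
static volatile double sketch_inputs[] = { 0.5, 1.0, 2.0 };

double sketch_bench(size_t iters)
{
    const size_t n = sizeof sketch_inputs / sizeof sketch_inputs[0];
    double sink = 0.0;

    for (size_t v = 0; v < n; v++)
        for (size_t i = 0; i < iters; i++)
            sink += sketch_func(sketch_inputs[v]);  /* re-read each time */

    return sink;
}
```

Without the qualifier, the inner loop could legally be reduced to a single call followed by a multiplication.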
[BZ #15442] This adds support for the inverse interpretation of the
quiet bit of IEEE 754 floating-point NaN data that some processors
use. This includes in particular MIPS architecture processors; the
payload used for the canonical qNaN encoding is updated accordingly
so as not to interfere with the quiet bit.
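The two interpretations of the quiet bit can be illustrated for the double format with a short C sketch. The function and macro names are hypothetical and the code is not taken from the patch; it only classifies raw bit patterns.

```c
#include <stdint.h>

#define SKETCH_EXP_MASK  0x7ff0000000000000ULL  /* exponent field */
#define SKETCH_FRAC_MASK 0x000fffffffffffffULL  /* fraction field */
#define SKETCH_TOP_FRAC  0x0008000000000000ULL  /* bit 51, the quiet bit */

/* A NaN has an all-ones exponent and a non-zero fraction. */
int sketch_is_nan(uint64_t bits)
{
    return (bits & SKETCH_EXP_MASK) == SKETCH_EXP_MASK
           && (bits & SKETCH_FRAC_MASK) != 0;
}

/* inverted_quiet_bit == 0: top fraction bit set means quiet.
   inverted_quiet_bit != 0: top fraction bit set means signaling. */
int sketch_is_quiet_nan(uint64_t bits, int inverted_quiet_bit)
{
    if (!sketch_is_nan(bits))
        return 0;
    int top = (bits & SKETCH_TOP_FRAC) != 0;
    return inverted_quiet_bit ? !top : top;
}
```

Under the inverted interpretation a quiet NaN has the top fraction bit clear, so at least one lower payload bit must be set; otherwise the pattern would encode infinity.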
Carlos O'Donell [Wed, 15 May 2013 21:19:20 +0000 (17:19 -0400)]
hppa: Cleanup libm-test-ulps.
Joseph Myers noted that there were several old and really very
incorrect values in the hppa libm-test-ulps. This patch removes
all of the ulps values for ceil, floor, rint, round, trunc,
llrint, and llround, all of which were previously incorrectly
added (including some negative values which are really wrong).
---
ports/
2013-05-15 Carlos O'Donell <carlos@redhat.com>
* sysdeps/hppa/fpu/libm-test-ulps: Remove old values for ceil, floor,
rint, round, trunc, llrint, and llround.
Use x constraints for operands to vfmaddss and vfmaddsd
While these instructions accept memory operands, only one operand
may be a memory operand. Giving two operands xm constraints gives
the compiler the option of using memory for both operands, which
would result in invalid assembly code. Using x for all operands is
more appropriate, as most x86_64 calling conventions will pass the
arguments in registers anyway.
2013-05-15 Peter Collingbourne <pcc@google.com>
* sysdeps/x86_64/fpu/multiarch/s_fma.c (__fma_fma4): Replace xm
constraints with x constraints.
* sysdeps/x86_64/fpu/multiarch/s_fmaf.c (__fmaf_fma4): Likewise.
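The constraint choice can be illustrated with a small GCC-style inline assembly sketch. Because vfmaddss and vfmaddsd need FMA4 hardware, this example uses the SSE2 addsd instruction instead so that it runs on any x86_64 processor; the helper name sketch_add is hypothetical, and a plain C fallback is assumed for other targets.

```c
/* Hypothetical helper: adds two doubles. */
double sketch_add(double a, double b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* "+x": a is read and written in an SSE register.
       "x":  b must also live in an SSE register. Writing "xm" instead
             would additionally let the compiler pick a memory operand. */
    __asm__ ("addsd %1, %0" : "+x" (a) : "x" (b));
    return a;
#else
    /* Fallback so that the sketch builds on other targets. */
    return a + b;
#endif
}
```

With every operand constrained to "x", the compiler can never emit a form with more memory operands than the instruction accepts.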
It is impossible to create an alias of a common symbol (as
compat_symbol does), because common symbols do not have a section or
an offset until linked. GNU as tolerates aliases of common symbols by
simply creating another common symbol, but other assemblers (notably
LLVM's integrated assembler) are less tolerant.
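A minimal sketch of the distinction, assuming an ELF target and a GCC-compatible compiler. The names sketch_target, sketch_alias and sketch_read_through_alias are hypothetical, and the plain alias attribute is used here rather than compat_symbol.

```c
/* An explicit initializer makes this a regular definition with a section
   and an offset, rather than a tentative definition that the compiler
   may emit as a common symbol. */
int sketch_target = 0;

#if defined(__ELF__) && defined(__GNUC__)
/* A second name for the same object. */
extern __typeof (sketch_target) sketch_alias
    __attribute__ ((alias ("sketch_target")));
#else
/* No alias support assumed: fall back to a macro so the sketch builds. */
#define sketch_alias sketch_target
#endif

/* Writes through the alias and reads back through the original name. */
int sketch_read_through_alias(int value)
{
    sketch_alias = value;
    return sketch_target;
}
```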
Carlos O'Donell [Wed, 15 May 2013 16:42:59 +0000 (12:42 -0400)]
hppa: Update libm-test-ulps
Update libm-test-ulps for hppa. There are a few entries
with 4 or 5 ulps, but these appear to be expected. A more
thorough review will be required if hppa switches long-double
to a different type.
Carlos O'Donell [Wed, 15 May 2013 15:47:47 +0000 (11:47 -0400)]
hppa: Fix _FPU_GETCW and _FPU_SETCW.
The following patch fixes both _FPU_GETCW and
_FPU_SETCW for hppa. The initial implementation was
flawed and not well tested. We failed to set cw,
and passed in the value of a register to fldd.
This patch fixes both of those errors and allows
the libm tests to pass without failure.
Signed-off-by: Guy Martin <gmsoft@tuxicoman.be>
Signed-off-by: Carlos O'Donell <carlos@redhat.com>
---
2013-05-15 Guy Martin <gmsoft@tuxicoman.be>
Carlos O'Donell <carlos@redhat.com>
[BZ #15000]
* ports/sysdeps/hppa/fpu/fpu_control.h (_FPU_GETCW): Set cw.
(_FPU_SETCW): Pass address to fldd.
Carlos O'Donell [Tue, 14 May 2013 04:06:35 +0000 (00:06 -0400)]
Add comments to vDSO hwcap loading process.
Loading of the vDSO pseudo-hwcap from the type 2 GNU note is
a rather arcane and poorly documented process. Given that I had
a chance to review this code today I thought I would add all
of the things I had to look up to verify the validity of the
process.
With a single .note.GNU the vDSO can register up to 64 flags,
though in practice you are limited to 64 - _DL_FIRST_EXTRA
bits which on x86 is 12 bits.
The only use of this that I know of is in the Xen support
in Linux where they use the 1st bit to indicate "nosegneg".
I see "We use bit 1 to avoid bugs in some versions of glibc
when bit 0 is used; the choice is otherwise arbitrary.", but
no reference to a glibc bug anywhere. The code as-is should
support bit zero, so we still have that free for future use.
The kernel, glibc, and ld.so.cache must coordinate to ensure
that bit values don't go too high and are used consistently.
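The bit budget can be worked through in a few lines of C. This is a sketch under stated assumptions: SKETCH_FIRST_EXTRA is set to 52 only because that value yields the 12 bits quoted above, and the function names, as well as the idea that note bits are shifted above the first SKETCH_FIRST_EXTRA bits of a 64-bit mask, are illustrative rather than copied from the loader.

```c
#include <stdint.h>

/* Assumed value for illustration: 64 - 52 leaves 12 usable bits. */
#define SKETCH_FIRST_EXTRA 52

/* Number of note bits that fit into a 64-bit hwcap mask. */
unsigned sketch_usable_note_bits(void)
{
    return 64 - SKETCH_FIRST_EXTRA;
}

/* Assumption: note bits are placed above the first SKETCH_FIRST_EXTRA
   bits. Note bits at or beyond the usable count fall off the top. */
uint64_t sketch_merge_hwcap(uint64_t hwcap, uint64_t note_mask)
{
    return hwcap | (note_mask << SKETCH_FIRST_EXTRA);
}
```

Under these assumptions, note bit 1 (the "nosegneg" case) lands on bit 53 of the merged mask, and note bit 12 would be silently dropped.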
HP_TIMING uses native timestamping instructions if available, thus
greatly reducing the overhead of recording start and end times for
function calls. For architectures that don't have HP_TIMING
available, we fall back to the clock_gettime bits. One may also
override this by invoking the benchmark as follows:
make USE_CLOCK_GETTIME=1 bench
and get the benchmark results using clock_gettime. One has to run
`make bench-clean` beforehand to ensure that the benchmark programs
are rebuilt.
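The selection between a native timestamp instruction and clock_gettime can be sketched as below. This is not the HP_TIMING code: the names sketch_timing_t and sketch_now are hypothetical, the compiler builtin for rdtsc is assumed to be available on x86_64 with GCC-compatible compilers, and the choice of CLOCK_MONOTONIC is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

typedef uint64_t sketch_timing_t;

#if defined(__x86_64__) && defined(__GNUC__) && !defined(USE_CLOCK_GETTIME)
/* Native path: read the timestamp counter. The unit is ticks. */
static inline sketch_timing_t sketch_now(void)
{
    return __builtin_ia32_rdtsc();
}
#else
/* Fallback path: a system clock. The unit is nanoseconds. */
static inline sketch_timing_t sketch_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif
```

A measurement is then the difference between two sketch_now() values taken around the call; the two paths report different units, so results from one are not directly comparable with the other.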
Carlos O'Donell [Thu, 9 May 2013 21:37:15 +0000 (17:37 -0400)]
Add more comments to dlclose() algorithm.
The algorithm for scanning dependencies upon dlclose is
less than immediately obvious. This patch adds two bits
of comments that explain why you start the dependency
search at l_initfini[1], and why you need to restart
the search.
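The two points can be illustrated with a heavily simplified C sketch. It is not the loader code: struct sketch_map, its fields and sketch_mark_used are hypothetical, and the sketch assumes that entry 0 of each dependency list is the object itself, which is why scanning starts at index 1. When a dependency at an earlier position is newly marked as used, the scan goes back to that position, because that object was skipped while it still looked unused.

```c
#include <stddef.h>

#define SKETCH_MAX_DEPS 4

struct sketch_map {
    int direct_use;  /* still referenced directly by a user */
    int used;        /* computed: the object must stay loaded */
    int done;        /* its dependency list was already scanned */
    size_t ndeps;    /* entries in initfini, including self at [0] */
    size_t initfini[SKETCH_MAX_DEPS];  /* indices into the map array */
};

void sketch_mark_used(struct sketch_map *maps, size_t nmaps)
{
    for (size_t i = 0; i < nmaps; i++) {
        maps[i].used = maps[i].direct_use;
        maps[i].done = 0;
    }

    size_t i = 0;
    while (i < nmaps) {
        struct sketch_map *m = &maps[i];
        size_t next = i + 1;

        if (m->used && !m->done) {
            m->done = 1;
            /* Start at 1: entry 0 is the object itself. */
            for (size_t k = 1; k < m->ndeps; k++) {
                size_t dep = m->initfini[k];
                if (!maps[dep].used) {
                    maps[dep].used = 1;
                    /* An earlier object was skipped as unused:
                       restart the scan from it. */
                    if (dep < next)
                        next = dep;
                }
            }
        }
        i = next;
    }
}
```

Objects left with used == 0 after the scan are the candidates for unloading in this sketch.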