Szabolcs Nagy [Fri, 25 Nov 2022 18:16:07 +0000 (18:16 +0000)]
aarch64: Define jmp_buf offset for GCS
The target specific internal __longjmp is called with a __jmp_buf
argument which has its size exposed in the ABI. On aarch64 this has
no space left, so GCSPR cannot be restored in longjmp in the usual
way, which is needed for the Guarded Control Stack (GCS) extension.
setjmp is implemented via __sigsetjmp which has a jmp_buf argument;
however, it is also called with a __pthread_unwind_buf_t argument cast
to jmp_buf (in cancellation cleanup code built with -fno-exceptions).
The two types, jmp_buf and __pthread_unwind_buf_t, have common bits
beyond the __jmp_buf field and there is unused space there which we
can use for saving GCSPR.
For this to work some bits of those two generic types have to be
reserved for target specific use and the generic code in glibc has
to ensure that __longjmp is always called with a __jmp_buf that is
embedded into one of those two types. Morally __longjmp should be
changed to take jmp_buf as argument, but that is an intrusive change
across targets.
Note: longjmp is never called with __pthread_unwind_buf_t from user
code, only the internal __libc_longjmp is called with that type and
thus the two types could have separate longjmp implementations on a
target. We don't rely on this now (but might in the future given that
cancellation unwind does not need to restore GCSPR).
Given the above this patch finds an unused slot for GCSPR. This
placement is not exposed in the ABI so it may change in the future.
This is also very target ABI specific so the generic types cannot
be easily changed to clearly mark the reserved fields.
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Szabolcs Nagy [Wed, 22 Feb 2023 14:35:00 +0000 (14:35 +0000)]
aarch64: Add asm helpers for GCS
The Guarded Control Stack instructions can be present even if the
hardware does not support the extension (runtime checked feature),
so the asm code should be backward compatible with old assemblers.
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Malte Skarupke [Wed, 4 Dec 2024 13:05:40 +0000 (08:05 -0500)]
nptl: Use all of g1_start and g_signals
The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:
"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"
I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that (and even if it did, why would you want to force a
sleep in that case?), and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Malte Skarupke [Wed, 4 Dec 2024 13:04:10 +0000 (08:04 -0500)]
nptl: Fix indentation
In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Malte Skarupke [Wed, 4 Dec 2024 13:03:44 +0000 (08:03 -0500)]
nptl: Use a single loop in pthread_cond_wait instead of a nested loop
The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
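The restructuring described above can be illustrated with a small, self-contained sketch. It is not the pthread_cond_wait code: struct ex_state, ex_sleep and ex_try_take are hypothetical stand-ins for the group state, the futex wait and the compare-and-exchange. It only shows how an inner loop with a single break plus a nearly empty outer loop collapses into one loop that needs no goto.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state standing in for a condvar group; not glibc's layout.  */
struct ex_state
{
  int signals;        /* Signals currently available.  */
  int take_failures;  /* How often ex_try_take fails before succeeding.  */
  int sleeps;         /* Number of simulated sleeps so far.  */
  int signal_after;   /* A signal arrives on this sleep (0 = never).  */
  int close_after;    /* The group closes on this sleep (0 = never).  */
  bool closed;
};

/* Stand-in for the futex wait: a signal may arrive or the group may close.  */
static void
ex_sleep (struct ex_state *s)
{
  s->sleeps++;
  if (s->sleeps == s->signal_after)
    s->signals++;
  if (s->sleeps == s->close_after)
    s->closed = true;
}

/* Stand-in for the compare-and-exchange that consumes one signal.  */
static bool
ex_try_take (struct ex_state *s)
{
  assert (s->signals > 0);
  if (s->take_failures > 0)
    {
      s->take_failures--;
      return false;
    }
  s->signals--;
  return true;
}

/* Before: inner loop with one break, nearly empty outer loop, and a goto.  */
static bool
ex_wait_nested (struct ex_state *s)
{
  do
    {
      while (1)
        {
          if (s->closed)
            goto done;
          if (s->signals > 0)
            break;              /* The only break out of the inner loop.  */
          ex_sleep (s);
        }
    }
  while (!ex_try_take (s));     /* All the outer loop does is retry.  */
  return true;
done:
  return false;
}

/* After: the outer loop's work moves to the former break site.  */
static bool
ex_wait_single (struct ex_state *s)
{
  bool took = false;
  while (1)
    {
      if (s->closed)
        break;
      if (s->signals > 0)
        {
          if (ex_try_take (s))
            {
              took = true;
              break;
            }
          continue;             /* Take failed: re-evaluate from the top.  */
        }
      ex_sleep (s);
    }
  return took;
}
```

Both functions perform the same sequence of checks, sleeps and take attempts for a given starting state; only the control flow differs.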
Malte Skarupke [Wed, 4 Dec 2024 12:56:38 +0000 (07:56 -0500)]
nptl: Remove g_refs from condition variables
This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Malte Skarupke [Wed, 4 Dec 2024 12:56:13 +0000 (07:56 -0500)]
nptl: Remove unnecessary quadruple check in pthread_cond_wait
pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:
1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
   for years.
2. Before deciding to go to sleep, and before incrementing g_refs. I kept this
   one.
3. After incrementing g_refs. There is no reason to think that the group would
   close while we do an atomic increment. Obviously it could close at any
   point, but that doesn't mean we have to recheck after every step. This
   check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
   we check that we're not in a closed group. There is no need to check again.
   The check would only have helped in cases where the compare_exchange in the
   next line would also have failed. Relying on the compare_exchange is fine.
Removing the duplicate checks clarifies the code.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Malte Skarupke [Wed, 4 Dec 2024 12:55:50 +0000 (07:55 -0500)]
nptl: Remove unnecessary catch-all-wake in condvar group switch
This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.
In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Frank Barrus [Wed, 4 Dec 2024 12:55:02 +0000 (07:55 -0500)]
pthreads NPTL: lost wakeup fix 2
This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch. This greatly reduces the average
and maximum latency for pthread_cond_signal.
The g_signals[] field now contains a signal count that is relative to
the current g1_start value. Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)
By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same. This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change. This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signals, since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.
By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore. The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit. If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.
The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters. Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2. They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.
Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block. This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.
Signed-off-by: Frank Barrus <frankbarrus_sw@shaggy.cc> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
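The effect of making the signal count relative to g1_start can be shown with a single-threaded model. This is only a sketch under assumptions: struct ex_condvar and the ex_* helpers are hypothetical and much simpler than the real condition variable (one group, no futex, g1_start not atomic). It keeps just the encoding described above: bits 31:1 of g_signals hold the low bits of g1_start plus the signal count, and bit 0 is reserved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-group model; not glibc's data layout.  */
struct ex_condvar
{
  uint64_t g1_start;           /* Bit 0: flag, bits 63:1: sequence number.  */
  _Atomic uint32_t g_signals;  /* Bit 0: reserved, bits 31:1: see above.  */
};

/* Bits 31:1 of g1_start, kept in place so they line up with g_signals.  */
static uint32_t
ex_low_bits (uint64_t g1_start)
{
  return (uint32_t) g1_start & ~(uint32_t) 1;
}

static void
ex_init (struct ex_condvar *cv, uint64_t seq)
{
  cv->g1_start = seq << 1;
  atomic_init (&cv->g_signals, ex_low_bits (cv->g1_start));
}

/* Post one signal: the count lives above the reserved bit, so add 2.  */
static void
ex_signal (struct ex_condvar *cv)
{
  atomic_fetch_add (&cv->g_signals, 2);
}

/* Group switch: the count is "zeroed" by advancing to the new start.  */
static void
ex_switch_group (struct ex_condvar *cv, uint64_t advance)
{
  assert (advance > 0);
  cv->g1_start += advance << 1;
  atomic_store (&cv->g_signals, ex_low_bits (cv->g1_start));
}

/* Signals a snapshot of g_signals represents for the current group.  */
static uint32_t
ex_count (struct ex_condvar *cv, uint32_t snapshot)
{
  return (snapshot - ex_low_bits (cv->g1_start)) >> 1;
}

/* Consume one signal; succeeds only if g_signals still equals the
   snapshot the caller based its decision on.  */
static bool
ex_try_consume (struct ex_condvar *cv, uint32_t snapshot)
{
  uint32_t expected = snapshot;
  return atomic_compare_exchange_strong (&cv->g_signals, &expected,
                                         snapshot - 2);
}
```

In this model a late waiter that took its snapshot before ex_switch_group cannot consume a signal posted afterwards: even when the new group again has exactly one signal, the stored value differs from the stale snapshot, so the compare-and-exchange fails. With an absolute count both states would hold the same value and the exchange would succeed.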
Florian Weimer [Thu, 16 Jan 2025 19:02:42 +0000 (20:02 +0100)]
Linux: Add tests that check that TLS and rseq area are separate
The new test elf/tst-rseq-tls-range-4096-static reliably detected
the extra TLS allocation problem (tcb_offset was dropped from
the allocation size) on aarch64. It also failed with a crash
in dlopen *before* the extra TLS changes, so TLS alignment with
static dlopen was already broken.
Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
Florian Weimer [Thu, 16 Jan 2025 19:02:42 +0000 (20:02 +0100)]
Consolidate TLS block allocation for static binaries with ld.so
Use the same code to compute the TLS block size and its alignment.
The code in elf/dl-tls.c is linked in anyway for all binaries
due to the reference to _dl_tls_static_surplus_init.
It is not possible to call _dl_allocate_tls_storage directly
because malloc is not available in the static case. (The
dynamic linker uses the minimal malloc at this stage.) Therefore,
split _dl_tls_block_size_with_pre and _dl_tls_block_align from
_dl_allocate_tls_storage, and call those new functions from
__libc_setup_tls.
This fixes extra TLS allocation for the static case, and apparently
some pre-existing bugs as well (the independent recomputation of
TLS block sizes in init_static_tls looks rather suspect).
Florian Weimer [Thu, 16 Jan 2025 19:02:42 +0000 (20:02 +0100)]
elf: Iterate over loaded object list in _dl_determine_tlsoffset
The old code used the slotinfo array as a scratch area to pass the
list of TLS-using objects to _dl_determine_tlsoffset. All array
entries are subsequently overwritten by _dl_add_to_slotinfo,
except the first one. The link maps are usually not at their
right position for their module ID in the slotinfo array, so
the initial use of the slotinfo array would be incorrect if not
for scratch purposes only.
In _dl_tls_initial_modid_limit_setup, the old code relied on some
link map being written to the first slotinfo entry. After the
change, this no longer happens because TLS module ID zero is unused.
It's also necessary to move the call after the real initialization
of the slotinfo array.
Florian Weimer [Thu, 16 Jan 2025 18:59:58 +0000 (19:59 +0100)]
benchtests: Add dummy input files for cospi, cospif, sinpi, sinpif, tanpi, tanpif
This fixes an AArch64 build failure:
  python3 -B ../sysdeps/aarch64/fpu/scripts/bench_libmvec_advsimd.py bench-float-advsimd-cospi > …/benchtests/bench-float-advsimd-cospi.c
  Traceback (most recent call last):
    File "…/sysdeps/aarch64/fpu/scripts/bench_libmvec_advsimd.py", line 106, in <module>
      main(sys.argv[1])
      ~~~~^^^^^^^^^^^^^
    File "…/sysdeps/aarch64/fpu/scripts/bench_libmvec_advsimd.py", line 81, in main
      with open(f"../benchtests/libmvec/{input_filename}") as f:
           ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: '../benchtests/libmvec/cospif-inputs'
Florian Weimer [Thu, 16 Jan 2025 17:45:25 +0000 (18:45 +0100)]
Linux: Fixes for getrandom fork handling
Careful updates of grnd_alloc.len are required to ensure that
after fork, grnd_alloc.states does not contain entries that
are also encountered by __getrandom_reset_state in TCBs.
For the same reason, it is necessary to overwrite the TCB state
pointer with NULL before updating grnd_alloc.states in
__getrandom_vdso_release.
Before this change, different TCBs could share the same getrandom
state after multi-threaded fork. This would be a critical security
bug (predictable randomness) if not caught during development.
The additional check in stdlib/tst-arc4random-thread makes it more
likely that the test fails due to the bugs mentioned above.
Both __getrandom_reset_state and __getrandom_vdso_release could
put reserved NULL pointers into the states array. This is also
fixed with this commit. After these changes, no null pointers were
observed in the states array during testing.
Stefan Liebler [Fri, 10 Jan 2025 17:55:50 +0000 (12:55 -0500)]
affinity-inheritance: Overallocate CPU sets
Some kernels on S390 appear to return a CPU affinity mask based on
configured processors rather than the ones online. Overallocate the CPU
set to match that, but operate only on the ones online.
mirabilos [Mon, 13 Jan 2025 14:24:37 +0000 (11:24 -0300)]
sh4: ensure FPSCR.PR==0 when executing FRCHG [BZ #27543]
If the bit is not 0, the operations FRCHG and FSCHG are
undefined and cause a trap; qemu now checks for this as
well, so we set it to 0 temporarily and restore the old
value in getcontext afterwards (setcontext/swapcontext
already do so).
From the discussion in the bug report, this can probably
be optimised in one place, but none of the people involved
are SH4 assembly experts, this patch is field-tested, and
it’s not a code path run often. The other question, what
happens if a signal occurs while the bit is temporarily 0,
is also still unsolved, but to fix that a kernel change is
most likely needed; this patch trades a certain trap on
many CPUs for a hard-to-get trap in a signal handler if a
signal is delivered during the few instructions the PR bit
is temporarily set to 0, so it’s not a regression for most
users.
See BZ and https://bugs.launchpad.net/qemu/+bug/1796520 for
related discussion, references and review comments.
Signed-off-by: mirabilos <tg@debian.org> Reviewed-by: Oleg Endo <olegendo@gcc.gnu.org> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Dan Luedtke [Wed, 4 Dec 2024 19:50:22 +0000 (11:50 -0800)]
inet: Add common IPv6 packet header macros
Adds commonly used IPv6 packet header macros similar to what is available
on NetBSD and FreeBSD in sys/netinet/ip6.h and Android in
libc/include/netinet/ip6.h
Usage example IPV6_VERSION_MASK and IPV6_VERSION:
  if ((ip6->ip6_vfc & IPV6_VERSION_MASK) == IPV6_VERSION)
    return true;
Usage example IPV6_FLOWINFO_MASK:
  ip6->ip6_flow = (flow & IPV6_FLOWINFO_MASK);
The relevant standard is RFC 2460 (Internet Protocol, Version 6
Specification). It defines the Internet Protocol version (IPV6_VERSION)
and reduces the size of the flow label field from 24 to 20 bits
(IPV6_FLOWLABEL_MASK). The traffic class and flow label fields together
make up the flow information (IPV6_FLOWINFO_MASK).
Tested on x86_64 GNU/Linux
Signed-off-by: Dan Luedtke <danrl@google.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
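The field layout behind these masks can be sketched on a raw header without depending on any particular <netinet/ip6.h>. The EX_ constants and ex_* helpers below are local, hypothetical stand-ins defined in host byte order after assembling the first four header bytes; their values follow the RFC 2460 layout (4-bit version, 8-bit traffic class, 20-bit flow label). The macros added by the patch are meant to be applied to the header fields as stored, so their numeric values may differ by byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for illustration only (host byte order).  */
#define EX_IPV6_VERSION        0x60u
#define EX_IPV6_VERSION_MASK   0xf0u
#define EX_IPV6_FLOWINFO_MASK  0x0fffffffu
#define EX_IPV6_FLOWLABEL_MASK 0x000fffffu

/* The version sits in the top four bits of the first header byte.  */
static bool
ex_is_ipv6 (const uint8_t *hdr)
{
  assert (hdr != NULL);
  return (hdr[0] & EX_IPV6_VERSION_MASK) == EX_IPV6_VERSION;
}

/* Assemble the first 32 bits of the header from network byte order.  */
static uint32_t
ex_first_word (const uint8_t *hdr)
{
  assert (hdr != NULL);
  return ((uint32_t) hdr[0] << 24) | ((uint32_t) hdr[1] << 16)
         | ((uint32_t) hdr[2] << 8) | (uint32_t) hdr[3];
}

/* Traffic class plus flow label: everything except the version.  */
static uint32_t
ex_flow_info (const uint8_t *hdr)
{
  return ex_first_word (hdr) & EX_IPV6_FLOWINFO_MASK;
}

/* The 20-bit flow label alone.  */
static uint32_t
ex_flow_label (const uint8_t *hdr)
{
  return ex_first_word (hdr) & EX_IPV6_FLOWLABEL_MASK;
}

/* The 8-bit traffic class sits between the version and the flow label.  */
static uint32_t
ex_traffic_class (const uint8_t *hdr)
{
  return (ex_flow_info (hdr) >> 20) & 0xffu;
}
```

For a header starting with the bytes 6a bc de f1, this yields flow information 0x0abcdef1, flow label 0x000cdef1 and traffic class 0xab.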
stdio-common: Suppress Clang warnings on scanf13.c with fortify enabled
clang-19 shows:
scanf13.c:28:40: error: 'sscanf' may overflow; destination buffer in argument 4 has size 8, but the corresponding specifier may require size 11 [-Werror,-Wfortify-source]
28 | "A%ms%10ms%4m[bcd]%4mcB", &sp1, &sp2, &sp3, &sp4) != 4)
| ^
scanf13.c:94:34: error: 'sscanf' may overflow; destination buffer in argument 3 has size 8, but the corresponding specifier may require size 2049 [-Werror,-Wfortify-source]
94 | if (sscanf (buf, "%2048ms%mc", &sp3, &sp4) != 2)
| ^
scanf13.c:110:61: error: 'sscanf' may overflow; destination buffer in argument 4 has size 8, but the corresponding specifier may require size 1501 [-Werror,-Wfortify-source]
110 | if (sscanf (buf, "%4mc%1500m[dr/]%548m[abc/d]%3mc", &sp1, &sp2, &sp3, &sp4)
| ^
scanf13.c:110:67: error: 'sscanf' may overflow; destination buffer in argument 5 has size 8, but the corresponding specifier may require size 549 [-Werror,-Wfortify-source]
110 | if (sscanf (buf, "%4mc%1500m[dr/]%548m[abc/d]%3mc", &sp1, &sp2, &sp3, &sp4)
clang does have some support to handle 'm' prefix for -Wformat; but it
lacks support for -Wfortify-source to understand that it is up to libc to
allocate the memory, and uses the pointer size instead to calculate
validity.
In this patch, the RPC is used to implement the monotonic clock for
mach.
* config.h.in: Add HAVE_HOST_GET_UPTIME64 config entry
* sysdeps/mach/clock_gettime.c: Add CLOCK_MONOTONIC case
* sysdeps/mach/configure: Check the existence of host_get_uptime64 RPC
* sysdeps/mach/configure.ac: Check the existence of host_get_uptime64 RPC
Michael Jeanson [Wed, 10 Jul 2024 19:48:49 +0000 (15:48 -0400)]
nptl: Move the rseq area to the 'extra TLS' block
Move the rseq area to the newly added 'extra TLS' block. This is the
last step in adding support for the rseq extended ABI. The size of the
rseq area is now dynamic and depends on the rseq features reported by
the kernel through the ELF auxiliary vector. This will allow
applications to use rseq features past the 32 bytes of the original rseq
ABI as they become available in future kernels.
Michael Jeanson [Thu, 1 Aug 2024 14:35:34 +0000 (10:35 -0400)]
nptl: Introduce <rseq-access.h> for RSEQ_* accessors
In preparation for moving the rseq area to the 'extra TLS' block, we need
accessors based on the thread pointer and the rseq offset. The ONCE
variant of the accessors ensures single-copy atomicity for loads and
stores, which is required for all fields once the registration is active.
A separate header is required to allow including <atomic.h> which
results in an include loop when added to <tcb-access.h>.
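As a rough illustration of what a ONCE-style accessor based on a thread pointer and an offset can look like, here is a minimal sketch. Everything in it is hypothetical: struct ex_area, struct ex_thread_block and the ex_*_once helpers are not the RSEQ_* macros, and the use of the GCC/Clang __atomic builtins with relaxed ordering is just one way to get a single, untorn access. The commit message does not say how the real accessors are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical area with two 32-bit fields.  */
struct ex_area
{
  uint32_t cpu_id_start;
  uint32_t cpu_id;
};

/* Hypothetical block: something TCB-like followed by the area.  */
struct ex_thread_block
{
  char tcb[64];
  struct ex_area area;
};

/* Locate a 32-bit field from a base pointer, the area offset and the
   field offset inside the area.  */
static inline uint32_t *
ex_field_u32 (void *base, ptrdiff_t area_offset, size_t field_offset)
{
  uint32_t *p = (uint32_t *) ((char *) base + area_offset + field_offset);
  /* A single-copy atomic access needs a naturally aligned field.  */
  assert ((uintptr_t) p % sizeof (uint32_t) == 0);
  return p;
}

/* Load the field exactly once, as one untorn access.  */
static inline uint32_t
ex_load_u32_once (void *base, ptrdiff_t area_offset, size_t field_offset)
{
  return __atomic_load_n (ex_field_u32 (base, area_offset, field_offset),
                          __ATOMIC_RELAXED);
}

/* Store the field exactly once, as one untorn access.  */
static inline void
ex_store_u32_once (void *base, ptrdiff_t area_offset, size_t field_offset,
                   uint32_t value)
{
  __atomic_store_n (ex_field_u32 (base, area_offset, field_offset), value,
                    __ATOMIC_RELAXED);
}
```

A caller would pass the address of the block as the base, offsetof (struct ex_thread_block, area) as the area offset and offsetof (struct ex_area, cpu_id) as the field offset.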
Michael Jeanson [Wed, 20 Nov 2024 22:28:07 +0000 (22:28 +0000)]
nptl: add rtld_hidden_proto to __rseq_size and __rseq_offset
This allows accessing the internal aliases of __rseq_size and
__rseq_offset from ld.so without ifdefs and avoids dynamic symbol
binding at run time for both variables.
Michael Jeanson [Wed, 10 Jul 2024 19:48:11 +0000 (15:48 -0400)]
Add generic 'extra TLS'
Add the logic to append an 'extra TLS' block in the TLS block allocator
with a generic stub implementation. The duplicated code in
'csu/libc-tls.c' and 'elf/dl-tls.c' is to handle both statically linked
applications and the ELF dynamic loader.
Michael Jeanson [Wed, 10 Jul 2024 19:37:28 +0000 (15:37 -0400)]
nptl: Add rseq auxvals
Get the rseq feature size and alignment requirement from the auxiliary
vector for use inside the dynamic loader. Use '__rseq_size' directly to
store the feature size. If the main thread registration fails or is
disabled by tunable, reset the value to 0.
This will be used in the TLS block allocator to compute the size and
alignment of the rseq area block for the extended ABI support.
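The two auxiliary vector entries can also be inspected from an ordinary program, which may help when checking what a given kernel reports. The sketch below is an application-side illustration, not the dynamic loader code: it uses getauxval from <sys/auxv.h>, and struct ex_rseq_auxvals and ex_read_rseq_auxvals are hypothetical names. The fallback values 27 and 28 are the ones in the Linux UAPI <linux/auxvec.h> and are only used when the installed headers do not define the constants.

```c
#include <assert.h>
#include <sys/auxv.h>

/* Provided by newer headers; the fallbacks mirror <linux/auxvec.h>.  */
#ifndef AT_RSEQ_FEATURE_SIZE
# define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
# define AT_RSEQ_ALIGN 28
#endif

/* Hypothetical container for the two values.  */
struct ex_rseq_auxvals
{
  unsigned long feature_size;  /* 0 if the kernel does not report it.  */
  unsigned long align;         /* 0 if the kernel does not report it.  */
};

static struct ex_rseq_auxvals
ex_read_rseq_auxvals (void)
{
  struct ex_rseq_auxvals v;
  /* getauxval returns 0 when the entry is absent from the vector.  */
  v.feature_size = getauxval (AT_RSEQ_FEATURE_SIZE);
  v.align = getauxval (AT_RSEQ_ALIGN);
  /* An alignment, when reported, is expected to be a power of two.  */
  assert (v.align == 0 || (v.align & (v.align - 1)) == 0);
  return v;
}
```

On kernels that predate these entries both fields come back as 0, so a caller has to treat 0 as "not reported" rather than as a usable size or alignment.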
Add a couple of tests to verify that the CPU affinity set using
sched_setaffinity and pthread_setaffinity_np is inherited by a child
process and child thread.
Florian Weimer [Wed, 8 Jan 2025 15:55:31 +0000 (16:55 +0100)]
elf: Minimize library dependencies of tst-nolink-libc.c
On 32-bit Arm, -fasynchronous-unwind-tables creates a reference
to the symbol __aeabi_unwind_cpp_pr0. Compile the tests without
this flag even if it is passed as part of CC, to avoid linker
failures.
Samuel Thibault [Tue, 7 Jan 2025 01:36:55 +0000 (02:36 +0100)]
include/string.h: Also redirect calls if not inlined in libpthread
htl's pt-alloc.c calls __mempcpy, which is #defined to
__builtin_mempcpy but does not happen to get inlined (the size is
dynamic), so gcc emits a reference to mempcpy, thus violating the
symbol exposure rules. We thus also have to redirect such
references to __mempcpy.
Linux bogsucker 6.1.55-gentoo-dist-hardened #1 SMP Sun Oct 1 18:03:02 UTC 2023 ppc64le POWER9 (architected), altivec supported CHRP IBM pSeries (emulated by qemu) GNU/Linux
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Florian Weimer [Tue, 7 Jan 2025 08:18:07 +0000 (09:18 +0100)]
elf: Second ld.so relocation only if libc.so has been loaded
Commit 8f8dd904c4a2207699bb666f30acceb5209c8d3f (“elf:
rtld_multiple_ref is always true”) removed some code that happened
to enable compatibility with programs that do not link against
libc.so. Such programs cannot call dlopen or any dynamic linker
functions (except __tls_get_addr), so this is not really useful.
Still ld.so should not crash with a null-pointer dereference
or undefined symbol reference in these cases.
In the main relocation loop, call _dl_relocate_object unconditionally
because it already checks if the object has been relocated.
If libc.so was loaded, self-relocate ld.so against it and call
__rtld_mutex_init and __rtld_malloc_init_real to activate the full
implementations. Those are available only if libc.so is there,
so skip these initialization steps if libc.so is absent. Without
libc.so, the global scope can be completely empty. This can cause
ld.so self-relocation to fail if it uses symbol-based
relocations, which is why the second ld.so self-relocation is not
performed if libc.so is missing.
The previous concern regarding GOT updates through self-relocation
no longer applies because function pointers are updated
explicitly through __rtld_mutex_init and __rtld_malloc_init_real,
and not through relocation. However, the second ld.so self-relocation
is still delayed, in case there are other symbols being used.
Samuel Thibault [Tue, 7 Jan 2025 00:56:41 +0000 (01:56 +0100)]
tst-xdirent: Fix allocating dirent for readdir_r call
As documented in the glibc manual, “Some systems don’t define the d_name
element sufficiently long”, and it provides an example of using a union to
properly allocate the storage under the dirent.
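The union technique from the manual can be sketched as follows. union ex_dirent_storage and ex_name_capacity are hypothetical names for this illustration, and the NAME_MAX fallback of 255 is an assumption used only when <limits.h> does not define it. The sketch shows the storage declaration only and deliberately does not call readdir_r.

```c
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stddef.h>

#ifndef NAME_MAX
# define NAME_MAX 255  /* Assumed fallback for this sketch only.  */
#endif

/* Storage that is at least as large as struct dirent and also large
   enough for a d_name of NAME_MAX bytes plus the terminating NUL.  */
union ex_dirent_storage
{
  struct dirent d;
  char b[offsetof (struct dirent, d_name) + NAME_MAX + 1];
};

/* Bytes available from the start of d_name to the end of the storage.  */
static size_t
ex_name_capacity (void)
{
  size_t total = sizeof (union ex_dirent_storage);
  size_t name_offset = offsetof (struct dirent, d_name);
  assert (total > name_offset);
  return total - name_offset;
}
```

A caller would declare a variable of this union type and pass the address of its d member as the entry argument, so that the name can never run past the end of the buffer even on systems where d_name is declared too short.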
Michael Jeanson [Wed, 31 Jul 2024 21:51:16 +0000 (17:51 -0400)]
nptl: Add <thread_pointer.h> for C-SKY
This will be required by the rseq extensible ABI implementation on all
Linux architectures exposing the '__rseq_size' and '__rseq_offset'
symbols to set the initial value of the 'cpu_id' field which can be used
by applications to test if rseq is available and registered. As long as
the symbols are exposed it is valid for an application to perform this
test even if rseq is not yet implemented in libc for this architecture.
Compile tested with build-many-glibcs.py but I don't have access to any
hardware to run the tests.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Florian Weimer <fweimer@redhat.com>
Michael Jeanson [Wed, 31 Jul 2024 21:34:54 +0000 (17:34 -0400)]
nptl: Add <thread_pointer.h> for microblaze
This will be required by the rseq extensible ABI implementation on all
Linux architectures exposing the '__rseq_size' and '__rseq_offset'
symbols to set the initial value of the 'cpu_id' field which can be used
by applications to test if rseq is available and registered. As long as
the symbols are exposed it is valid for an application to perform this
test even if rseq is not yet implemented in libc for this architecture.
Compile tested with build-many-glibcs.py but I don't have access to any
hardware to run the tests.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Florian Weimer <fweimer@redhat.com>
Luna Lamb [Fri, 3 Jan 2025 20:15:17 +0000 (20:15 +0000)]
AArch64: Improve codegen in SVE expm1f and users
Use unpredicated muls, use absolute compare and improve memory access.
Expm1f, sinhf and tanhf show 7%, 5% and 1% improvements in the throughput
microbenchmark on Neoverse V1.
Joe Ramsay [Fri, 3 Jan 2025 19:13:36 +0000 (19:13 +0000)]
math: Remove no-mathvec flag
More routines are to follow, some of which hit many failures in the
current testsuite due to wrong sign of zero (mathvec routines are not
required to get this right). Instead of disabling a large number of
tests, change the failure condition such that, for vector routines,
tests pass as long as computed == expected == 0.0, regardless of sign.
Affected tests (vector tests for expm1, log1p, sin, tan and tanh) all
still pass.
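The relaxed failure condition can be sketched in a few lines. This is not the testsuite code: ex_result_ok and ex_same_sign are hypothetical helpers, and the comparison for nonzero values is simplified to exact equality where the real tests use ULP bounds. The sketch only shows how the sign of zero stops mattering for vector routines.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* True if both values carry the same sign bit (distinguishes -0.0).  */
static bool
ex_same_sign (double a, double b)
{
  return (signbit (a) != 0) == (signbit (b) != 0);
}

/* Simplified pass/fail decision for one result.  */
static bool
ex_result_ok (double computed, double expected, bool is_vector_routine)
{
  /* NaN expectations are out of scope for this sketch.  */
  assert (!isnan (expected));
  if (computed == 0.0 && expected == 0.0)
    /* -0.0 == 0.0 compares equal, so scalar routines also need a sign
       check; vector routines pass regardless of the sign.  */
    return is_vector_routine || ex_same_sign (computed, expected);
  return computed == expected;
}
```

With this rule a vector routine returning -0.0 where +0.0 is expected passes, while the same result from a scalar routine is still reported as a failure.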
Yat Long Poon [Fri, 3 Jan 2025 19:09:05 +0000 (19:09 +0000)]
AArch64: Improve codegen for SVE log1pf users
Reduce memory access by using lanewise MLA and reduce number of MOVPRFXs.
Move log1pf implementation to inline helper function.
Speedup on Neoverse V1 for log1pf (10%), acoshf (-1%), atanhf (2%), asinhf (2%).
Yat Long Poon [Fri, 3 Jan 2025 19:07:30 +0000 (19:07 +0000)]
AArch64: Improve codegen for SVE logs
Reduce memory access by using lanewise MLA and moving constants to struct
and reduce number of MOVPRFXs.
Update maximum ULP error for double log_sve from 1 to 2.
Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).
H.J. Lu [Fri, 3 Jan 2025 02:21:56 +0000 (10:21 +0800)]
Rename have-mtls-descriptor to have-test-mtls-descriptor
Since have-mtls-descriptor is only used for glibc testing, rename it to
have-test-mtls-descriptor. Also enable tst-gnu2-tls2-amx only if
$(have-test-mtls-descriptor) == gnu2.
Tested with GCC 14 and Clang 19/18/17 on x86-64.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sam James <sam@gentoo.org>
Linux timberdoodle 6.1.60-gentoo-dist-hardened #1 SMP Fri Dec 1 22:10:49 UTC 2023 ppc64 POWER9 (architected), altivec supported CHRP IBM pSeries (emulated by qemu) GNU/Linux
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>