Paul E. Murphy [Wed, 1 Jun 2022 16:19:49 +0000 (16:19 +0000)]
nptl_db: disable DT_RELR on libthread_db.so
Some nptl tests inadvertently use the host's gdb to verify
libthread_db.so, which is loaded with the host's runtime. This causes
a couple of test failures when the host glibc does not support DT_RELR.
The not correct, but simple, workaround is to build without DT_RELR
as this library is otherwise likely to load on glibc 2.17 and newer
today.
This allows tst-pthread-gdb-attach{,-static} to continue working
when testing on a gdb loaded with an older glibc.
This avoids a failure in tst-pthread-gdb-attach similar to:
Trying host libthread_db library: .../build/glibc/nptl_db/libthread_db.so.1.
dlopen failed: /lib64/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by .../build/glibc/nptl_db/libthread_db.so.1).
Noah Goldstein [Fri, 3 Jun 2022 23:52:37 +0000 (18:52 -0500)]
x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.
Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Noah Goldstein [Tue, 7 Jun 2022 04:11:32 +0000 (21:11 -0700)]
x86: Optimize memrchr-avx2.S
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.
Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Noah Goldstein [Tue, 7 Jun 2022 04:11:31 +0000 (21:11 -0700)]
x86: Optimize memrchr-evex.S
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.
Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Noah Goldstein [Tue, 7 Jun 2022 04:11:30 +0000 (21:11 -0700)]
x86: Optimize memrchr-sse2.S
The new code:
1. prioritizes smaller lengths more.
2. optimizes target placement more carefully.
3. reuses logic more.
4. fixes up various inefficiencies in the logic.
The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874
Regressions:
1. The page cross case is now colder, especially re-entry from the
page cross case if a match is not found in the first VEC
(roughly 50%). My general opinion with this patch is this is
acceptable given the "coldness" of this case (less than 4%) and
generally performance improvement in the other far more common
cases.
2. There are some regressions 5-15% for medium/large user-arg
lengths that have a match in the first VEC. This is because the
logic was rewritten to optimize finds in the first VEC if the
user-arg length is shorter (where we see roughly 20-50%
performance improvements). It is not always the case this is a
regression. My intuition is some frontend quirk is partially
explaining the data although I haven't been able to find the
root cause.
Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Noah Goldstein [Tue, 7 Jun 2022 04:11:29 +0000 (21:11 -0700)]
Benchtests: Improve memrchr benchmarks
Add a second iteration for memrchr to set `pos` starting from the end
of the buffer.
Previously `pos` was only set relative to the beginning of the
buffer. This isn't really useful for memrchr because the beginning
of the search space is (buf + len). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Noah Goldstein [Tue, 7 Jun 2022 04:11:27 +0000 (21:11 -0700)]
x86: Create header for VEC classes in x86 strings library
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.
There is no difference in the objdump of libc.so before and after this
patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
powerpc: Fix VSX register number on __strncpy_power9 [BZ #29197]
__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was mistakenly being used for stxv and stxvl, which take a
VSX vector as operand. The code ended up using the uninitialized VSR 18
register by mistake.
Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).
Wilco Dijkstra [Tue, 7 Jun 2022 15:44:35 +0000 (16:44 +0100)]
AArch64: Add SVE memcpy
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.
(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.
(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.
Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
Geometric mean of all benchmarks: new / old = 0.66
Difficult skiptable(0) : new / old = 0.02
Difficult skiptable(1) : new / old = 0.01
Difficult 2-way : new / old = 0.25
Difficult testing first 2 : new / old = 1.26
Difficult skiptable(0) : new / old = 0.05
Difficult skiptable(1) : new / old = 0.06
Difficult 2-way : new / old = 0.26
Difficult testing first 2 : new / old = 1.05
Difficult skiptable(0) : new / old = 0.42
Difficult skiptable(1) : new / old = 0.24
Difficult 2-way : new / old = 0.21
Difficult testing first 2 : new / old = 1.04 Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Joseph Myers [Mon, 6 Jun 2022 14:47:03 +0000 (14:47 +0000)]
Declare timegm for ISO C2X
The next revision of the ISO C standard has added the timegm function
(that was already supported in glibc). Update the feature test
conditionals on its declaration in <time.h> accordingly.
Florian Weimer [Thu, 2 Jun 2022 14:29:55 +0000 (16:29 +0200)]
Linux: Adjust struct rseq definition to current kernel version
This definition is only used as a fallback with old kernel headers.
The change follows kernel commit bfdf4e6208051ed7165b2e92035b4bf11
("rseq: Remove broken uapi field layout on 32-bit little endian").
WANG Xuerui [Wed, 1 Jun 2022 02:12:28 +0000 (10:12 +0800)]
linux: use statx for fstat if neither newfstatat nor fstatat64 is present
LoongArch is going to be the first architecture supported by Linux that
has neither fstat* nor newfstatat [1], instead exclusively relying on
statx. So in fstatat64's implementation, we need to also enable statx
usage if neither fstatat64 nor newfstatat is present, to prepare for
this new case of kernel ABI.
Joseph Myers [Wed, 1 Jun 2022 14:45:48 +0000 (14:45 +0000)]
Add MADV_DONTNEED_LOCKED from Linux 5.18 to bits/mman-linux.h
Linux 5.18 adds a constant MADV_DONTNEED_LOCKED (defined in multiple
header files, but with the same value on all architectures). Add this
constant to bits/mman-linux.h.
i686: Use generic sinf implementation for SSE2 version
Performance seems to be similar (gcc 11.2.1 on a Ryzen 9 5900X),
the generic algorithm shows slight better performance for
the 'workload-huge.wrf' input set.
Andreas Schwab [Tue, 31 May 2022 11:09:38 +0000 (13:09 +0200)]
x86_64: Optimize sincos where sin/cos is optimized (bug 29193)
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos. The
optimized implementations may produce results that differ, that also makes
sure that the sincos call aggrees with the sin and cos calls.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on sparc64-linux-gnu and sparcv9-linux-gnu.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0.
The startup code is changed to read the _dl_argc and _dl_argv values,
and envp is calculated from argc and argv.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Different than other architectures, hppa creates an unrelated stack
frame where ld.so argc/argv adjustments done by ad43cac44a6860eaefc
is not done on the argc/argv saved/restore by _dl_start_user.
Instead load _dl_argc and _dl_argv directlty instead of adjust them
using _dl_skip_args value.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. It makes the fixup_stack branch ununsed.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. It makes the _fixup_stack branch ununsed.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. It makes the fixup_stack branch ununsed.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Fangrui Song [Fri, 27 May 2022 19:34:49 +0000 (12:34 -0700)]
dlsym: Make RTLD_NEXT prefer default version definition [BZ #14932]
When the first object providing foo defines both foo@v1 and foo@@v2,
dlsym(RTLD_NEXT, "foo") returns foo@v1 while dlsym(RTLD_DEFAULT, "foo")
returns foo@@v2. The issue is that RTLD_DEFAULT uses the
DL_LOOKUP_RETURN_NEWEST flag while RTLD_NEXT doesn't. Fix the RTLD_NEXT
branch to use DL_LOOKUP_RETURN_NEWEST.
Note: the new behavior matches FreeBSD rtld. Future sanitizers will not
need to add versioned interceptors like https://reviews.llvm.org/D96348
H.J. Lu [Sat, 21 May 2022 02:21:48 +0000 (19:21 -0700)]
x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we
can ignore their r_addends.
Sunil K Pandey [Mon, 28 Feb 2022 00:39:47 +0000 (16:39 -0800)]
x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen
This patch implements following evex512 version of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.
Placeholder function, not used by any processor at the moment.
- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.
Joseph Myers [Thu, 26 May 2022 13:51:17 +0000 (13:51 +0000)]
Update kernel version to 5.18 in header constant tests
This patch updates the kernel version in the tests tst-mman-consts.py
and tst-pidfd-consts.py to 5.18. (There are no new constants covered
by these tests in 5.18, or in 5.17 in the case of tst-pidfd-consts.py
that previously used version 5.16, that need any other header
changes.)
Arjun Shankar [Tue, 24 May 2022 15:57:36 +0000 (17:57 +0200)]
Fix deadlock when pthread_atfork handler calls pthread_atfork or dlclose
In multi-threaded programs, registering via pthread_atfork,
de-registering implicitly via dlclose, or running pthread_atfork
handlers during fork was protected by an internal lock. This meant
that a pthread_atfork handler attempting to register another handler or
dlclose a dynamically loaded library would lead to a deadlock.
This commit fixes the deadlock in the following way:
During the execution of handlers at fork time, the atfork lock is
released prior to the execution of each handler and taken again upon its
return. Any handler registrations or de-registrations that occurred
during the execution of the handler are accounted for before proceeding
with further handler execution.
If a handler that hasn't been executed yet gets de-registered by another
handler during fork, it will not be executed. If a handler gets
registered by another handler during fork, it will not be executed
during that particular fork.
The possibility that handlers may now be registered or deregistered
during handler execution means that identifying the next handler to be
run after a given handler may register/de-register others requires some
bookkeeping. The fork_handler struct has an additional field, 'id',
which is assigned sequentially during registration. Thus, handlers are
executed in ascending order of 'id' during 'prepare', and descending
order of 'id' during parent/child handler execution after the fork.
Two tests are included:
* tst-atfork3: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This test exercises calling dlclose from prepare, parent, and child
handlers.
* tst-atfork4: This test exercises calling pthread_atfork and dlclose
from the prepare handler.
Fangrui Song [Tue, 24 May 2022 02:16:05 +0000 (19:16 -0700)]
elf/dl-reloc.c: Copyright The GNU Toolchain Authors
by following 3.5. Update copyright information
on https://sourceware.org/glibc/wiki/Contribution%20checklist .
The change is advised by Carlos O'Donell.
Noah Goldstein [Mon, 23 May 2022 22:41:55 +0000 (17:41 -0500)]
benchtests: Improve bench-strnlen.c
1. Output results in json format so its easier to parse
2. Increase max alignment to `getpagesize () - 1` to make it possible
to test page cross cases. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Both float, double, and _Float128 are assumed to be supported
(float and double already only uses builtins). Only long double
is parametrized due GCC bug 29253 which prevents its usage on
powerpc.
It allows to remove i686, ia64, x86_64, powerpc, and sparc arch
specific implementation.
On ia64 it also fixes the sNAN handling:
math/test-float64x-fabs
math/test-ldouble-fabs
Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
powerpc64-linux-gnu, sparc64-linux-gnu, and ia64-linux-gnu.
Say both a.so and b.so define protected data symbol `var` and the executable
copy relocates var. ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange
semantics: a.so accesses the copy in the executable while b.so accesses its
own. This behavior requires that (a) the compiler emits GOT-generating
relocations (b) the linker produces GLOB_DAT instead of RELATIVE.
Without the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA code, b.so's GLOB_DAT
will bind to the executable (normal behavior).
For aarch64 it makes sense to restore the original behavior and don't
pay the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA cost. The behavior is very
unlikely used by anyone.
* Clang code generator treats STV_PROTECTED the same way as STV_HIDDEN:
no GOT-generating relocation in the first place.
* gold and lld reject copy relocation on a STV_PROTECTED symbol.
* Nowadays -fpie/-fpic modes are popular. GCC/Clang's codegen uses
GOT-generating relocation when accessing an default visibility
external symbol which avoids copy relocation.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
POSIX reserves the RTLD_ namespace, and this is already reflected in our
conform tests.
Note: RTLD_DEFAULT and RTLD_NEXT appear in IEEE Std 1003.1-2004. Many
systems (e.g. FreeBSD, musl) just define the macros unconditionally.
Noah Goldstein [Thu, 19 May 2022 22:18:03 +0000 (17:18 -0500)]
elf: Optimize _dl_new_hash in dl-new-hash.h
Unroll slightly and enforce good instruction scheduling. This improves
performance on out-of-order machines. The unrolling allows for
pipelined multiplies.
As well, as an optional sysdep, reorder the operations and prevent
reassosiation for better scheduling and higher ILP. This commit
only adds the barrier for x86, although it should be either no
change or a win for any architecture.
Unrolling further started to induce slowdowns for sizes [0, 4]
but can help the loop so if larger sizes are the target further
unrolling can be beneficial.
Results for _dl_new_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Noah Goldstein [Thu, 19 May 2022 22:18:02 +0000 (17:18 -0500)]
nss: Optimize nss_hash in nss_hash.c
The prior unrolling didn't really do much as it left the dependency
chain between iterations. Unrolled the loop for 4 so 4x multiplies
could be pipelined in out-of-order machines.
Results for __nss_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Noah Goldstein [Thu, 19 May 2022 22:17:58 +0000 (17:17 -0500)]
elf: Refactor dl_new_hash so it can be tested / benchmarked
No change to the code other than moving the function to
dl-new-hash.h. Changed name so its now in the reserved namespace. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Florian Weimer [Mon, 23 May 2022 08:08:18 +0000 (10:08 +0200)]
locale: Call _nl_unload_locale from _nl_archive_subfreeres
The function performs the same steps for ld_archive locales
(mapped from an archive), and this code is not performance-critical,
so the specialization does not add value.
Florian Weimer [Mon, 23 May 2022 08:08:18 +0000 (10:08 +0200)]
stdio-common: Add tst-memstream-string for open_memstream overflow
This code path is exercised indirectly by some of the DNS stub
resolver tests, via their own use of xopen_memstream for constructing
strings describing result data. The relative lack of test suite
coverage became apparent when these tests starting failing after a
printf changes uncovered bug 28949.