sourceware.org Git - glibc.git/log

elf: Move LAV_CURRENT to link_lavcurrent.h

No functional change.

Move assignment out of the CAS condition

Update

commit 49302b8fdf9103b6fc0a398678668a22fa19574c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:54:01 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.

and

commit 0b82747dc48d5bf0871bdc6da8cb6eec1256355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:31:51 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.

by moving assignment out of the CAS condition.

Add a comment for --enable-initfini-array [BZ #27945]

Document that --enable-initfini-array is enabled by default in GCC 12,
which can be removed when GCC 12 becomes the minimum requirement.

tst-tzset: output reason when creating 4GiB file fails

Currently, if the temporary file creation fails the create_tz_file
function returns NULL. The NULL pointer is then passed to setenv which
causes a SIGSEGV. Rather than failing with a SIGSEGV print a warning
and exit.

Add LLL_MUTEX_READ_LOCK [BZ #28537]

CAS instruction is expensive. From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading. See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare and swap will grab the cache line exclusive and cause
excessive cache line bouncing.

Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
loop if compare may fail to reduce cache line bouncing on contended locks.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]

Replace boolean CAS with value CAS to avoid the extra load.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]

Replace boolean CAS with value CAS to avoid the extra load.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

String: Split memcpy tests so that parallel build is faster

No bug.

This commit splits test-memcpy.c into test-memcpy.c and
test-memcpy-large.c. The idea is parallel builds will be able to run
both in parallel speeding up the process.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

x86: Shrink memcmp-sse4.S code size

No bug.

This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.

memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes

The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.

See https://research.google/pubs/pub48320/ for more details.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Support C2X printf %b, %B

C2X adds a printf %b format (see
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2630.pdf>, accepted
for C2X), for outputting integers in binary.  It also has recommended
practice for a corresponding %B format (like %b, but %#B starts the
output with 0B instead of 0b).  Add support for these formats to
glibc.

One existing test uses %b as an example of an unknown format, to test
how glibc printf handles unknown formats; change that to %v.  Use of
%b and %B as user-registered format specifiers continues to work (and
we already have a test that covers that, tst-printfsz.c).

Note that C2X also has scanf %b support, plus support for binary
constants starting 0b in strtol (base 0 and 2) and scanf %i (strtol
base 0 and scanf %i coming from a previous paper that added binary
integer literals).  I intend to implement those features in a separate
patch or patches; as discussed in the thread starting at
<https://sourceware.org/pipermail/libc-alpha/2020-December/120414.html>,
they will be more complicated because they involve adding extra public
symbols to ensure compatibility with existing code that might not
expect 0b constants to be handled by strtol base 0 and 2 and scanf %i,
whereas simply adding a new format specifier poses no such
compatibility concerns.

Note that the actual conversion from integer to string uses existing
code in _itoa.c.  That code has special cases for bases 8, 10 and 16,
probably so that the compiler can optimize division by an integer
constant in the code for those bases.  If desired such special cases
could easily be added for base 2 as well, but that would be an
optimization, not actually needed for these printf formats to work.

Tested for x86_64 and x86.  Also tested with build-many-glibcs.py for
aarch64-linux-gnu with GCC mainline to make sure that the test does
indeed build with GCC 12 (where format checking warnings are enabled
for most of the test).

Update syscall lists for Linux 5.15

Linux 5.15 has one new syscall, process_mrelease (and also enables the
clone3 syscall for RV32). It also has a macro __NR_SYSCALL_MASK for
Arm, which is not a syscall but matches the pattern used for syscall
macro names.

Add __NR_SYSCALL_MASK to the names filtered out in the code dealing
with syscall lists, update syscall-names.list for the new syscall and
regenerate the arch-syscall.h headers with build-many-glibcs.py
update-syscalls.

Tested with build-many-glibcs.py.

s390: Use long branches across object boundaries (jgh instead of jh)

Depending on the layout chosen by the linker, the 16-bit displacement
of the jh instruction is insufficient to reach the target label.

Analysis of the linker failure was carried out by Nick Clifton.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Stefan Liebler <stli@linux.ibm.com>

Remove the unused +mkdep/+make-deps/s-proto.S/s-proto-cancel.S

Since

commit d73f5331ce5370ca5a879229e3842f5de98689cd
Author: Roland McGrath <roland@gnu.org>
Date:   Fri May 2 02:20:45 2003 +0000

    2003-05-01  Roland McGrath  <roland@redhat.com>

dependency is generated by passing -MD -MF to compiler.  Remove the unused
+mkdep, +make-deps, s-proto.S and s-proto-cancel.S.

This fixes BZ #28554.

Fix build a chec failures after b05fae4d8e34

The include cleanup on dl-minimal.c removed too much for some
targets.

Also for Hurd, __sbrk is removed from localplt.data now that
tunables allocated memory through mmap.

Checked with a build for all affected architectures.

elf: Use the minimal malloc on tunables_strdup

The rtld_malloc functions are moved to its own file so it can be
used on csu code. Also, the functiosn are renamed to __minimal_*
(since there are now used not only on loader code).

Using the __minimal_malloc on tunables_strdup() avoids potential
issues with sbrk() calls while processing the tunables (I see
sporadic elf/tst-dso-ordering9 on powerpc64le with different
tests failing due ASLR).

Also, using __minimal_malloc over plain mmap optimizes the memory
allocation on both static and dynamic case (since it will any unused
space in either the last page of data segments, avoiding mmap() call,
or from the previous mmap() call).

Checked on x86_64-linux-gnu, i686-linux-gnu, and powerpc64le-linux-gnu.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>

Fix memmove call in vfprintf-internal.c:group_number

A recent GCC mainline change introduces errors of the form:

vfprintf-internal.c: In function 'group_number':
vfprintf-internal.c:2093:15: error: 'memmove' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
2093 |               memmove (w, s, (front_ptr -s) * sizeof (CHAR_T));
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a genuine bug in the glibc code: s > front_ptr is always true
at this point in the code, and the intent is clearly for the
subtraction to be the other way round.  The other arguments to the
memmove call here also appear to be wrong; w and s point just *after*
the destination and source for copying the rest of the number, so the
size needs to be subtracted to get appropriate pointers for the
copying.  Adjust the memmove call to conform to the apparent intent of
the code, so fixing the -Wstringop-overflow error.

Now, if the original code were ever executed, a buffer overrun would
result.  However, I believe this code (introduced in commit
edc1686af0c0fc2eb535f1d38cdf63c1a5a03675, "vfprintf: Reuse work_buffer
in group_number", so in glibc 2.26) is unreachable in prior glibc
releases (so there is no need for a bug in Bugzilla, no need to
consider any backports unless someone wants to build older glibc
releases with GCC 12 and no possibility of this buffer overrun
resulting in a security issue).

work_buffer is 1000 bytes / 250 wide characters.  This case is only
reachable if an initial part of the number, plus a grouped copy of the
rest of the number, fail to fit in that space; that is, if the grouped
number fails to fit in the space.  In the wide character case,
grouping is always one wide character, so even with a locale (of which
there aren't any in glibc) grouping every digit, a number would need
to occupy at least 125 wide characters to overflow, and a 64-bit
integer occupies at most 23 characters in octal including a leading 0.
In the narrow character case, the multibyte encoding of the grouping
separator would need to be at least 42 bytes to overflow, again
supposing grouping every digit, but MB_LEN_MAX is 16.  So even if we
admit the case of artificially constructed locales not shipped with
glibc, given that such a locale would need to use one of the character
sets supported by glibc, this code cannot be reached at present.  (And
POSIX only actually specifies the ' flag for grouping for decimal
output, though glibc acts on it for other bases as well.)

With binary output (if you consider use of grouping there to be
valid), you'd need a 15-byte multibyte character for overflow; I don't
know if any supported character set has such a character (if, again,
we admit constructed locales using grouping every digit and a grouping
separator chosen to have a multibyte encoding as long as possible, as
well as accepting use of grouping with binary), but given that we have
this code at all (clearly it's not *correct*, or in accordance with
the principle of avoiding arbitrary limits, to skip grouping on
running out of internal space like that), I don't think it should need
any further changes for binary printf support to go in.

On the other hand, support for large sizes of _BitInt in printf (see
the N2858 proposal) *would* require something to be done about such
arbitrary limits (presumably using dynamic allocation in printf again,
for sufficiently large _BitInt arguments only - currently only
floating-point uses dynamic allocation, and, as previously discussed,
that could actually be replaced by bounded allocation given smarter
code).

Tested with build-many-glibcs.py for aarch64-linux-gnu (GCC mainline).
Also tested natively for x86_64.

locale: Fix localedata/sort-test undefined behavior

The collate-test.c triggers UB with an signed integer overflow,
which results in an error on some architectures (powerpc32).

Checked on x86_64, i686, and powerpc.

test-memcpy.c: Double TIMEOUT to (8 * 60)

commit d585ba47fcda99fdf228e3e45a01b11a15efbc5a
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Nov 1 00:49:48 2021 -0500

    string: Make tests birdirectional test-memcpy.c

    This commit updates the memcpy tests to test both dst > src and dst <
    src. This is because there is logic in the code based on the

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
significantly increased the number of tests.  On Intel Core i7-1165G7,
test-memcpy takes 120 seconds to run when machine is idle.  Double
TIMEOUT to (8 * 60) for test-memcpy to avoid timeout when machine is
under heavy load.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>

hurd: Remove unused __libc_close_range

That was just cargo-culted.

hurd: Implement close_range and closefrom

The close_range () function implements the same API as the Linux and
FreeBSD syscalls. It operates atomically and reliably. The specified
upper bound is clamped to the actual size of the file descriptor table;
it is expected that the most common use case is with last = UINT_MAX.

Like in the Linux syscall, it is also possible to pass the
CLOSE_RANGE_CLOEXEC flag to mark the file descriptors in the range
cloexec instead of acually closing them.

Also, add a Hurd version of the closefrom () function. Since unlike on
Linux, close_range () cannot fail due to being unuspported by the
running kernel, a fallback implementation is never necessary.

Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20211106153524.82700-1-bugaevc@gmail.com>

x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h

No bug.

This patch doubles the rep_movsb_threshold when using ERMS. Based on
benchmarks the vector copy loop, especially now that it handles 4k
aliasing, is better for these medium ranged.

On Skylake with ERMS:

Size,   Align1, Align2, dst>src,(rep movsb) / (vec copy)
4096,   0,      0,      0,      0.975
4096,   0,      0,      1,      0.953
4096,   12,     0,      0,      0.969
4096,   12,     0,      1,      0.872
4096,   44,     0,      0,      0.979
4096,   44,     0,      1,      0.83
4096,   0,      12,     0,      1.006
4096,   0,      12,     1,      0.989
4096,   0,      44,     0,      0.739
4096,   0,      44,     1,      0.942
4096,   12,     12,     0,      1.009
4096,   12,     12,     1,      0.973
4096,   44,     44,     0,      0.791
4096,   44,     44,     1,      0.961
4096,   2048,   0,      0,      0.978
4096,   2048,   0,      1,      0.951
4096,   2060,   0,      0,      0.986
4096,   2060,   0,      1,      0.963
4096,   2048,   12,     0,      0.971
4096,   2048,   12,     1,      0.941
4096,   2060,   12,     0,      0.977
4096,   2060,   12,     1,      0.949
8192,   0,      0,      0,      0.85
8192,   0,      0,      1,      0.845
8192,   13,     0,      0,      0.937
8192,   13,     0,      1,      0.939
8192,   45,     0,      0,      0.932
8192,   45,     0,      1,      0.927
8192,   0,      13,     0,      0.621
8192,   0,      13,     1,      0.62
8192,   0,      45,     0,      0.53
8192,   0,      45,     1,      0.516
8192,   13,     13,     0,      0.664
8192,   13,     13,     1,      0.659
8192,   45,     45,     0,      0.593
8192,   45,     45,     1,      0.575
8192,   2048,   0,      0,      0.854
8192,   2048,   0,      1,      0.834
8192,   2061,   0,      0,      0.863
8192,   2061,   0,      1,      0.857
8192,   2048,   13,     0,      0.63
8192,   2048,   13,     1,      0.629
8192,   2061,   13,     0,      0.627
8192,   2061,   13,     1,      0.62

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

x86: Optimize memmove-vec-unaligned-erms.S

No bug.

The optimizations are as follows:

1) Always align entry to 64 bytes. This makes behavior more
   predictable and makes other frontend optimizations easier.

2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have
   significant benefits in the case that:
        0 < (dst - src) < [256, 512]

3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
   improvement and for FSRM [-10%, 25%].

In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

benchtests: Add partial overlap case in bench-memmove-walk.c

This commit adds a new partial overlap benchmark. This is generally
the most interesting performance case for memmove and was missing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

benchtests: Add additional cases to bench-memcpy.c and bench-memmove.c

This commit adds more benchmarks for the common memcpy/memmove
benchmarks. The most signifcant cases are the half page offsets. The
current versions leaves dst and src near page aligned which leads to
false 4k aliasing on x86_64. This can add noise due to false
dependencies from one run to the next. As well, this seems like more
of an edge case that common case so it shouldn't be the only thing

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

string: Make tests birdirectional test-memcpy.c

This commit updates the memcpy tests to test both dst > src and dst <
src. This is because there is logic in the code based on the

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Remove the last trace of generate-md5 [BZ #28554]

generate-md5 was removed by

commit d73f5331ce5370ca5a879229e3842f5de98689cd
Author: Roland McGrath <roland@gnu.org>
Date:   Fri May 2 02:20:45 2003 +0000

    2003-05-01  Roland McGrath  <roland@redhat.com>

Remove its last trace.  This fixes BZ #28554.

Revert "benchtests: Add acosf function to bench-math"

This reverts commit 79d0fc65395716c1d95931064c7bf37852203c66.

Configure GCC with --enable-initfini-array [BZ #27945]

Starting from GCC 12, the .init_array and .fini_array sections are enabled
unconditionally by

commit 13a39886940331149173b25d6ebde0850668d8b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Jun 8 16:09:24 2021 -0700

Always enable DT_INIT_ARRAY/DT_FINI_ARRAY on Linux

configure GCC with --enable-initfini-array to enable them when using GCC
release branches.

Fixes BZ #27945.

elf: Earlier missing dynamic segment check in _dl_map_object_from_fd

Separated debuginfo files have PT_DYNAMIC with p_filesz == 0. We
need to check for that before the _dl_map_segments call because
that could attempt to write to mappings that extend beyond the end
of the file, resulting in SIGBUS.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

gconv: Do not emit spurious NUL character in ISO-2022-JP-3 (bug 28524)

Bugfix 27256 has introduced another issue:
In conversion from ISO-2022-JP-3 encoding, it is possible
to force iconv to emit extra NUL character on internal state reset.
To do this, it is sufficient to feed iconv with escape sequence
which switches active character set.
The simplified check 'data->__statep->__count != ASCII_set'
introduced by the aforementioned bugfix picks that case and
behaves as if '\0' character has been queued thus emitting it.

To eliminate this issue, these steps are taken:
* Restore original condition
'(data->__statep->__count & ~7) != ASCII_set'.
It is necessary since bits 0-2 may contain
number of buffered input characters.
* Check that queued character is not NUL.
Similar step is taken for main conversion loop.

Bundled test case follows following logic:
* Try to convert ISO-2022-JP-3 escape sequence
switching active character set
* Reset internal state by providing NULL as input buffer
* Ensure that nothing has been converted.

Signed-off-by: Nikita Popov <npv1310@gmail.com>

[powerpc] Tighten contraints for asm constant parameters

There are a few places where only known numeric values are acceptable for
`asm` parameters, yet the constraint "i" is used. "i" can include
"symbolic constants whose values will be known only at assembly time or
later."

Use "n" instead of "i" where known numeric values are required.

Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>

elf: Do not run DSO sorting if tunables is not enabled

Since the argorithm selection requires tunables.

Checked on x86_64-linux-gnu with --enable-tunables=no.

riscv: Build with -mno-relax if linker does not support R_RISCV_ALIGN

It allows build both glibc and tests with lld (Since lld does not
support R_RISCV_ALIGN linker relaxation).

Checked with a build for riscv32-linux-gnu-rv32imafdc-ilp32d and
riscv64-linux-gnu-rv64imafdc-lp64d.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Fangrui Song <maskray@google.com>

x86-64: Replace movzx with movzbl

Clang cannot assemble movzx in the AT&T dialect mode.

../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
movzx (%rsi), %ecx
^~~~

Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

regex: Unnest nested functions in regcomp.c

This refactor moves four functions out of a nested scope and converts
them into static always_inline functions. collseqwc, table_size,
symb_table, extra are now initialized to zero because they are passed as
function arguments.

On x86-64, .text is 16 byte larger likely due to the 4 stores.
This is nothing compared to the amount of work that regcomp has to do
looking up the collation weights, or other functions.

If the non-buildable `sysdeps/generic/dl-machine.h` doesn't count,
this patch removes the last `auto inline` usage from glibc.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

Use Linux 5.15 in build-many-glibcs.py

This patch makes build-many-glibcs.py use Linux 5.15.

Tested with build-many-glibcs.py (host-libraries, compilers and glibcs
builds).

elf: Assume disjointed .rela.dyn and .rela.plt for loader

The patch removes the the ELF_DURING_STARTUP optimization and assume
both .rel.dyn and .rel.plt might not be subsequent.  This allows some
code simplification since relocation will be handled independently
where it is done on bootstrap.

At least on x86_64_64, I can not measure any performance implications.
Running 10000 time the command

  LD_DEBUG=statistics ./elf/ld.so ./libc.so

And filtering the "total startup time in dynamic loader" result,
the geometric mean is:

                  patched       master
  Ryzen 7 5900x     24140        24952
  i7-4510U          45957        45982

(The results do show some variation, I did not make any statistical
analysis).

It also allows build arm with lld, since it inserts ".ARM.exidx"
between ".rel.dyn" and ".rel.plt" for the loader.

Checked on x86_64-linux-gnu and arm-linux-gnueabihf.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

i386: Explain why __HAVE_64B_ATOMICS has to be 0

benchtests: Add hypotf

Based on random input arguments. About 85% tuples have exponents
of the two arguments close together (+-1 range).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

benchtests: Make hypot input random

Instead of inputs based on the algorithm implementation details.
About 85% tuples have exponents of the two arguments close
together (+-1 range).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

arm: Use have-mtls-dialect-gnu2 to check for ARM TLS descriptors support

The lld linker does not support TLSDESC for arm. The have-arm-tls-desc
is a leftover of 56583289b1 to support NaCL.

Reviewed-by: Fangrui Song <maskray@google.com>

arm: Use internal symbol for _dl_argv on _dl_start_user

The lld does not support R_ARM_GOTOFF32 to preemptible symbol (_dl_argv
has default visibility). Use the internal alias instead (one option
would to use HIDDEN_JUMPTARGET, bu the macro is not defined for
!__ASSEMBLER__ and I made this patch arm-specific to avoid require to
check extensivelly on other architecture it this might break something).

Checked on arm-linux-gnueabihf.

Reviewed-by: Fangrui Song <maskray@google.com>

x86-64: Remove Prefer_AVX2_STRCMP

Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.

x86-64: Improve EVEX strcmp with masked load

In strcmp-evex.S, to compare 2 32-byte strings, replace

        VMOVU   (%rdi, %rdx), %YMM0
        VMOVU   (%rsi, %rdx), %YMM1
        /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
        VPCMP   $4, %YMM0, %YMM1, %k0
        VPCMP   $0, %YMMZERO, %YMM0, %k1
        VPCMP   $0, %YMMZERO, %YMM1, %k2
        /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
        kord    %k1, %k2, %k1
        /* Each bit in K1 represents a NULL or a mismatch.  */
        kord    %k0, %k1, %k1
        kmovd   %k1, %ecx
        testl   %ecx, %ecx
        jne     L(last_vector)

with

        VMOVU   (%rdi, %rdx), %YMM0
        VPTESTM %YMM0, %YMM0, %k2
        /* Each bit cleared in K1 represents a mismatch or a null CHAR
           in YMM0 and 32 bytes at (%rsi, %rdx).  */
        VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
        kmovd   %k1, %ecx
        incl    %ecx
        jne     L(last_vector)

It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.

Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>

benchtests: Add acosf function to bench-math

Add acosf function to bench-math and copy acosf-inputs to benchtests.
Motivation for this patch is to prepare for upcoming libmvec new
functions. Float and double version of libmvec functions stays
together.

acosf-inputs file generated from acos-inputs file using following
scaling formula:

f = d * (FLT_MAX/DBL_MAX)

Where d is input(double) and f is output(float). If scaled float value
is duplicate in new input file, nextafterf() function used to find next
float value, ensuring no duplicates.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

benchtests: Improve bench-memcpy-random

Improve the random memcpy benchmark. Double the number of tests and increase
the size of the memory region to test between 32KB and 1024KB. This improves
accuracy on modern cores. Clean up formatting of the frequency array.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>

Disable -Waggressive-loop-optimizations warnings in tst-dynarray.c

My build-many-glibcs.py bot shows -Waggressive-loop-optimizations
errors building the glibc testsuite for 32-bit architectures with GCC
mainline, which seem to have appeared between GCC commits
4abc0c196b10251dc80d0743ba9e8ab3e56c61ed and
d8edfadfc7a9795b65177a50ce44fd348858e844:

In function 'dynarray_long_noscratch_resize',
    inlined from 'test_long_overflow' at tst-dynarray.c:489:5,
    inlined from 'do_test' at tst-dynarray.c:571:3:
../malloc/dynarray-skeleton.c:391:36: error: iteration 1073741823 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
  391 |             DYNARRAY_ELEMENT_INIT (&list->u.dynarray_header.array[i]);
tst-dynarray.c:39:37: note: in definition of macro 'DYNARRAY_ELEMENT_INIT'
   39 | #define DYNARRAY_ELEMENT_INIT(e) (*(e) = 23)
      |                                     ^
In file included from tst-dynarray.c:42:
../malloc/dynarray-skeleton.c:389:37: note: within this loop
  389 |         for (size_t i = old_size; i < size; ++i)
      |                                   ~~^~~~~~
In function 'dynarray_long_resize',
    inlined from 'test_long_overflow' at tst-dynarray.c:479:5,
    inlined from 'do_test' at tst-dynarray.c:571:3:
../malloc/dynarray-skeleton.c:391:36: error: iteration 1073741823 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
  391 |             DYNARRAY_ELEMENT_INIT (&list->u.dynarray_header.array[i]);
tst-dynarray.c:27:37: note: in definition of macro 'DYNARRAY_ELEMENT_INIT'
   27 | #define DYNARRAY_ELEMENT_INIT(e) (*(e) = 17)
      |                                     ^
In file included from tst-dynarray.c:28:
../malloc/dynarray-skeleton.c:389:37: note: within this loop
  389 |         for (size_t i = old_size; i < size; ++i)
      |                                   ~~^~~~~~

I don't know what GCC change made these errors appear, or why they
only appear for 32-bit architectures.  However, the warnings appear to
be both true (that iteration would indeed involve undefined behavior
if executed) and useless in this particular case (that iteration is
never executed, because the allocation size overflows and so the
allocation fails - but the check for allocation size overflow is in a
separate source file and so can't be seen by the compiler when
compiling this test).  So use the DIAG_* macros to disable
-Waggressive-loop-optimizations around the calls in question to
dynarray_long_resize and dynarray_long_noscratch_resize in this test.

Tested with build-many-glibcs.py (GCC mainline) for arm-linux-gnueabi,
where it restores a clean testsuite build.

Fix compiler issue with mmap_internal

Compiling mmap_internal fails to compile when we use -1 for MMAP2_PAGE_UNIT
on 32 bit architectures.  The error is as follows:

    ../sysdeps/unix/sysv/linux/mmap_internal.h:30:8: error: unknown type
    name 'uint64_t'
    |
       30 | static uint64_t page_unit;
  |
  |        ^~~~~~~~

Fix by adding including stdint.h.

Check if linker also support -mtls-dialect=gnu2

Since some linkers (for instance lld for i386) does not support it
for all architectures.

Checked on i686-linux-gnu.

Reviewed-by: Fangrui Song <maskray@google.com>

Fix LIBC_PROG_BINUTILS for -fuse-ld=lld

GCC does not print the correct linker when -fuse-ld=lld is used with
the -print-prog-name=ld:

  $ gcc -v 2>&1 | tail -n 1
  gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2)
  $ gcc
  ld

This is different than for gold:

  $ gcc -fuse-ld=gold -print-prog-name=ld
  ld.gold

Using ld.lld as the static linker name prints the expected result.

This is only required when -fuse-ld=lld is used, if lld is used as
the 'ld' programs (through a symlink) LIBC_PROG_BINUTILS works
as expected.

Checked on x86_64-linux-gnu.

Reviewed-by: Fangrui Song <maskray@google.com>

elf: Disable ifuncmain{1,5,5pic,5pie} when using LLD

These tests takes the address of a protected symbol (foo_protected)
and lld does not support copy relocations on protected data symbols.

Checked on x86_64-linux-gnu.

Reviewed-by: Fangrui Song <maskray@google.com>

Handle NULL input to malloc_usable_size [BZ #28506]

Hoist the NULL check for malloc_usable_size into its entry points in
malloc-debug and malloc and assume non-NULL in all callees. This fixes
BZ #28506

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>

x86_64: Add memcmpeq.S to fix disable-multi-arch build

The following commit:

    commit cf4fd28ea453d1a9cec93939bc88b58ccef5437a
    Author: Noah Goldstein <goldstein.w.n@gmail.com>
    Date:   Tue Oct 26 19:43:18 2021 -0500

Broke --disable-multi-arch build for x86_64 because x86_64/memcmpeq.S
was not defined outside of multiarch and the alias for __memcmpeq in
x86_64/memcmp.S was removed.

This commit fixes that issue by adding x86_64/memcmpeq.S.

make xcheck passes on x86_64 with and without --disable-multi-arch

login: Add back libutil as an empty library

There are several packages like sysvinit and buildroot that expect
-lutil to work. Rather than impacting them with having to change
the linker flags provide an empty libutil.a.

riscv: Fix incorrect jal with HIDDEN_JUMPTARGET

A non-local STV_DEFAULT defined symbol is by default preemptible in a
shared object. j/jal cannot target a preemptible symbol. On other
architectures, such a jump instruction either causes PLT [BZ #18822], or
if short-ranged, sometimes rejected by the linker (but not by GNU ld's
riscv port [ld PR/28509]).

Use HIDDEN_JUMPTARGET to target a non-preemptible symbol instead.

With this patch, ld.so and libc.so can be linked with LLD if source
files are compiled/assembled with -mno-relax/-Wa,-mno-relax.

Acked-by: Palmer Dabbelt <palmer@dabbelt.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S

No bug. This commit adds new optimized __memcmpeq implementation for
evex.

The primary optimizations are:

1) skipping the logic to find the difference of the first mismatched
byte.

2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.

x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S

No bug. This commit adds new optimized __memcmpeq implementation for
avx2.

The primary optimizations are:

1) skipping the logic to find the difference of the first mismatched
byte.

2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.

x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S

No bug. This commit does not modify any of the memcmp
implementation. It just adds __memcmpeq ifdefs to skip obvious cases
where computing the proper 1/-1 required by memcmp is not needed.

x86_64: Add support for __memcmpeq using sse2, avx2, and evex

No bug. This commit adds support for __memcmpeq to be implemented
seperately from memcmp. Support is added for versions optimized with
sse2, avx2, and evex.

Benchtests: Add benchtests for __memcmpeq

No bug. This commit adds __memcmpeq benchmarks. The benchmarks just
use the existing ones in memcmp. This will be useful for testing
implementations of __memcmpeq that do not just alias memcmp.

String: Add __memcmpeq as build target

No bug. This commit just adds __memcmpeq as a build target so that
implementations for __memcmpeq that are not just aliases to memcmp can
be supported.

NEWS: Add item for __memcmpeq

String: Add tests for __memcmpeq

No bug.

This commit adds tests for the new function __memcmpeq. The new tests
use the existing tests in 'test-memcmp.c' but relax the result
requirement to only check for zero or non-zero returns.

All string tests include test-memcmpeq are passing.

String: Add hidden defs for __memcmpeq() to enable internal usage

No bug.

This commit adds hidden defs for all declarations of __memcmpeq. This
enables usage of __memcmpeq without the PLT for usage internal to
GLIBC.

String: Add support for __memcmpeq() ABI on all targets

No bug.

This commit adds support for __memcmpeq() as a new ABI for all
targets. In this commit __memcmpeq() is implemented only as an alias
to the corresponding targets memcmp() implementation. __memcmpeq() is
added as a new symbol starting with GLIBC_2.35 and defined in string.h
with comments explaining its behavior. Basic tests that it is callable
and works where added in string/tester.c

As discussed in the proposal "Add new ABI '__memcmpeq()' to libc"
__memcmpeq() is essentially a reserved namespace for bcmp(). The means
is shares the same specifications as memcmp() except the return value
for non-equal byte sequences is any non-zero value. This is less
strict than memcmp()'s return value specification and can be better
optimized when a boolean return is all that is needed.

__memcmpeq() is meant to only be called by compilers if they can prove
that the return value of a memcmp() call is only used for its boolean
value.

All tests in string/tester.c passed. As well build succeeds on
x86_64-linux-gnu target.

configure: Don't check LD -v --help for LIBC_LINKER_FEATURE

When LIBC_LINKER_FEATURE is used to check a linker option with the equal
sign, it will likely fail because the LD -v --help output may look like
`-z lam-report=[none|warning|error]` while the needle is something like
`-z lam-report=warning`.

The LD -v --help filter doesn't save much time, so just remove it.

elf: Make global.out depend on reldepmod4.so [BZ #28457]

The global test is linked with globalmod1.so which dlopens reldepmod4.so.
Make global.out depend on reldepmod4.so. This fixes BZ #28457.
Reviewed-by: Florian Weimer <fweimer@redhat.com>

x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S

This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.

it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.

It costs 2-extra bytes but the extra bytes should only eat into
alignment padding.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

bench-math: Sort and put each bench per line

Sort and put each math bench per line to prepare for new math benches.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>

x86_64: Add missing libmvec ABI tests

Add vector ABI tests for cos, exp, log, pow and sin functions.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

elf: Fix e6fd79f379 build with --enable-tunables=no

The _dl_sort_maps_init() is not defined when tunables is not enabled.

Checked on x86_64-linux-gnu.

elf: Fix slow DSO sorting behavior in dynamic loader (BZ #17645)

This second patch contains the actual implementation of a new sorting algorithm
for shared objects in the dynamic loader, which solves the slow behavior that
the current "old" algorithm falls into when the DSO set contains circular
dependencies.

The new algorithm implemented here is simply depth-first search (DFS) to obtain
the Reverse-Post Order (RPO) sequence, a topological sort. A new l_visited:1
bitfield is added to struct link_map to more elegantly facilitate such a search.

The DFS algorithm is applied to the input maps[nmap-1] backwards towards
maps[0]. This has the effect of a more "shallow" recursion depth in general
since the input is in BFS. Also, when combined with the natural order of
processing l_initfini[] at each node, this creates a resulting output sorting
closer to the intuitive "left-to-right" order in most cases.

Another notable implementation adjustment related to this _dl_sort_maps change
is the removing of two char arrays 'used' and 'done' in _dl_close_worker to
represent two per-map attributes. This has been changed to simply use two new
bit-fields l_map_used:1, l_map_done:1 added to struct link_map. This also allows
discarding the clunky 'used' array sorting that _dl_sort_maps had to sometimes
do along the way.

Tunable support for switching between different sorting algorithms at runtime is
also added. A new tunable 'glibc.rtld.dynamic_sort' with current valid values 1
(old algorithm) and 2 (new DFS algorithm) has been added. At time of commit
of this patch, the default setting is 1 (old algorithm).

Signed-off-by: Chung-Lin Tang <cltang@codesourcery.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf: Testing infrastructure for ld.so DSO sorting (BZ #17645)

This is the first of a 2-part patch set that fixes slow DSO sorting behavior in
the dynamic loader, as reported in BZ #17645. In order to facilitate such a
large modification to the dynamic loader, this first patch implements a testing
framework for validating shared object sorting behavior, to enable comparison
between old/new sorting algorithms, and any later enhancements.

This testing infrastructure consists of a Python script
scripts/dso-ordering-test.py' which takes in a description language, consisting
of strings that describe a set of link dependency relations between DSOs, and
generates testcase programs and Makefile fragments to automatically test the
described situation, for example:

  a->b->c->d          # four objects linked one after another

  a->[bc]->d;b->c     # a depends on b and c, which both depend on d,
                      # b depends on c (b,c linked to object a in fixed order)

  a->b->c;{+a;%a;-a}  # a, b, c serially dependent, main program uses
                      # dlopen/dlsym/dlclose on object a

  a->b->c;{}!->[abc]  # a, b, c serially dependent; multiple tests generated
                      # to test all permutations of a, b, c ordering linked
                      # to main program

(Above is just a short description of what the script can do, more
  documentation is in the script comments.)

Two files containing several new tests, elf/dso-sort-tests-[12].def are added,
including test scenarios for BZ #15311 and Redhat issue #1162810 [1].

Due to the nature of dynamic loader tests, where the sorting behavior and test
output occurs before/after main(), generating testcases to use
support/test-driver.c does not suffice to control meaningful timeout for ld.so.
Therefore a new utility program 'support/test-run-command', based on
test-driver.c/support_test_main.c has been added. This does the same testcase
control, but for a program specified through a command-line rather than at the
source code level. This utility is used to run the dynamic loader testcases
generated by dso-ordering-test.py.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1162810

Signed-off-by: Chung-Lin Tang <cltang@codesourcery.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

iconv: Use TIMEOUTFACTOR for iconv test timeout

Currently the timeout for each iconv test is hard coded to 3 seconds.
On my OpenRISC test platform this is too slow and the test fails with a
HANG error.

This change uses the available TIMEOUTFACTOR to compute the timeout.
The default value is still 3.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

posix: Remove alloca usage for internal fnmatch implementation

This patch replaces the internal fnmatch pattern list generation
to use a dynamic array.

Checked on x86_64-linux-gnu.

Add alloc_align attribute to memalign et al

GCC 4.9.0 added the alloc_align attribute to say that a function
argument specifies the alignment of the returned pointer. Clang supports
the attribute too. Using the attribute can allow a compiler to generate
better code if it knows the returned pointer has a minimum alignment.
See https://gcc.gnu.org/PR60092 for more details.

GCC implicitly knows the semantics of aligned_alloc and posix_memalign,
but not the obsolete memalign. As a result, GCC generates worse code
when memalign is used, compared to aligned_alloc. Clang knows about
aligned_alloc and memalign, but not posix_memalign.

This change adds a new __attribute_alloc_align__ macro to <sys/cdefs.h>
and then uses it on memalign (where it helps GCC) and aligned_alloc
(where GCC and Clang already know the semantics, but it doesn't hurt)
and xposix_memalign. It can't be used on posix_memalign because that
doesn't return a pointer (the allocated pointer is returned via a void**
parameter instead).

Unlike the alloc_size attribute, alloc_align only allows a single
argument. That means the new __attribute_alloc_align__ macro doesn't
really need to be used with double parentheses to protect a comma
between its arguments. For consistency with __attribute_alloc_size__
this patch defines it the same way, so that double parentheses are
required.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>

linux: Fix a possibly non-constant expression in _Static_assert

According to C11 6.6p6, `const int` as an operand may not make up a
constant expression. GCC -O0 errors:

../sysdeps/unix/sysv/linux/opendir.c:107:19: error: static_assert expression is not an integral constant expression
_Static_assert (allocation_size >= sizeof (struct dirent64),

-O2 -Wpedantic has a similar warning.
See https://gcc.gnu.org/PR102502 for GCC's inconsistency.

Use enum which is guaranteed to be a constant expression.
This also makes the file compilable with Clang.

Fixes: 4b962c9e859de23b461d61f860dbd3f21311e83a ("linux: Simplify opendir buffer allocation")

x86-64: Add sysdeps/x86_64/fpu/Makeconfig

1. Add sysdeps/x86_64/fpu/Makeconfig to auto-generate libmvec.mk, which
contains libmvec ABI test dependencies and CFLAGS, in the build directory.
2. Include libmvec.mk for libmvec ABI test dependencies and CFLAGS.

Tested on SSE4, AVX, AVX2 and AVX512 machines.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>

stdlib: Fix tst-canon-bz26341 when the glibc build current working directory is itself using symlinks.

powerpc: Remove backtrace implementation

The powerpc optimization to provide a fast stacktrace requires some
ad-hoc code to handle Linux signal frames and the change is fragile
once the kernel decides to slight change its execution sequence [1].

The generic implementation work as-is and it should be future proof
since the kernel provides the expected CFI directives in vDSO shared
page.

Checked on powerpc-linux-gnu, powerpc64le-linux-gnu, and
powerpc64-linux-gnu.

[1] https://sourceware.org/pipermail/libc-alpha/2021-January/122027.html

Correct access attribute on memfrob (bug 28475)

As noted in bug 28475, the access attribute on memfrob in <string.h>
is incorrect: the function both reads and writes the memory pointed to
by its argument, so it needs to use __read_write__, not
__write_only__.  This incorrect attribute results in a build failure
for accessing uninitialized memory for s390x-linux-gnu-O3 with
build-many-glibcs.py using GCC mainline.

Correct the attribute.  Fixing this shows up that some calls to
memfrob in elf/ tests are reading uninitialized memory; I'm not
entirely sure of the purpose of those calls, but guess they are about
ensuring that the stack space is indeed allocated at that point in the
function, and so it matters that they are calling a function whose
semantics are unknown to the compiler.  Thus, change the first memfrob
call in those tests to use explicit_bzero instead, as suggested by
Florian in
<https://sourceware.org/pipermail/libc-alpha/2021-October/132119.html>,
to avoid the use of uninitialized memory.

Tested for x86_64, and with build-many-glibcs.py (GCC mainline) for
s390x-linux-gnu-O3.

debug: Add tests for _FORTIFY_SOURCE=3

Add some testing coverage for _FORTIFY_SOURCE=3.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Make sure that the fortified function conditionals are constant

In _FORTIFY_SOURCE=3, the size expression may be non-constant,
resulting in branches in the inline functions remaining intact and
causing a tiny overhead.  Clang (and in future, gcc) make sure that
the -1 case is always safe, i.e. any comparison of the generated
expression with (size_t)-1 is always false so that bit is taken care
of.  The rest is avoidable since we want the _chk variant whenever we
have a size expression and it's not -1.

Rework the conditionals in a uniform way to clearly indicate two
conditions at compile time:

- Either the size is unknown (-1) or we know at compile time that the
  operation length is less than the object size.  We can call the
  original function in this case.  It could be that either the length,
  object size or both are non-constant, but the compiler, through
  range analysis, is able to fold the *comparison* to a constant.

- The size and length are known and the compiler can see at compile
  time that operation length > object size.  This is valid grounds for
  a warning at compile time, followed by emitting the _chk variant.

For everything else, emit the _chk variant.

This simplifies most of the fortified function implementations and at
the same time, ensures that only one call from _chk or the regular
function is emitted.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Don't add access size hints to fortifiable functions

In the context of a function definition, the size hints imply that the
size of an object pointed to by one parameter is another parameter.
This doesn't make sense for the fortified versions of the functions
since that's the bit it's trying to validate.

This is harmless with __builtin_object_size since it has fairly simple
semantics when it comes to objects passed as function parameters.
With __builtin_dynamic_object_size we could (as my patchset for gcc[1]
already does) use the access attribute to determine the object size in
the general case but it misleads the fortified functions.

Basically the problem occurs when access attributes are present on
regular functions that have inline fortified definitions to generate
_chk variants; the attributes get inherited by these definitions,
causing problems when analyzing them.  For example with poll(fds, nfds,
timeout), nfds is hinted using the __attr_access as being the size of
fds.

Now, when analyzing the inline function definition in bits/poll2.h, the
compiler sees that nfds is the size of fds and tries to use that
information in the function body.  In _FORTIFY_SOURCE=3 case, where the
object size could be a non-constant expression, this information results
in the conclusion that nfds is the size of fds, which defeats the
purpose of the implementation because we're trying to check here if nfds
does indeed represent the size of fds.  Hence for this case, it is best
to not have the access attribute.

With the attributes gone, the expression evaluation should get delayed
until the function is actually inlined into its destinations.

Disable the access attribute for fortified function inline functions
when building at _FORTIFY_SOURCE=3 to make this work better.  The
access attributes remain for the _chk variants since they can be used
by the compiler to warn when the caller is passing invalid arguments.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581125.html

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>

glibcextract.py: Place un-assemblable @@@ in a comment

Unlike GCC, Clang parses asm statements and verifies they are valid
instructions/directives. Place the magic @@@ into a comment to avoid
a parse error.

nss: Unnest nested function add_key

This makes makedb.c compilable with Clang which does not support nested
functions.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

ld.so: Initialize bootstrap_map.l_ld_readonly [BZ #28340]

1. Define DL_RO_DYN_SECTION to initalize bootstrap_map.l_ld_readonly
before calling elf_get_dynamic_info to get dynamic info in bootstrap_map,
2. Define a single

static inline bool
dl_relocate_ld (const struct link_map *l)
{
/* Don't relocate dynamic section if it is readonly */
return !(l->l_ld_readonly || DL_RO_DYN_SECTION);
}

This updates BZ #28340 fix.

timex: Use 64-bit fields on 32-bit TIMESIZE=64 systems (BZ #28469)

This was found when testing the OpenRISC port I am working on.  These
two tests fail with SIGSEGV:

  FAIL: misc/tst-ntp_gettime
  FAIL: misc/tst-ntp_gettimex

This was found to be due to the kernel overwriting the stack space
allocated by the timex structure.  The reason for the overwrite being
that the kernel timex has 64-bit fields and user space code only
allocates enough stack space for timex with 32-bit fields.

On 32-bit systems with TIMESIZE=64 __USE_TIME_BITS64 is not defined.
This causes the timex structure to use 32-bit fields with type
__syscall_slong_t.

This patch adjusts the ifdef condition to allow 32-bit systems with
TIMESIZE=64 to use the 64-bit long long timex definition.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

manual: Update _TIME_BITS to clarify it's user defined

The current language reads "This macro determines...", changing to
"Define this macro...". This is consistent with other feature macro
documentation language.

When I first read the previous language it seems to indicate that the
macro is already defined. By changing the language to "Define this
macro..." it's clear that its the user's responsibility to define it.

nptl: Fix tst-cancel7 and tst-cancelx7 pidfile race

The check for waiting for the pidfile to be created looks wrong. At the
point when ACCESS is run the pid file will always be created and
accessible as it is created during DO_PREPARE. This means that thread
cancellation may be performed before the pid is written to the pidfile.

This was found to be flaky when testing on my OpenRISC platform.

Fix this by using the semaphore to wait for pidfile pid write
completion.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf: Fix elf_get_dynamic_info() for bootstrap

THe d6d89608ac8c broke powerpc for --enable-bind-now because it turned
out that different than patch assumption rtld elf_get_dynamic_info()
does require to handle RTLD_BOOTSTRAP to avoid DT_FLAGS and
DT_RUNPATH (more specially the GLRO usage which is not reallocate
yet).

This patch fixes by passing two arguments to elf_get_dynamic_info()
to inform that by rtld (bootstrap) or static pie initialization
(static_pie_bootstrap). I think using explicit argument is way more
clear and burried C preprocessor, and compiler should remove the
dead code.

I checked on x86_64 and i686 with default options, --enable-bind-now,
and --enable-bind-now and --enable--static-pie. I also check on
aarch64, armhf, powerpc64, and powerpc with default and
--enable-bind-now.

hurd if_index: Explicitly use AF_INET for if index discovery

5bf07e1b3a74 ("Linux: Simplify __opensock and fix race condition [BZ #28353]")
made __opensock try NETLINK then UNIX then INET. On the Hurd, only INET
knows about network interfaces, so better actually specify that in
if_index.

hurd: Fix intr-msg parameter/stack kludge

INTR_MSG_TRAP was tinkering with esp to make it point to
_hurd_intr_rpc_mach_msg's parameters, and notably use (&msg)[-1] which is
meaningless in C.

Instead, just push the parameters on the stack, which also avoids leaving
local variables of _hurd_intr_rpc_mach_msg below esp. We now also
properly express that OPTION and TIMEOUT may be updated during the trap
call.

x86-64: Add test-vector-abi.h/test-vector-abi-sincos.h

Add templates for vector ABI test and use them for vector sincos/sincosf
ABI tests.

elf: Fix dynamic-link.h usage on rtld.c

The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due two issues:

  1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
     architectures and it changes the semantics of various machine
     relocation functions.

  2. The elf_get_dynamic_info() change was done sideways, previously
     to 490e6c62aa get-dynamic-info.h was included by the first
     dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
     It means that the code within elf_get_dynamic_info() that uses
     RTLD_BOOTSTRAP is in fact unused.

To fix 1. this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined.  The ELF_DYNAMIC_RELOCATE call will now have
the relocation fnctions with the expected semantics for the loader.

And to fix 2. part of 4af6982e4c is reverted (the check argument
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.

To reorganize the includes the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined on
dynamic-link.h and dl-machine.h requires it at same time other
dynamic-link.h definition requires dl-machine.h defitions).

Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to its own header.  Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.

The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code required its definition (which would require either include
elf/ folder or add a full path with elf/).

Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

x86: Optimize memset-vec-unaligned-erms.S

No bug.

Optimization are

1. change control flow for L(more_2x_vec) to fall through to loop and
   jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
   size and saves jumps for length > 4x VEC_SIZE.

2. For EVEX/AVX512 move L(less_vec) closer to entry.

3. Avoid complex address mode for length > 2x VEC_SIZE

4. Slightly better aligning code for the loop from the perspective of
   code size and uops.

5. Align targets so they make full use of their fetch block and if
   possible cache line.

6. Try and reduce total number of icache lines that will need to be
   pulled in for a given length.

7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
   jumping to the stosb target in the sse2 code section will almost
   certainly be to a new page. The new version does increase code size
   marginally by duplicating the target but should get better iTLB
   behavior as a result.

test-memset, test-wmemset, and test-bzero are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

x86: Optimize memcmp-evex-movbe.S for frontend behavior and size

No bug.

The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
code size. Smaller / Less hot basic blocks are used for this.

Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.

In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.

The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.

test-memcmp and test-wmemcmp are both passing.

libio: Update tst-wfile-sync to not depend on stdin

The test expects stdin to be a file which is not the case when running
tests over ssh where stdin is piped in.

The test fails with:
error: xlseek.c:27: lseek64 (0, 0, 1): Illegal seek

Update the test to create a temporary file and use that to perform the
test.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf: Update audit tests to not depend on stdout

The tst-audit14, tst-audit15 and tst-audit16 tests all have audit
modules that write to stdout; the test reads from stdout to confirm
what was written.  This assumes the stdout is a file which is not the
case when run over ssh.

This patch updates the tests to use a post run cmp command to compare
the output against and .exp file.  This is similar to how many other
tests work and it fixes the stdout limitation.  Also, this means the
test code can be greatly simplified.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf: Fix elf_get_dynamic_info definition

Before to 490e6c62aa31a8a ('elf: Avoid nested functions in the loader
[BZ #27220]'), elf_get_dynamic_info() was defined twice on rtld.c: on
the first dynamic-link.h include and later within _dl_start(). The
former definition did not define DONT_USE_BOOTSTRAP_MAP and it is used
on setup_vdso() (since it is a global definition), while the former does
define DONT_USE_BOOTSTRAP_MAP and it is used on loader self-relocation.

With the commit change, the function is now included and defined once
instead of defined as a nested function. So rtld.c defines without
defining RTLD_BOOTSTRAP and it brokes at least powerpc32.

This patch fixes by moving the get-dynamic-info.h include out of
dynamic-link.h, which then the caller can corirectly set the expected
semantic by defining STATIC_PIE_BOOTSTRAP, RTLD_BOOTSTRAP, and/or
RESOLVE_MAP.

It also required to enable some asserts only for the loader bootstrap
to avoid issues when called from setup_vdso().

As a side note, this is another issues with nested functions: it is
not clear from pre-processed output (-E -dD) how the function will
be build and its semantic (since nested function will be local and
extra C defines may change it).

I checked on x86_64-linux-gnu (w/o --enable-static-pie),
i686-linux-gnu, powerpc64-linux-gnu, powerpc-linux-gnu-power4,
aarch64-linux-gnu, arm-linux-gnu, sparc64-linux-gnu, and
s390x-linux-gnu.

Reviewed-by: Fangrui Song <maskray@google.com>