Bug 26643

Summary:	register x8 and quad sized NEON registers are not properly saved when using ld_audit on aarch64
Product:	glibc	Reporter:	Ben Woodard <woodard>
Component:	dynamic-link	Assignee:	Not yet assigned to anyone <unassigned>
Status:	RESOLVED FIXED
Severity:	normal	CC:	nnye, nsz
Priority:	P2
Version:	unspecified
Target Milestone:	2.35
Host:		Target:
Build:		Last reconfirmed:
Bug Depends on:
Bug Blocks:	28366
Attachments:	reproducer that demonstrates that x8 is not being saved. Patch to fix the x8 problem and the NEON quad sized vector register problems V2 of the patch Second V2 version of the patch

Description Ben Woodard 2020-09-22 06:09:30 UTC

Created attachment 12853 [details]
reproducer that demonstrates that x8 is not being saved.

Description of problem:
According to the Arm Cortex-A Series Programmers Guide for ARMv8-A It is part of the ABI for the architecture.

See Section 9.1.1 Parameters in general-purpose registers on page 9-3 where it states:

   Registers with a special purpose (X8, X16-X18, X29, X30)
   • X8 is the indirect result register. This is used to pass the address location of an indirect result, for example, where a function returns a large structure.gister should be preserved. 

You can see that this register is saved in the normal code path which is what is normally run. See sysdeps/aarch64/dl-trampoline.S:48 :
	/* Save arguments.  */
	stp	x8, x9, [sp, #-(80+8*16)]!
	cfi_adjust_cfa_offset (80+8*16)
	cfi_rel_offset (x8, 0)
	cfi_rel_offset (x9, 8)
However, when you use LD_AUDIT the rtld shifts into what it calls profiling mode. In that profiling mode x8 is not properly saved. See there is no x8 being saved around line 185 in _dl_runtime_profile:

	stp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
	cfi_rel_offset (x0, OFFSET_RG + DL_OFFSET_RG_X0 + 16*0 + 0)
	cfi_rel_offset (x1, OFFSET_RG + DL_OFFSET_RG_X0 + 16*0 + 8)
	stp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
	cfi_rel_offset (x2, OFFSET_RG + DL_OFFSET_RG_X0 + 16*1 + 0)
	cfi_rel_offset (x3, OFFSET_RG + DL_OFFSET_RG_X0 + 16*1 + 8)
	stp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
	cfi_rel_offset (x4, OFFSET_RG + DL_OFFSET_RG_X0 + 16*2 + 0)
	cfi_rel_offset (x5, OFFSET_RG + DL_OFFSET_RG_X0 + 16*2 + 8)
	stp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
	cfi_rel_offset (x6, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 0)
	cfi_rel_offset (x7, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 8)

Yeah you got the Parameter and result registers but where is the Indirect result location register? After that block of code, you immediately start saving the d registers.

When you start looking at the code in _dl_runtime_profile it looks like it has been copied from a previous version of the processor where there were no Q registers in the V register group. In other words the V registers were only uint64_t's not uint128_t's. 

x29 ends up being loaded with a pointer to a struct of type la_aarch64_regs 

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
  uint64_t lr_xreg[8];
  uint64_t lr_dreg[8];
  uint64_t lr_sp;
  uint64_t lr_lr;
} La_aarch64_regs;

This doesn't include space for x8 the indirect result register. Nor does it have space for the full sized v registers. Furthermore when you look at line 136 where it describes the stack frame layout

	   Stack frame layout:
	   [sp,   #...] lr
	   [sp,   #...] &PLTGOT[n]
	   [sp,    #96] La_aarch64_regs
	   [sp,    #48] La_aarch64_retval
	   [sp,    #40] frame size return from pltenter
	   [sp,    #32] dl_profile_call saved x1
	   [sp,    #24] dl_profile_call saved x0
	   [sp,    #16] t1
	   [sp,     #0] x29, lr   <- x29

what _dl_runtime_resolve actually pushes on the stack the La_aarch64_regs part doesn't match the what _dl_runtime_profile is expecting. The size of the structure is quite different.

To be able to support those the structure would have to be:

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
  uint64_t    lr_xreg[10];
  __uint128_t lr_vreg[8]; /* or uint64_t lr_vreg[16]; */
  uint64_t    lr_sp;
  uint64_t    lr_lr;
} La_aarch64_regs;

Version-Release number of selected component (if applicable):
All - this bug has been there since the beginning of the aarch64 port

How reproducible:
Always

Steps to Reproduce:
There are two reproducers in the uploaded tar.gz which I got for the original reporter. One is very simple and should work but variations in the compiler can change how the structure is returned and cause that reproducer not to work. The other reproducer is a much bigger bit of code which always forces the use of x8 but it is much more code.


Additional info:
This bug report was provided by developers at Rice University Jonathon Anderson, John Mellor-Crummey, Xiaozhu Meng, Mark Krentel working on behalf of LLNL

Comment 1 Ben Woodard 2020-09-22 06:18:08 UTC

Created attachment 12854 [details]
Patch to fix the x8 problem and the NEON quad sized vector register problems

his patch was tested to work on the x8 reproducer attached to this bug.

$ LD_AUDIT=../break-glibc/arm-audit-bug/simple/libauditor.so ../break-glibc/arm-audit-bug/simple/main
Segmentation fault (core dumped)
$ builddir=`pwd` LD_AUDIT=../break-glibc/arm-audit-bug/simple/libauditor.so CONV_PATH="${builddir}"/iconvdata LOCPATH="${builddir}"/localedata LC_ALL=C  "${builddir}"/elf/ld-linux-aarch64.so.1 --library-path "${builddir}":"${builddir}"/math:"${builddir}"/elf:"${builddir}"/dlfcn:"${builddir}"/nss:"${builddir}"/nis:"${builddir}"/rt:"${builddir}"/resolv:"${builddir}"/mathvec:"${builddir}"/support:"${builddir}"/crypt:"${builddir}"/nptl ../break-glibc/arm-audit-bug/simple/main
$

The NEON Vector register size problem is also fixed but a reproducer has not been developed. The analogous SVE problem is not yet solved. I do not think that we want to be pushing 7*2048 bits of z registers and 3*256 bits of p registers into the La_aarch64_regs structure each and every call. So I do not know exactly how that should be addressed.

Comment 2 Ben Woodard 2020-09-23 01:07:15 UTC

Created attachment 12856 [details]
V2 of the patch

Fixes added to the La_aarch64_retval structure.

Comment 3 Ben Woodard 2020-09-23 18:21:41 UTC

Created attachment 12859 [details]
Second V2 version of the patch

Patch numbering system kind of got jumbled this is the 2nd version of the patch that I sent upstream. It incorporates feedback from Szabolcs Nagy

Comment 4 Nathan Nye 2021-10-05 21:17:41 UTC

This looks related to https://sourceware.org/bugzilla/show_bug.cgi?id=28366

Comment 5 Ben Woodard 2021-10-05 21:24:42 UTC

Proposed patch: https://patchwork.sourceware.org/project/glibc/patch/20210730194715.881900-21-adhemerval.zanella@linaro.org/

Comment 6 Szabolcs Nagy 2022-03-25 16:33:55 UTC

fixed for 2.35 in commit ce9a68c57c260c8417afc93972849ac9ad243ec4

https://sourceware.org/git/?p=glibc.git;a=commit;h=ce9a68c57c260c8417afc93972849ac9ad243ec4

elf: Fix runtime linker auditing on aarch64 (BZ #26643)

The rtld audit support show two problems on aarch64:

1. _dl_runtime_resolve does not preserve x8, the indirect result
location register, which might generate wrong result calls
depending of the function signature.

2. The NEON Q registers pushed onto the stack by _dl_runtime_resolve
were twice the size of D registers extracted from the stack frame by
_dl_runtime_profile.

While 2. might result in wrong information passed on the PLT tracing,
1. generates wrong runtime behaviour.

The aarch64 rtld audit support is changed to:

* Both La_aarch64_regs and La_aarch64_retval are expanded to include
both x8 and the full sized NEON V registers, as defined by the
ABI.

* dl_runtime_profile needed to extract registers saved by
_dl_runtime_resolve and put them into the new correctly sized
La_aarch64_regs structure.

* The LAV_CURRENT check is change to only accept new audit modules
to avoid the undefined behavior of not save/restore x8.

* Different than other architectures, audit modules older than
LAV_CURRENT are rejected (both La_aarch64_regs and La_aarch64_retval
changed their layout and there are no requirements to support multiple
audit interface with the inherent aarch64 issues).

* A new field is also reserved on both La_aarch64_regs and
La_aarch64_retval to support variant pcs symbols.

Similar to x86, a new La_aarch64_vector type to represent the NEON
register is added on the La_aarch64_regs (so each type can be accessed
directly).

Since LAV_CURRENT was already bumped to support bind-now, there is
no need to increase it again.

Checked on aarch64-linux-gnu.

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>