[PATCH] Fix runtime linker auditing on aarch64

Ben Coyote Woodard woodard@redhat.com
Wed Sep 23 18:00:40 GMT 2020



On Wed, Sep 23, 2020 at 13:34, Szabolcs Nagy <szabolcs.nagy@arm.com> 
wrote:
>> 
>>  An additional problem not addressed by this patch is what to do about
>>  the changes to the aarch64 ABI needed to support SVE. A discussion
>>  about what to do about that was begun on libc-alpha here:
>>  <https://sourceware.org/pipermail/libc-alpha/2020-September/117797.html>
> 
> SVE calls are special (marked as STO_AARCH64_VARIANT_PCS
> in the dynamic symbol table) and currently load time
> bound (like BIND_NOW) so i think they don't go through
> the PLT0 sequence that calls the profile entry in ld.so
> and thus audit hooks are not called for them.
> 
> this is probably not what LD_AUDIT users would want
> (do they care about hooking into sve calls?), but

Most certainly!!
The two big users of LD_AUDIT that I often work with are Spindle 
(https://computing.llnl.gov/projects/spindle) and HPCToolkit 
(http://hpctoolkit.org/). I was talking to a colleague over at Cray/HPE, 
where they are selling the Apollo 80, which uses the A64FX processor 
with SVE, and he immediately mentioned Spindle and said he would talk to 
the Cray team that handles it. So this problem with SVE may not have 
been reported by users yet, but it is just barely over the horizon.

> VARIANT_PCS essentially allows any call convention,

That seems like a really odd choice to me. You guys over at ARM did 
publish an AAPCS for processors with SVE defining which registers need 
to be saved versus which can be clobbered:
https://developer.arm.com/documentation/100986/0000/?search=5eec7447e24a5e02d07b2774

I've been trying to figure out how to handle this. I needed to know 
what the structure of La_aarch64_{regs,retval} would be before I really 
dug deeply into it. If we can just define how you want the structures 
presented, I will update my auditing patch to use that layout and then 
we can figure out how to fill them in later.
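
To make concrete what I mean by presenting the structures, here is 
roughly how I picture an audit module consuming whatever layout we agree 
on. This is only a hypothetical sketch: it assumes the 
lr_xreg[9]/lr_vreg[8] fields proposed in the patch below and the usual 
la_aarch64_gnu_pltenter prototype from <bits/link.h>; nothing in it is 
meant to be final.

#include <link.h>
#include <stdio.h>

unsigned int
la_version (unsigned int version)
{
  /* Accept whatever audit interface version ld.so offers.  */
  return version;
}

ElfW(Addr)
la_aarch64_gnu_pltenter (ElfW(Sym) *sym, unsigned int ndx,
                         uintptr_t *refcook, uintptr_t *defcook,
                         La_aarch64_regs *regs, unsigned int *flags,
                         const char *symname, long int *framesizep)
{
  /* With the proposed layout, all eight argument registers, x8 (the
     indirect result location register) and the full 128-bit V
     registers are visible here.  */
  fprintf (stderr, "enter %s: x0=%#lx x8=%#lx\n", symname,
           (unsigned long) regs->lr_xreg[0],
           (unsigned long) regs->lr_xreg[8]);
  return sym->st_value;
}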

The pieces that I have, but haven't quite put together yet, are:

- There is HWCAP_SVE in sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h. 
As I understand it so far, this is an ELF attribute that the dynamic 
linker could use when picking which version of a library to load. If 
this were propagated into the dynamic linker, we could tell at least at 
a library level whether any functions in the library use SVE. If none 
do, then we wouldn't have to save the SVE registers. (A minimal runtime 
check for the hardware side of this is sketched just after this list.)
    - I haven't checked whether libraries do in fact set this HWCAP 
flag. If they don't, then that is probably a binutils issue.
    - I haven't yet figured out how to find and make use of this 
library flag where I need it, in _dl_runtime_resolve and 
_dl_runtime_profile. If it is not easy to reach there, I have considered 
pushing the processor's hwcap and the library's hwcap onto the stack 
from the C code that calls those functions.
    - If the library's hwcap flags are not available to an audit 
library, we need to define a way to make them available, because they 
are needed for correct library selection. Consider, for example, the 
case where an audit library is trying to interpose its own selection of 
a library over the normal selection made by the runtime linker. It needs 
to be able to evaluate whether the hardware has all the capabilities 
that the library requires.
    - Similarly, if a library does not make use of any of the NEON 
registers, then we may be able to get away with saving just the X regs 
and skipping the V regs. Those are less of a problem than the 
potentially 2048-bit Z regs, but it all adds up.
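
The hardware half of that check is the easy part; here is a minimal 
sketch, assuming the usual getauxval() route (inside ld.so the same bit 
should already be available as GLRO(dl_hwcap)):

#include <stdbool.h>
#include <sys/auxv.h>   /* getauxval; pulls in the bits/hwcap.h above for HWCAP_SVE.  */

static bool
cpu_has_sve (void)
{
  /* AT_HWCAP mirrors what the kernel reports for this CPU.  The open
     question above is the per-library side, not this part.  */
  return (getauxval (AT_HWCAP) & HWCAP_SVE) != 0;
}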

However, all of that happens at the library level. Since SVE functions 
are marked differently (STO_AARCH64_VARIANT_PCS) in the dynamic symbol 
table, it would be even better to make the decision about what needs to 
be saved at a per-function level.
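
A minimal sketch of the per-function test I have in mind, assuming we 
can get at the ElfW(Sym) for the symbol being resolved (elf.h already 
defines the STO_AARCH64_VARIANT_PCS bit):

#include <elf.h>
#include <link.h>
#include <stdbool.h>

/* True if this dynamic symbol is marked as using a variant calling
   convention, which is how SVE-PCS functions show up; for such a
   symbol the trampoline would have to save the full register state.  */
static bool
sym_needs_full_save (const ElfW(Sym) *sym)
{
  return (sym->st_other & STO_AARCH64_VARIANT_PCS) != 0;
}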

I noted that I can potentially use ID_AA64PFR0_EL1[35:32] to see 
whether the processor has SVE and then read ZCR_EL1[3:0] to figure out 
the configured length of the SVE registers. Then I would use ld1 and 
st1 to copy registers of that length to a char[] or something.
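
From EL0 the same number seems reachable without touching those system 
registers directly. A minimal sketch, assuming Linux's PR_SVE_GET_VL 
prctl is fair game here (the rdvl instruction would give the same answer 
from assembly):

#include <stdio.h>
#include <sys/prctl.h>   /* PR_SVE_GET_VL, PR_SVE_VL_LEN_MASK */

int
main (void)
{
  int ret = prctl (PR_SVE_GET_VL);
  if (ret < 0)
    {
      puts ("SVE not available");
      return 1;
    }
  /* The low bits hold the current vector length in bytes per Z register.  */
  int vl_bytes = ret & PR_SVE_VL_LEN_MASK;
  printf ("Z registers are currently %d bytes (%d bits) wide\n",
          vl_bytes, vl_bytes * 8);
  return 0;
}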

Admittedly the register-poking plan above is less than half-baked, 
which is why I wanted to agree on a La_aarch64_{regs,retval} structure 
first.


> so all registers have to be saved and restored if
> such call enters the dynamic linker which is a problem
> if register state may be extended in the future
> (although probably ldaudit is special enough that its
> users can update glibc if they care about new regs?).
> 
> (one way to expose variant pcs calls to audit hooks
> is to detect the symbol type in the asm entry point
> and then call a different hook, but this sounds
> sufficiently ugly that i think we would then prefer
> to update the elf abi with a second plt entry sequence
> for variant pcs calls that can just use a different
> entry into ld.so and new linkers would generate that
> for dsos with variant pcs symbols.)
> 
>>  ---
>>   sysdeps/aarch64/bits/link.h     | 17 ++++----
>>   sysdeps/aarch64/dl-link.sym     |  4 +-
>>   sysdeps/aarch64/dl-trampoline.S | 75 +++++++++++++++++++++------------
>>   3 files changed, 59 insertions(+), 37 deletions(-)
>> 
>>  diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
>>  index 0c54e6ea7b..2b43ace57c 100644
>>  --- a/sysdeps/aarch64/bits/link.h
>>  +++ b/sysdeps/aarch64/bits/link.h
>>  @@ -23,19 +23,20 @@
>>   /* Registers for entry into PLT on AArch64.  */
>>   typedef struct La_aarch64_regs
>>   {
>>  -  uint64_t lr_xreg[8];
>>  -  uint64_t lr_dreg[8];
>>  -  uint64_t lr_sp;
>>  -  uint64_t lr_lr;
>>  +  uint64_t    lr_xreg[9];
>>  +  __uint128_t lr_vreg[8];
>>  +  uint64_t    lr_sp;
>>  +  uint64_t    lr_lr;
>>   } La_aarch64_regs;
> 
> ok.
> 
> changing abi is fine with me: old abi was
> unusably broken, if audit modules use some
> versioning that's even better.
> 
>> 
>>   /* Return values for calls from PLT on AArch64.  */
>>   typedef struct La_aarch64_retval
>>   {
>>  -  /* Up to two integer registers can be used for a return value.  */
>>  -  uint64_t lrv_xreg[2];
>>  -  /* Up to four D registers can be used for a return value.  */
>>  -  uint64_t lrv_dreg[4];
>>  +  /* Up to eight integer registers and the indirect result location register
>>  +     can be used for a return value.  */
>>  +  uint64_t    lrv_xreg[9];
> 
> x8 is not preserved so recording it at function exit
> is not useful. (on entry it points to where results
> are stored but on exit it can be clobbered)
> 
>>  +  /* Up to eight V registers can be used for a return value.  */
>>  +  __uint128_t lrv_vreg[8];
>> 
>>   } La_aarch64_retval;
>>   __BEGIN_DECLS
> 
> note: i don't like to use non-standard types in
> public apis (like __uint128_t), but we already
> made this mistake in the linux sigcontext, so this
> is probably ok.
> 
> (my preference normally is to use a standard type
> e.g. long double or char[] with alignment attr,
> but in practice __uint128_t is probably easier to
> deal with)
> 
> 
>>  diff --git a/sysdeps/aarch64/dl-link.sym b/sysdeps/aarch64/dl-link.sym
>>  index d67d28b40c..70d153a1d5 100644
>>  --- a/sysdeps/aarch64/dl-link.sym
>>  +++ b/sysdeps/aarch64/dl-link.sym
>>  @@ -7,9 +7,9 @@ DL_SIZEOF_RG		sizeof(struct La_aarch64_regs)
>>   DL_SIZEOF_RV		sizeof(struct La_aarch64_retval)
>> 
>>   DL_OFFSET_RG_X0		offsetof(struct La_aarch64_regs, lr_xreg)
>>  -DL_OFFSET_RG_D0		offsetof(struct La_aarch64_regs, lr_dreg)
>>  +DL_OFFSET_RG_V0		offsetof(struct La_aarch64_regs, lr_vreg)
>>   DL_OFFSET_RG_SP		offsetof(struct La_aarch64_regs, lr_sp)
>>   DL_OFFSET_RG_LR		offsetof(struct La_aarch64_regs, lr_lr)
>> 
>>   DL_OFFSET_RV_X0		offsetof(struct La_aarch64_retval, lrv_xreg)
>>  -DL_OFFSET_RV_D0		offsetof(struct La_aarch64_retval, lrv_dreg)
>>  +DL_OFFSET_RV_V0		offsetof(struct La_aarch64_retval, lrv_vreg)
> 
> ok.
> 
>>  diff --git a/sysdeps/aarch64/dl-trampoline.S b/sysdeps/aarch64/dl-trampoline.S
>>  index 794876fffa..c91341e8fc 100644
>>  --- a/sysdeps/aarch64/dl-trampoline.S
>>  +++ b/sysdeps/aarch64/dl-trampoline.S
>>  @@ -46,6 +46,8 @@ _dl_runtime_resolve:
>>   	cfi_rel_offset (lr, 8)
>> 
>>   	/* Save arguments.  */
>>  +	/* Note: Saving x9 is not required by the ABI but the assembler requires
>>  +	   the immediate values of operand 3 to be a multiple of 16.  */
>>   	stp	x8, x9, [sp, #-(80+8*16)]!
>>   	cfi_adjust_cfa_offset (80+8*16)
>>   	cfi_rel_offset (x8, 0)
> 
> ok.
> 
>>  @@ -183,19 +185,23 @@ _dl_runtime_profile:
> 
> there is a comment at the entry point with offsets
> which i think is no longer valid (i think it's
> ok to remove the offsets just document the order
> of things on the stack)
> 
>>   	stp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
>>   	cfi_rel_offset (x6, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 0)
>>   	cfi_rel_offset (x7, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 8)
>>  -
>>  -	stp	d0, d1, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
>>  -	cfi_rel_offset (d0, OFFSET_RG + DL_OFFSET_RG_D0 + 16*0)
>>  -	cfi_rel_offset (d1, OFFSET_RG + DL_OFFSET_RG_D0 + 16*0 + 8)
>>  -	stp	d2, d3, [X29, #OFFSET_RG+ DL_OFFSET_RG_D0 + 16*1]
>>  -	cfi_rel_offset (d2, OFFSET_RG + DL_OFFSET_RG_D0 + 16*1 + 0)
>>  -	cfi_rel_offset (d3, OFFSET_RG + DL_OFFSET_RG_D0 + 16*1 + 8)
>>  -	stp	d4, d5, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
>>  -	cfi_rel_offset (d4, OFFSET_RG + DL_OFFSET_RG_D0 + 16*2 + 0)
>>  -	cfi_rel_offset (d5, OFFSET_RG + DL_OFFSET_RG_D0 + 16*2 + 8)
>>  -	stp	d6, d7, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
>>  -	cfi_rel_offset (d6, OFFSET_RG + DL_OFFSET_RG_D0 + 16*3 + 0)
>>  -	cfi_rel_offset (d7, OFFSET_RG + DL_OFFSET_RG_D0 + 16*3 + 8)
>>  +	str	x8, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*4 + 0]
>>  +	cfi_rel_offset (x8, OFFSET_RG + DL_OFFSET_RG_X0 + 16*4 + 0)
>>  +	/* Note X9 is in the stack frame for alignment but it is not
>>  +	   required to be saved by the ABI */
>>  +
> 
> i dont see x9 here. you can just note that there is padding.
> 
>>  +	stp	q0, q1, [X29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*0]
>>  +	cfi_rel_offset (q0, OFFSET_RG + DL_OFFSET_RG_V0 + 32*0)
>>  +	cfi_rel_offset (q1, OFFSET_RG + DL_OFFSET_RG_V0 + 32*0 + 16)
>>  +	stp	q2, q3, [X29, #OFFSET_RG+ DL_OFFSET_RG_V0 + 32*1]
>>  +	cfi_rel_offset (q2, OFFSET_RG + DL_OFFSET_RG_V0 + 32*1 + 0)
>>  +	cfi_rel_offset (q3, OFFSET_RG + DL_OFFSET_RG_V0 + 32*1 + 16)
>>  +	stp	q4, q5, [X29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*2]
>>  +	cfi_rel_offset (q4, OFFSET_RG + DL_OFFSET_RG_V0 + 32*2 + 0)
>>  +	cfi_rel_offset (q5, OFFSET_RG + DL_OFFSET_RG_V0 + 32*2 + 16)
>>  +	stp	q6, q7, [X29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*3]
>>  +	cfi_rel_offset (q6, OFFSET_RG + DL_OFFSET_RG_V0 + 32*3 + 0)
>>  +	cfi_rel_offset (q7, OFFSET_RG + DL_OFFSET_RG_V0 + 32*3 + 16)
>> 
>>   	add     x0, x29, #SF_SIZE + 16
>>   	ldr	x1, [x29, #OFFSET_LR]
>>  @@ -234,10 +240,11 @@ _dl_runtime_profile:
>>   	ldp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
>>   	ldp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
>>   	ldp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
>>  -	ldp	d0, d1, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
>>  -	ldp	d2, d3, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*1]
>>  -	ldp	d4, d5, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
>>  -	ldp	d6, d7, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
>>  +	ldr	x8, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*4]
>>  +	ldp	q0, q1, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 16*0]
>>  +	ldp	q2, q3, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*1]
>>  +	ldp	q4, q5, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*2]
>>  +	ldp	q6, q7, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*3]
>> 
>>   	cfi_def_cfa_register (sp)
>>   	ldp	x29, x30, [x29, #0]
>>  @@ -280,14 +287,21 @@ _dl_runtime_profile:
>>   	ldp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
>>   	ldp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
>>   	ldp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
>>  -	ldp	d0, d1, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
>>  -	ldp	d2, d3, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*1]
>>  -	ldp	d4, d5, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
>>  -	ldp	d6, d7, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
>>  +	ldr	x8, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*4]
>>  +	ldp	q0, q1, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*0]
>>  +	ldp	q2, q3, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*1]
>>  +	ldp	q4, q5, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*2]
>>  +	ldp	q6, q7, [x29, #OFFSET_RG + DL_OFFSET_RG_V0 + 32*3]
>>   	blr	ip0
> 
> ok.
> 
>>  -	stp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0]
>>  -	stp	d0, d1, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*0]
>>  -	stp	d2, d3, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*1]
>>  +	stp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*0]
>>  +	stp	x2, x3, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*1]
>>  +	stp	x4, x5, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*2]
>>  +	stp	x6, x7, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*3]
>>  +	str	x8, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*4]
> 
> i think storing x8 is not useful.
> 
>>  +	stp	q0, q1, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*0]
>>  +	stp	q2, q3, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*1]
>>  +	stp	q4, q5, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*2]
>>  +	stp	q6, q7, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*3]
>> 
>>   	/* Setup call to pltexit  */
>>   	ldp	x0, x1, [x29, #OFFSET_SAVED_CALL_X0]
>>  @@ -295,9 +309,16 @@ _dl_runtime_profile:
>>   	add	x3, x29, #OFFSET_RV
>>   	bl	_dl_call_pltexit
>> 
>>  -	ldp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0]
>>  -	ldp	d0, d1, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*0]
>>  -	ldp	d2, d3, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*1]
>>  +	ldp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*0]
>>  +	ldp	x2, x3, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*1]
>>  +	ldp	x4, x5, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*2]
>>  +	ldp	x6, x7, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*3]
>>  +	ldr	x8, [x29, #OFFSET_RV + DL_OFFSET_RV_X0 + 16*4]
>>  +	ldp	q0, q1, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*0]
>>  +	ldp	q2, q3, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*1]
>>  +	ldp	q4, q5, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*2]
>>  +	ldp	q6, q7, [x29, #OFFSET_RV + DL_OFFSET_RV_V0 + 32*3]
>>  +
>>   	/* LR from within La_aarch64_reg */
>>   	ldr	lr, [x29, #OFFSET_RG + DL_OFFSET_RG_LR]
>>   	cfi_restore(lr)
> 
> thanks for the patch.
> 


