A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)

Wed Aug 4 18:11:01 GMT 2021

On 30/07/2021 21:59, Ben Woodard wrote:
>>>
>>> Take this with a grain of salt you are the expert not me. These are three minor issues that I don’t really like about this approach:
>>> 1) I think having lr_sve be an uint8_t will mess up the alignment of the V registers that are immediately after it. It probably needs to be at the end of the structure or it needs to be padded out for alignment.
>>
>> No sure if there will be alignment constraint here, since SVE instruction afaik
>> can load/store unaligned memory and this is not a performance critical code. 
>> In any case, that's why I added 'long double' for the 'lz_zreg' since it should
>> at least set 16 bytes alignment.
>>
> 
> I wasn’t really worried about the Z registers here. I was thinking about the V registers. I tried fixing this myself a few times and one of my attempts looked like yours with a char  lr_sve and in that case I ran into alignment problems. The difference between what you are doing and what I tried is that I moved what are now the D registers (which is broken - you need to save the full 128b V register — this was copied over from V7 and wasn’t updated for 64b V8) down in the structure to after lr_sve and that is what caused the alignment problem in the assembly portion of the code.
> 
> typedef struct La_sve_regs {
>   uint16_t    *lr_preg[3];
>   __uint128_t *lr_zreg[8];
> } La_sve_regs;
> 
> /* Registers for entry into PLT on AArch64.  */
> typedef struct La_aarch64_regs
> {
>     uint64_t    lr_xreg[9];
>     uint64_t    lr_sp;
>     uint64_t    lr_lr;
> =>  char    lr_sve; /* 0 - no SVE, 
>                          1-16 length of the SVE registers in vq (128 bits) */
> 
> => // Note that this is *BEFORE* the V registers 
>   union {
>     /* when sve!=0 accessing the lr_vreg is undefined */
>     __uint128_t lr_vreg[8];
>     La_sve_regs lr_zreg;
>   };
> } La_aarch64_regs;
> 
>>> 2) Second it appears like you have to traverse two pointers to actually get at the registers. In my opinion, it would be better if everything were just offsets from the beginning of the structure so that the actual memory for the Z and P registers is immediately following the La_aarch64_regs structure.
>>
>> It is a trade-off to avoid add minimize the struct space to non-SVE code.  But
>> I don't have a strong preference, we can use this instead:
>>
>> typedef struct La_aarch64_regs
>> {
>>  [...]
>>  uint8_t  lr_sve;           /* 0 - no SVE
>>                                1-16 length of the SVE registers in vq (128 bits)  */
>>  uint16_t    *lr_preg[8];   /* NULL - no SVE.  */
>>  long double *lr_zreg[4];   /* NULL - no SVE.  */
>> };
>>
> 
> You are correct, This isn’t a performance critical code path. If you are watching the pltenter/exit things are going to suck anyway.  I was trying to be too clever. I was sort of thinking: Whose going to allocate the memory for this structure - how do we make it thread safe how are we going to deal with all 100 threads coming through here all at once. So I was thinking let’s keep the whole register structure on the stack. 
> 
> With your idea my best guess is that the lr_sve_regs would be allocated immediately after the La_aarch64_regs structure on the stack and so in memory it would look like:
> 
>   uint64_t    lr_xreg[9];
>   __uint128_t lr_vreg[8];
>   uint64_t    lr_sp;
>   uint64_t    lr_lr;
>   uint8_t;    lr_sve; 
>   La_sve_regs *lr_sve_regs; -> SVE_REGS
> SVE_REGS:
>    uint16_t    *lr_preg; -> SVE_P_REGS
>    long double *lr_zreg; -> SVE_P_REGS+3*lr_sve*sizeof(uint16_t)
> SVE_P_REGS;
>    unit16_t lr_preg[3*lr_sve]; 
>    long double lr_zreg[8*lr_sve];
> 
> If my guess about where the memory was going to be allocated is correct, then you have a bunch or pointer arithmetic for things which going to be right there anyway at static offsets.
> 
> Really it doesn’t matter, we can do it anyway you like. We just need it to work. Your way makes sense to a normal C programmer. I was trying to come up with something more clever that would take up less space which wouldn’t have all the pointer following. However, I couldn’t get it working.

I updated my branch with a POC patch to support SVE for rtld-audit [1].
In the end the layout I ended up using is:

  typedef union
  {
    float s;
    double d;
    long double q;
    long double *z;
  } La_aarch64_vector;

  /* Registers for entry into PLT on AArch64.  */
  typedef struct La_aarch64_regs
  {
    uint64_t          lr_xreg[9];
    La_aarch64_vector lr_vreg[8];
    uint64_t          lr_sp;
    uint64_t          lr_lr;
    uint8_t           lr_sve;
    uint16_t          *lr_sve_pregs[4];
  } La_aarch64_regs;

  /* Return values for calls from PLT on AArch64.  */
  typedef struct La_aarch64_retval
  {
    /* Up to eight integer registers can be used for a return value.  */
    uint64_t          lrv_xreg[8];
    /* Up to eight V registers can be used for a return value.  */
    La_aarch64_vector lrv_vreg[8];
    uint8_t           lrv_sve;
  } La_aarch64_retval;

It means the if 'lr_sve' is 0 in either La_aarch64_regs or La_aarch64_retval
the La_aarch64_vector contains the floating-pointer registers that can be
accessed directly.  Otherwise, 'La_aarch64_vector.z' points to a memory area
that holds up to 'lr_sve' bytes from the Z registers, which can be loaded
with svld1 intrinsic for instance (as tst-audit28.c does). The P register
follows the same logic, with each La_aarch64_regs.lr_sve_pregs pointing
to an area of memory 'lr_sve/8' in size.

So, to access the FP register as float you can use:

  static inline float regs_vec_to_float (const La_aarch64_regs *regs, int i)
  {
    float r;
    if (regs->lr_sve == 0)
      r = regs->lr_vreg[i].s;
    else
      memcpy (&r, &regs->lr_vreg[i].z[0], sizeof (r));
    return r;
  }

To implement it I had to setup lazy binding when profiling or auditing is
used, even when STO_AARCH64_VARIANT_PCS is being set.  Also, to not incur
in performance penalties on default non-SVE configuration, the patch uses
a new PTL entrypoint, _dl_runtime_profile_sve, which is used iff 'hwcap'
has HWCAP_SVE bit set.

I think this is a fair assumption since SVE has a defined set of registers
for argument passing and return values.  A new ABI with either different
argument passing or different registers would require a different PLT
entry, but I assume this would require another symbol flag anyway (or
at least a different ELF mark to indicate so).

For this POC, the profile '_dl_runtime_profile_sve' entrypoint assumes
the largest SVE register size possible (2048 bits) and thus it requires
a quite large stack (8976 bytes).  I think it would be possible make the
stack requirement dynamic depending of the vector length, but it would
make the PLT audit function way more complex.

This patch is not complete yet: the tst-audit28 does not check if compiler
supports SVE (we would need a configure check to disable for such case),
I need to add a proper comment for the _dl_runtime_profile_sve
stack layout, the test need to check for the P register state clobbering.

I also haven't check the performance penalties with this approach, and
maybe the way I am saving/restoring the SVE register might be optimized.

In any case, I checked on a SVE machine and at least the testcase work
as expected without any regressions.  I also did a sniff test on a non SVE
machine.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ld-audit-fixes