This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] aarch64: optimized memcpy implementation for thunderx2


On 01/10/18 23:42, Steve Ellcey wrote:
> On Mon, 2018-10-01 at 19:22 +0300, Anton Youdkevitch wrote:
>> +L(dst_unaligned):
>> +       /* For the unaligned store case the code loads two
>> +          aligned chunks and then merges them using the ext
>> +          instruction. This can be up to 30% faster than
>> +          a simple unaligned store access.
>> +
>> +          Current state: tmp1 = dst % 16; C_q, D_q, E_q
>> +          contain data yet to be stored. src and dst point
>> +          to the next-to-be-processed data. A_q, B_q contain
>> +          data already stored before, count = bytes left to
>> +          be loaded, decremented by 64.
>> +
>> +          Control is passed here if at least 64 bytes are
>> +          left to be loaded. The code does two aligned loads
>> +          and then extracts (16-tmp1) bytes from the first
>> +          register and tmp1 bytes from the next register,
>> +          forming the value for the aligned store.
>> +
>> +          As the ext instruction can only have its index
>> +          encoded as an immediate, 15 code chunks process
>> +          each possible index value. A computed goto is
>> +          used to reach the required code. */
>> +
>> +       /* Store the 16 bytes to dst and align dst for further
>> +          operations; several bytes will be stored at this
>> +          address once more.  */
>> +       str     C_q, [dst], #16
>> +       ldp     F_q, G_q, [src], #32
>> +       bic     dst, dst, 15
>> +       adr     tmp2, L(load_and_merge)
>> +       add     tmp2, tmp2, tmp1, LSL 7
>> +       sub     tmp2, tmp2, 128
>> +       br      tmp2
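
The aligned-load-and-merge idea in the quoted comment can be modeled in plain C. The helper below is hypothetical (not part of the patch); it reproduces the byte selection of `ext Vd, Vn, Vm, #off`: the top (16-off) bytes of the first aligned chunk followed by the low `off` bytes of the next one. Whether `off` maps to `tmp1` or `16-tmp1` depends on register ordering in the real code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the "two aligned loads + ext" merge: build the 16 bytes
   that live at a source address misaligned by `off` (1..15) from
   two consecutive 16-byte aligned chunks.  */
static void merge16(uint8_t out[16], const uint8_t lo[16],
                    const uint8_t hi[16], unsigned off)
{
    memcpy(out, lo + off, 16 - off);    /* bytes off..15 of chunk 0 */
    memcpy(out + (16 - off), hi, off);  /* bytes 0..off-1 of chunk 1 */
}
```

Because `off` is a byte index into a vector register, the hardware `ext` needs it as an immediate, which is exactly why the patch emits 15 specialized code chunks instead of one parameterized loop body.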
> 
> Anton,
> 
> As far as the actual code, I think my only concern is this use of a
> 'computed goto' to jump to one of the extract sections.  It seems very
> brittle since a change in the alignment of the various sections or a
> change in the size of those sections could mess up this jump.  Would
> the code be any slower if you used a jump table instead of a computed
> goto?

is the 16-byte alignment really needed (i.e. is 8-byte not enough)?
the code is fairly big with 16 alignment cases.
the indirect jump may be difficult to predict in real workloads.
otherwise the computed jump is acceptable, just document how
many instructions one entry can have at most (32?) so it's less
brittle in case somebody tries to modify the code.

the difference seems significant, so if you are happy with the
code i will accept it.
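
For comparison, the jump-table alternative Steve raises can be sketched in C using GCC's labels-as-values extension (all names here are hypothetical). Dispatch goes through an array of label addresses, so it keeps working even if one case grows, unlike the patch's computed goto, which requires every chunk to fit in a fixed 128-byte slot (`tmp1, LSL 7`, i.e. at most 32 A64 instructions per entry).

```c
#include <assert.h>

/* Jump-table dispatch sketch (GCC labels-as-values extension).
   Each case stands in for one of the per-index ext code chunks;
   the table absorbs any difference in chunk sizes.  */
static int dispatch(unsigned idx)
{
    static void *const table[] = { &&case1, &&case2, &&case3 };
    goto *table[idx];
case1: return 1;   /* stands in for the ext #1 chunk */
case2: return 2;   /* stands in for the ext #2 chunk */
case3: return 3;   /* stands in for the ext #3 chunk */
}
```

The trade-off is one extra memory load per dispatch to fetch the target address, versus the computed goto's pure address arithmetic; both are indirect branches, so the prediction concern above applies equally to either form.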
