This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: optimized memcpy implementation for thunderx2


Steve,

On 02.10.2018 01:42, Steve Ellcey wrote:
> On Mon, 2018-10-01 at 19:22 +0300, Anton Youdkevitch wrote:
>>
>> +L(dst_unaligned):
>> +       /* For the unaligned store case the code loads two
>> +          aligned chunks and then merges them using ext
>> +          instrunction. This can be up to 30% faster than
>> +          the the simple unaligned store access.
>> +
>> +          Current state: tmp1 = dst % 16; C_q, D_q, E_q
>> +          contain data yet to be stored. src and dst point
>> +          to the next-to-be-processed data. A_q, B_q contain
>> +          data already stored before, count = bytes left to
>> +          be loaded, decremented by 64.
>> +
>> +          Control is passed here if at least 64 bytes are
>> +          left to be loaded. The code does two aligned loads
>> +          and then extracts (16-tmp1) bytes from the first
>> +          register and tmp1 bytes from the next register,
>> +          forming the value for the aligned store.
>> +
>> +          As the ext instruction can only have its index
>> +          encoded as an immediate, 15 code chunks process
>> +          each possible index value. A computed goto is
>> +          used to reach the required code. */
>> +
>> +       /* Store the 16 bytes to dst and align dst for further
>> +          operations; several bytes will be stored at this
>> +          address once more.  */
>> +       str     C_q, [dst], #16
>> +       ldp     F_q, G_q, [src], #32
>> +       bic     dst, dst, 15
>> +       adr     tmp2, L(load_and_merge)
>> +       add     tmp2, tmp2, tmp1, LSL 7
>> +       sub     tmp2, tmp2, 128
>> +       br      tmp2
>
> Anton,
>
> As far as the actual code, I think my only concern is this use of a
> 'computed goto' to jump to one of the extract sections.  It seems very
> brittle since a change in the alignment of the various sections or a
> change in the size of those sections could mess up this jump.  Would
> the code be any slower if you used a jump table instead of a computed
> goto?
Well, I don't think this is too brittle, as all the chunks are
identical and any change will be reflected in all of them.
And since the address gets computed in only one place, I don't
think this is going to break accidentally.
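
For illustration only, the 128-byte layout assumption could even be made
explicit in the source, along these lines (hypothetical labels and padding
directives, not taken from the patch):

        /* Each chunk starts on a 128-byte boundary; the .org after a
           chunk body makes the build fail if that chunk ever grows past
           128 bytes, so the base + tmp1 * 128 - 128 computation cannot
           silently go wrong.  */
        .p2align 7
L(ext_chunk_1):
        /* ... aligned loads, ext merge with index 1, aligned stores ... */
        .org    L(ext_chunk_1) + 128
        .p2align 7
L(ext_chunk_2):
        /* ... same body with ext index 2 ... */
        .org    L(ext_chunk_2) + 128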

That said, switching to a jump table will not slow the code down:
the dispatch is a one-time branch, and the loop where all the
processing takes place stays the same. OK, I will redo it with the
table.
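
Roughly, I would expect the dispatch to look like the sketch below (label
names and the tmp3 scratch register are placeholders, not the patch's
actual names):

        /* Jump table indexed by tmp1 (1..15).  The table stores 32-bit
           offsets relative to its own start so the code stays position
           independent.  */
        adr     tmp2, L(ext_table)
        ldrsw   tmp3, [tmp2, tmp1, LSL 2]  /* offset for this index */
        add     tmp2, tmp2, tmp3
        br      tmp2

        .p2align 2
L(ext_table):
        .word   0                          /* index 0 unused here */
        .word   L(ext_chunk_1) - L(ext_table)
        .word   L(ext_chunk_2) - L(ext_table)
        /* ... entries 3 through 14 ... */
        .word   L(ext_chunk_15) - L(ext_table)

The table adds one load to the dispatch, but since it only runs once per
call it should not be measurable.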


> The rest of my comments are just some minor suggestions to improve the
> comments or fix some typos.
>
>
>> +/* Copies are split into 3 main cases: small copies of up to 16 bytes,
>> +   medium copies of 17..96 bytes which are fully unrolled. Large copies
>> +   of more than 96 bytes align the destination and use an unrolled loop
>> +   processing 64 bytes per iteration.
>> +   The current optimized memcpy implementation is not compatible with
>> +   memmove and is separated from it completely. See below.
>> +   Overlapping large forward memmoves use a loop that copies backwards.
>> +*/
>
> Since this comment is in front of memmove (which is now completely
> separate from memcpy), it should probably just talk about memmove and
> then you can have a separate comment in front of memcpy which may
> duplicate some of this explanation.  Including the line about memcpy
> being incompatible with memmove is still appropriate.  So instead of
> 'Copies are split' have 'Moves are split', etc.

OK.

>> +/* memcpy implementation below is not compatible with memmove
>> +   because of pipelined loads/stores, which are faster, but they
>> +   can't be used in the case of overlapping memmove arrays */
>
> Expand this by copying some of the text from memmove but still
> include the text about why it is not compatible with memmove.

OK.

>> +       /* the range of count being [65..96] becomes [65..111]
>> +          after tmp [0..15] gets added to it,
>> +          count now is <bytes-left-to-load>+48 */
>
> Start with uppercase 'T'  (The range vs. the range).
>
>> +L(dst_unaligned):
>> +       /* For the unaligned store case the code loads two
>> +          aligned chunks and then merges them using ext
>> +          instrunction. This can be up to 30% faster than
>
> instrunction -> instruction

OK, thanks.

Will do.

--
  Thanks,
  Anton

