This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] aarch64: optimized memcpy implementation for thunderx2

On Mon, 2018-10-01 at 19:22 +0300, Anton Youdkevitch wrote:
> +L(dst_unaligned):
> +       /* For the unaligned store case the code loads two
> +          aligned chunks and then merges them using ext
> +          instrunction. This can be up to 30% faster than
> +          the the simple unaligned store access.
> +
> +          Current state: tmp1 = dst % 16; C_q, D_q, E_q
> +          contains data yet to be stored. src and dst points
> +          to next-to-be-processed data. A_q, B_q contains
> +          data already stored before, count = bytes left to
> +          be load decremented by 64.
> +
> +          The control is passed here if at least 64 bytes left
> +          to be loaded. The code does two aligned loads and then
> +          extracts (16-tmp1) bytes from the first register and
> +          tmp1 bytes from the next register forming the value
> +          for the aligned store.
> +
> +          As ext instruction can only have it's index encoded
> +          as immediate. 15 code chunks process each possible
> +          index value. Computed goto is used to reach the
> +          required code. */
> +
> +       /* Store the 16 bytes to dst and align dst for further
> +          operations, several bytes will be stored at this
> +          address once more */
> +       str     C_q, [dst], #16
> +       ldp     F_q, G_q, [src], #32
> +       bic     dst, dst, 15
> +       adr     tmp2, L(load_and_merge)
> +       add     tmp2, tmp2, tmp1, LSL 7
> +       sub     tmp2, tmp2, 128
> +       br      tmp2
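For readers following along, the merge the quoted comment describes can be sketched in C roughly as follows. The function name `merge_chunks` is mine, not from the patch; it mirrors the byte selection of `ext` with immediate `tmp1` on a little-endian layout, where `tmp1 = dst % 16` is in the range 1..15 on this path:

```c
#include <stdint.h>
#include <string.h>

/* C analogue of EXT Vd, Vn, Vm, #tmp1: take the top (16 - tmp1)
   bytes of the first aligned chunk, then the bottom tmp1 bytes of
   the next one, forming one 16-byte value for an aligned store. */
static void merge_chunks(const uint8_t lo[16], const uint8_t hi[16],
                         unsigned tmp1, uint8_t out[16])
{
    unsigned from_lo = 16 - tmp1;     /* (16 - tmp1) bytes from the first chunk */
    memcpy(out, lo + tmp1, from_lo);  /* lo[tmp1 .. 15] ...                     */
    memcpy(out + from_lo, hi, tmp1);  /* ... followed by hi[0 .. tmp1-1]        */
}
```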


As far as the actual code goes, I think my only concern is this use of a
'computed goto' to jump to one of the extract sections.  It seems very
brittle, since a change in the alignment of the various sections or a
change in the size of those sections could mess up this jump.  Would
the code be any slower if you used a jump table instead of a computed
goto?
The rest of my comments are just some minor suggestions to improve the
comments or fix some typos.

> +/* Copies are split into 3 main cases: small copies of up to 16 bytes,
> +   medium copies of 17..96 bytes which are fully unrolled. Large copies
> +   of more than 96 bytes align the destination and use an unrolled loop
> +   processing 64 bytes per iteration.
> +   The current optimized memcpy implementation is not compatible with
> +   memmove and is separated from it completely. See below.
> +   Overlapping large forward memmoves use a loop that copies backwards.
> +*/

Since this comment is in front of memmove (which is now completely
separate from memcpy), it should probably just talk about memmove and
then you can have a separate comment in front of memcpy which may
duplicate some of this explanation.  Including the line about memcpy
being incompatible with memmove is still appropriate.  So instead of
'Copies are split' have 'Moves are split'. etc.
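As an aside on the overlap case that comment mentions, the forward/backward decision for memmove can be sketched like this in C (illustrative only; the actual code does this with pointer arithmetic in assembly, and the function name is mine):

```c
#include <stddef.h>

/* A forward copy is unsafe exactly when dst lies inside
   [src, src + count): the copy would clobber source bytes before
   they are read.  Such large forward moves copy backwards instead. */
static int must_copy_backwards(const char *dst, const char *src,
                               size_t count)
{
    return dst > src && (size_t)(dst - src) < count;
}
```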

> +/* memcpy implementation below is not compatible with memmove
> +   because of pipelined loads/stores, which are faster, but they
> +   can't be used in the case of overlapping memmove arrays */

Expand this by copying some of the text from the memmove comment, but
still include the explanation of why it is not compatible with memmove.

> +       /* the range of count being [65..96] becomes [65..111]
> +          after tmp [0..15] gets added to it,
> +          count now is <bytes-left-to-load>+48 */

Start with uppercase 'T'  (The range vs. the range).

> +       /* For the unaligned store case the code loads two
> +          aligned chunks and then merges them using ext
> +          instrunction. This can be up to 30% faster than

instrunction -> instruction

Steve Ellcey
