This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: optimized memcpy implementation for thunderx2


Szabolcs,

On 02.10.2018 14:21, Szabolcs Nagy wrote:
On 01/10/18 23:42, Steve Ellcey wrote:
On Mon, 2018-10-01 at 19:22 +0300, Anton Youdkevitch wrote:
+L(dst_unaligned):
+       /* For the unaligned store case the code loads two
+          aligned chunks and then merges them using the ext
+          instruction. This can be up to 30% faster than
+          the simple unaligned store access.
+
+          Current state: tmp1 = dst % 16; C_q, D_q, E_q
+          contain data yet to be stored. src and dst point
+          to next-to-be-processed data. A_q, B_q contain
+          data already stored before, count = bytes left to
+          be loaded decremented by 64.
+
+          Control is passed here if at least 64 bytes are left
+          to be loaded. The code does two aligned loads and then
+          extracts (16-tmp1) bytes from the first register and
+          tmp1 bytes from the next register, forming the value
+          for the aligned store.
+
+          As the ext instruction can only have its index encoded
+          as an immediate, 15 code chunks process each possible
+          index value. A computed goto is used to reach the
+          required code. */
+
+       /* Store the 16 bytes to dst and align dst for further
+          operations; several bytes will be stored at this
+          address once more.  */
+       str     C_q, [dst], #16
+       ldp     F_q, G_q, [src], #32
+       bic     dst, dst, 15
+       adr     tmp2, L(load_and_merge)
+       add     tmp2, tmp2, tmp1, LSL 7
+       sub     tmp2, tmp2, 128
+       br      tmp2
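
To make the merge step described in the patch comment concrete, here is a minimal C sketch of what an `ext` with a byte index does to two 16-byte registers. The function name `ext_model` and its parameters are hypothetical and only model the instruction's byte selection; this is not the patch's code and says nothing about which index the patch uses for a given dst alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of AArch64 `ext Vd.16b, Vn.16b, Vm.16b, #idx`:
   the result is bytes idx..15 of `first` followed by bytes
   0..idx-1 of `next`.  idx must be in the range 0..16 here.  */
static void
ext_model (uint8_t out[16], const uint8_t first[16],
           const uint8_t next[16], unsigned idx)
{
  memcpy (out, first + idx, 16 - idx);   /* (16 - idx) bytes from first.  */
  memcpy (out + (16 - idx), next, idx);  /* idx bytes from next.  */
}
```

Because the real instruction takes `idx` only as an immediate, a run-time index cannot be passed the way this model passes it, which is why the patch needs one code chunk per index value.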

Anton,

As for the actual code, I think my only concern is this use of a
'computed goto' to jump to one of the extract sections.  It seems very
brittle since a change in the alignment of the various sections or a
change in the size of those sections could mess up this jump.  Would
the code be any slower if you used a jump table instead of a computed
goto?
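
For illustration, the jump-table alternative could look roughly like the C sketch below. All names (`merge_fn`, `merge_1`, `merge_table`, `dispatch`) are hypothetical, and only three of the 15 entries are shown. The point is that the table stores each target explicitly, so the entries may have any size or alignment.

```c
/* Hypothetical stand-ins for the per-index merge code chunks.  */
typedef int (*merge_fn) (void);

static int merge_1 (void) { return 1; }
static int merge_2 (void) { return 2; }
static int merge_3 (void) { return 3; }

/* Entry i handles index i + 1.  The targets are looked up, not
   computed, so changing the size of one chunk cannot break the
   dispatch.  */
static const merge_fn merge_table[] = { merge_1, merge_2, merge_3 };

static int
dispatch (unsigned idx)
{
  /* The caller must ensure 1 <= idx <= 3 in this sketch.  */
  return merge_table[idx - 1] ();
}
```

The cost compared with the computed goto is one extra load of the target address before the indirect branch.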

is the 16byte alignment really needed (i.e. 8byte is not enough)?
the code is fairly big with 16 alignment cases.

Unfortunately, yes. Since the code deals with 16-byte chunks, the
best results are obtained with memory addresses that are 16-byte
aligned.

the indirect jump may be difficult to predict in real workloads.
otherwise the computed jump is acceptable, just document how
many instructions one entry can have at most (32?) so it's less
brittle in case somebody tries to modify the code.

As I answered Steve, this is probably more or less the same
performance-wise. I will change the code to use a jump table
and rerun the benchmarks (I don't expect them to be different,
though).
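
The per-entry limit asked about above follows from the address arithmetic in the quoted patch (`add tmp2, tmp2, tmp1, LSL 7` followed by `sub tmp2, tmp2, 128`). A small C sketch of that arithmetic, with hypothetical names (`ENTRY_SHIFT`, `jump_target` and so on), assuming the entry for index 1 starts exactly at the base label:

```c
#include <stdint.h>

enum
{
  ENTRY_SHIFT = 7,                  /* From `LSL 7` in the patch.  */
  ENTRY_BYTES = 1 << ENTRY_SHIFT,   /* 128 bytes per entry.  */
  INSN_BYTES = 4,                   /* A64 instructions are 4 bytes.  */
  MAX_INSNS_PER_ENTRY = ENTRY_BYTES / INSN_BYTES
};

/* Address reached by the computed goto for a given tmp1 (1..15),
   where base is the address of the first entry.  */
static uintptr_t
jump_target (uintptr_t base, unsigned tmp1)
{
  return base + ((uintptr_t) tmp1 << ENTRY_SHIFT) - ENTRY_BYTES;
}
```

With a 128-byte stride and 4-byte instructions, each entry has room for at most 32 instructions, and every entry has to start exactly 128 bytes after the previous one for the jump to land correctly.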

