This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] aarch64: optimized memcpy implementation for thunderx2


On 01/10/18 23:42, Steve Ellcey wrote:
> On Mon, 2018-10-01 at 19:22 +0300, Anton Youdkevitch wrote:
>> +L(dst_unaligned):
>> +       /* For the unaligned store case the code loads two
>> +          aligned chunks and then merges them using the ext
>> +          instruction. This can be up to 30% faster than
>> +          a simple unaligned store access.
>> +
>> +          Current state: tmp1 = dst % 16; C_q, D_q, E_q
>> +          contain data yet to be stored. src and dst point
>> +          to the next-to-be-processed data. A_q, B_q contain
>> +          data already stored before, count = bytes left to
>> +          be loaded, decremented by 64.
>> +
>> +          Control is passed here if at least 64 bytes are
>> +          left to be loaded. The code does two aligned loads
>> +          and then extracts (16-tmp1) bytes from the first
>> +          register and tmp1 bytes from the next register,
>> +          forming the value for the aligned store.
>> +
>> +          As the ext instruction can only have its index
>> +          encoded as an immediate, 15 code chunks process
>> +          each possible index value. A computed goto is
>> +          used to reach the required code. */
>> +
>> +       /* Store the 16 bytes to dst and align dst for further
>> +          operations; several bytes will be stored at this
>> +          address once more.  */
>> +       str     C_q, [dst], #16
>> +       ldp     F_q, G_q, [src], #32
>> +       bic     dst, dst, 15
>> +       adr     tmp2, L(load_and_merge)
>> +       add     tmp2, tmp2, tmp1, LSL 7
>> +       sub     tmp2, tmp2, 128
>> +       br      tmp2
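
The aligned-load-and-merge idea in the quoted comment can be modeled in plain C. The helper below is hypothetical (not part of the patch); it reproduces the byte selection of `ext Vd, Vn, Vm, #off`: the top (16-off) bytes of the first aligned chunk followed by the low `off` bytes of the next one. Whether `off` maps to `tmp1` or `16-tmp1` depends on register ordering in the real code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the "two aligned loads + ext" merge: build the 16 bytes
   that live at a source address misaligned by `off` (1..15) from
   two consecutive 16-byte aligned chunks.  */
static void merge16(uint8_t out[16], const uint8_t lo[16],
                    const uint8_t hi[16], unsigned off)
{
    memcpy(out, lo + off, 16 - off);    /* bytes off..15 of chunk 0 */
    memcpy(out + (16 - off), hi, off);  /* bytes 0..off-1 of chunk 1 */
}
```

Because `off` is a byte index into a vector register, the hardware `ext` needs it as an immediate, which is exactly why the patch emits 15 specialized code chunks instead of one parameterized loop body.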
> 
> Anton,
> 
> As far as the actual code, I think my only concern is this use of a
> 'computed goto' to jump to one of the extract sections.  It seems very
> brittle since a change in the alignment of the various sections or a
> change in the size of those sections could mess up this jump.  Would
> the code be any slower if you used a jump table instead of a computed
> goto?

is the 16-byte alignment really needed (i.e. is 8-byte not enough)?
the code is fairly big with 16 alignment cases.
the indirect jump may be difficult to predict in real workloads.
otherwise the computed jump is acceptable, just document how
many instructions one entry can have at most (32?) so it's less
brittle in case somebody tries to modify the code.

the difference seems significant, so if you are happy with the
code i will accept it.
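
For comparison, the jump-table alternative Steve raises can be sketched in C using GCC's labels-as-values extension (all names here are hypothetical). Dispatch goes through an array of label addresses, so it keeps working even if one case grows, unlike the patch's computed goto, which requires every chunk to fit in a fixed 128-byte slot (`tmp1, LSL 7`, i.e. at most 32 A64 instructions per entry).

```c
#include <assert.h>

/* Jump-table dispatch sketch (GCC labels-as-values extension).
   Each case stands in for one of the per-index ext code chunks;
   the table absorbs any difference in chunk sizes.  */
static int dispatch(unsigned idx)
{
    static void *const table[] = { &&case1, &&case2, &&case3 };
    goto *table[idx];
case1: return 1;   /* stands in for the ext #1 chunk */
case2: return 2;   /* stands in for the ext #2 chunk */
case3: return 3;   /* stands in for the ext #3 chunk */
}
```

The trade-off is one extra memory load per dispatch to fetch the target address, versus the computed goto's pure address arithmetic; both are indirect branches, so the prediction concern above applies equally to either form.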
