This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>
- Date: Tue, 7 May 2019 15:07:37 +0000
- Subject: Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- References: <AM6PR08MB5078AC4DC3EB3DDD09A5960583280@AM6PR08MB5078.eurprd08.prod.outlook.com> <20190415145311.GA14156@bell-sw.com> <AM6PR08MB5078F0BC2FA04013A2C5AE9283240@AM6PR08MB5078.eurprd08.prod.outlook.com> <eb76ffed-92db-59c0-015d-aea36530fc8c@bell-sw.com> <DB6PR0801MB21189BE4DA5751E84563DE09833B0@DB6PR0801MB2118.eurprd08.prod.outlook.com> <5CCBFDA0.8020405@bell-sw.com>
Hi Anton,
>> The loop is used both for small copies and for the tail of
>> very large copies. The small copies might well be in the
>> cache, while the large copies are prefetched.
>
> Technically you are right; perhaps I was not clear enough.
> What I meant was that loads/stores have several times the
> latency of branches. So, even if the data are in the cache,
> the branches can be processed while the loads/stores are
> still in the pipeline.
OK, but if you're assuming the latency is high, then what benefit
does unrolling give here? If the loop is 16-byte aligned, instruction
fetch should already be optimal.
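
For concreteness, the two shapes under discussion look roughly like
this in C (purely illustrative, not the actual code, which is
hand-written assembly using the 128-bit q registers; the function
names are made up):

#include <stddef.h>
#include <string.h>

/* Plain loop: one 16-byte block per iteration.  The point above is
   that the compare-and-branch can be handled while the load/store
   are still in flight, so the branch is not the bottleneck.  */
static void
copy_fwd_16 (unsigned char *dst, const unsigned char *src, size_t count)
{
  while (count >= 16)
    {
      memcpy (dst, src, 16);
      dst += 16;
      src += 16;
      count -= 16;
    }
}

/* 2x unrolled: 32 bytes per iteration and half as many branches,
   which only pays off if the loop overhead actually matters.  */
static void
copy_fwd_32 (unsigned char *dst, const unsigned char *src, size_t count)
{
  while (count >= 32)
    {
      memcpy (dst, src, 16);
      memcpy (dst + 16, src + 16, 16);
      dst += 32;
      src += 32;
      count -= 32;
    }
}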
>> It's reasonable for now (and Szabolcs already approved your
>> latest version). But it is feasible to improve further given that the
>> memmove loop does 64 bytes per iteration, so if that is fast enough
>> then that may be a simpler way to handle this loop too.
> OK, I will see what I can do.
>
> BTW, the loop in question does 128 bytes per iteration, doesn't it?
The memmove loop always does 64 bytes per iteration for the
overlap case:
+L(move_long):
...
+ .p2align 4
+1:
+ subs count, count, 64
+ stp A_q, B_q, [dstend, -32]
+ ldp A_q, B_q, [srcend, -32]
+ stp C_q, D_q, [dstend, -64]!
+ ldp C_q, D_q, [srcend, -64]!
+ b.hi 1b
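
In C terms that loop is roughly the following (just a sketch: it
ignores the software pipelining, i.e. that each stp stores the quads
loaded on the previous iteration, and it ignores the handling of the
final <= 64 bytes after the loop; the function name is made up):

#include <stddef.h>
#include <string.h>

static void
move_long_tail (unsigned char *dstend, const unsigned char *srcend,
                size_t count)
{
  unsigned char q[64];               /* stand-in for A_q..D_q        */

  while (count > 64)
    {
      memcpy (q, srcend - 64, 64);   /* ldp A_q..D_q, [srcend, -64]! */
      memcpy (dstend - 64, q, 64);   /* stp A_q..D_q, [dstend, -64]! */
      srcend -= 64;
      dstend -= 64;
      count -= 64;
    }
  /* The remaining (at most 64) bytes are stored after the loop.  */
}

Reading each 64-byte chunk completely before writing it is what makes
this safe for the overlapping (dst above src) case.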
Wilco