This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Xuelei Zhang <zhangxuelei4 at huawei dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "siddhesh at gotplt dot org" <siddhesh at gotplt dot org>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>, "jiangyikun at huawei dot com" <jiangyikun at huawei dot com>, "yikunkero at gmail dot com" <yikunkero at gmail dot com>
- Cc: nd <nd at arm dot com>
- Date: Tue, 29 Oct 2019 16:39:16 +0000
- Subject: Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.
- References: <20191024130626.16776-1-zhangxuelei4@huawei.com>
Hi Xuelei,
> Due to the branch prediction issue of Kunpeng processor, we found
> memset_generic has poor performance on middle sizes setting, and so
> we reconstructed the logic, expanded the loop by 3 times in set_long
> to solve the problem, even when setting below 1K sizes have benefit.
Would it not make more sense to use traditional unrolling? E.g. process
128 or 256 bytes per iteration instead of 3x 64?
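To illustrate what traditional unrolling means here, a rough C sketch of a
loop that sets 128 bytes per iteration follows. It is not code from the patch:
the name fill_unrolled128 is made up, and each 16-byte memcpy merely stands in
for one 16-byte vector store in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the patch: set n bytes at dst to c,
   handling 128 bytes per loop iteration. */
static void fill_unrolled128(unsigned char *dst, int c, size_t n)
{
    unsigned char pattern[16];

    assert(dst != NULL || n == 0);
    memset(pattern, c, sizeof pattern);

    /* Main loop: one loop branch per 128 bytes instead of one per 64. */
    while (n >= 128) {
        for (int i = 0; i < 8; i++)      /* eight 16-byte stores */
            memcpy(dst + 16 * i, pattern, 16);
        dst += 128;
        n -= 128;
    }

    /* Simple byte-wise tail; a real memset would use overlapping stores. */
    while (n > 0) {
        *dst++ = (unsigned char)c;
        n--;
    }
}
```

Whether 128 or 256 bytes per iteration is the better choice on Kunpeng would
have to be measured; the sketch only shows the loop shape.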
> Another change is that DZ_ZVA seems no work when setting zero, so we
> discarded it and used set_long to set zero instead. Fewer branches and
> predictions also make the zero case have slightly improvement.
Do you mean DC_ZVA does not work (i.e. it is disabled by the OS), or that it
doesn't give a speedup? That sounds odd...
+ cmp count, 128
+ b.hs L(set_long)
+
+ cmp count, 16
+ b.lo L(less16)
Wouldn't it make more sense to first test for the small case?
+L(set112):
+ ands tmp1, dstin, 15
+ bne 2f
Is there really a gain in splitting out the aligned from unaligned case here?
You could either always align (which means 1 extra store) or just keep the
unaligned case (which uses fewer instructions, and will be best if already
aligned).
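A rough C sketch of the "always align" option, assuming at least 16 bytes are
being set: one possibly unaligned store at the start, then 16-byte-aligned
stores, then one store that ends exactly at the end of the buffer. The name
fill_align_first is hypothetical, and memcpy of 16 bytes stands in for a
16-byte vector store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: set n >= 16 bytes at dst to c
   without testing whether dst is already aligned. */
static void fill_align_first(unsigned char *dst, int c, size_t n)
{
    unsigned char pattern[16];
    unsigned char *end = dst + n;
    unsigned char *p;

    assert(n >= 16);
    memset(pattern, c, sizeof pattern);

    /* The one extra store: covers the bytes up to the next 16-byte boundary. */
    memcpy(dst, pattern, 16);

    /* Round down to a 16-byte boundary and step past it. If dst was already
       aligned this is dst + 16, otherwise it overlaps the store above. */
    p = dst + (16 - ((uintptr_t)dst & 15));

    /* Aligned 16-byte stores while a whole block still fits. */
    while ((size_t)(end - p) >= 16) {
        memcpy(p, pattern, 16);
        p += 16;
    }

    /* Final store ends exactly at end and may overlap earlier stores. */
    memcpy(end - 16, pattern, 16);
}
```

The alternative of keeping only the unaligned path would drop the rounding
step and store from dst directly, which is fewer instructions.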
+ tbz count, 5, 1f
+ stp q0, q0, [dstin, 64]
+1: stp q0, q0, [dstend, -32]
+ tbz count, 5, 3f
+ stp q0, q0, [dst, 64]
+3: stp q0, q0, [dstend, -48]
There is little point in branching over just 1 instruction - it's cheaper to
always execute it than to risk the misprediction (the store would then need
to be addressed from dstend rather than dstin).
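As a rough C sketch of the branch-free form, assuming this path handles sizes
from 64 to 128 bytes (the quoted hunk does not show the exact range): 64 bytes
are stored from the start and 64 bytes ending at the end, with no test of the
size bits. The name fill_64_to_128 is hypothetical, and each 32-byte memcpy
stands in for one "stp q0, q0".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the patch: set 64..128 bytes at dst to c
   using four unconditional 32-byte stores. */
static void fill_64_to_128(unsigned char *dst, int c, size_t n)
{
    unsigned char pattern[32];
    unsigned char *end = dst + n;

    assert(n >= 64 && n <= 128);
    memset(pattern, c, sizeof pattern);

    memcpy(dst, pattern, 32);        /* bytes 0..31 from the start */
    memcpy(dst + 32, pattern, 32);   /* bytes 32..63 from the start */
    memcpy(end - 64, pattern, 32);   /* addressed from the end, may overlap */
    memcpy(end - 32, pattern, 32);   /* last 32 bytes */
}
```

Because the last two stores are addressed from the end, they never write past
the buffer, so they can always execute regardless of the size.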
+1: tbz count, 5, 2f
+ str q0, [dst, 32]
+ str q0, [dst, 48]
Why not use stp here? And the comment about branching over 1 instruction
applies here too.
+2: stp q0, q0, [dstend, -32]
+ ret
+L(set_long):
+ and valw, valw, 255
valw is unused...
Cheers,
Wilco