This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.

From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
To: Xuelei Zhang <zhangxuelei4 at huawei dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "siddhesh at gotplt dot org" <siddhesh at gotplt dot org>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>, "jiangyikun at huawei dot com" <jiangyikun at huawei dot com>, "yikunkero at gmail dot com" <yikunkero at gmail dot com>
Cc: nd <nd at arm dot com>
Date: Tue, 29 Oct 2019 16:39:16 +0000
Subject: Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.
References: <20191024130626.16776-1-zhangxuelei4@huawei.com>

Hi Xuelei,

> Due to the branch prediction issue of Kunpeng processor, we found
> memset_generic has poor performance on middle sizes setting, and so
> we reconstructed the logic, expanded the loop by 3 times in set_long
> to solve the problem, even when setting below 1K sizes have benefit.

Would it not make more sense to use traditional unrolling? E.g. process
128 or 256 bytes per iteration instead of 3x 64?
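
To make the unrolling suggestion concrete, here is a minimal C sketch rather than the patch's assembly. The function name fill_unrolled128 and the 16-byte block buffer (standing in for the q0 register) are assumptions for illustration only. It is not the glibc implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with byte c,
   writing 128 bytes per loop iteration.  */
static void fill_unrolled128(unsigned char *dst, int c, size_t n)
{
    unsigned char block[16];

    assert(dst != NULL);
    memset(block, c, sizeof block);   /* stands in for the q0 register */

    /* Main loop: eight 16-byte stores and one backward branch
       per 128 bytes written.  */
    while (n >= 128) {
        memcpy(dst +   0, block, 16);
        memcpy(dst +  16, block, 16);
        memcpy(dst +  32, block, 16);
        memcpy(dst +  48, block, 16);
        memcpy(dst +  64, block, 16);
        memcpy(dst +  80, block, 16);
        memcpy(dst +  96, block, 16);
        memcpy(dst + 112, block, 16);
        dst += 128;
        n -= 128;
    }

    /* Tail: the remaining 0..127 bytes, written one at a time
       to keep the sketch short.  */
    while (n > 0) {
        *dst++ = (unsigned char)c;
        n--;
    }
}
```

Whether 128 or 256 bytes per iteration is faster on a given core would have to be measured. The sketch only shows the loop shape.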

> Another change is that DZ_ZVA seems no work when setting zero, so we
> discarded it and used set_long to set zero instead. Fewer branches and
> predictions also make the zero case have slightly improvement.

You mean DC_ZVA does not work (i.e. it is disabled by the OS) or it doesn't
give a speedup? That sounds odd...

+	cmp	count, 128
+	b.hs	L(set_long)
+
+	cmp	count, 16
+	b.lo	L(less16)

Wouldn't it make more sense to first test for the small case?

+L(set112):
+	ands	tmp1, dstin, 15
+	bne	2f

Is there really a gain in splitting out the aligned from unaligned case here?
You could either always align (which means 1 extra store) or just keep the
unaligned case (which uses fewer instructions, and will be best if already
aligned).
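
As a rough illustration of the "always align" option, the C sketch below does one possibly unaligned 16-byte store at the start, then continues from the next 16-byte boundary. The name fill_align_first and the use of memcpy in place of vector stores are assumptions for illustration. The patch itself is assembly and is laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill n >= 16 bytes without a separate
   aligned/unaligned path.  */
static void fill_align_first(unsigned char *dst, int c, size_t n)
{
    unsigned char block[16];
    unsigned char *end = dst + n;
    unsigned char *p;

    assert(n >= 16);
    memset(block, c, sizeof block);

    /* One store at the original (possibly unaligned) address.  */
    memcpy(dst, block, 16);

    /* Advance to the next 16-byte boundary.  Any bytes skipped were
       already written by the store above.  If dst was aligned, this
       just moves to dst + 16.  */
    p = (unsigned char *)(((uintptr_t)dst + 16) & ~(uintptr_t)15);

    /* Aligned 16-byte stores for the body.  */
    while (end - p >= 16) {
        memcpy(p, block, 16);
        p += 16;
    }

    /* One store ending exactly at the end covers the last 0..15 bytes,
       overlapping earlier stores where needed.  */
    memcpy(end - 16, block, 16);
}
```

The cost of always aligning is at most one overlapping store. In exchange there is no branch on the alignment of dst.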

+	tbz	count, 5, 1f
+	stp	q0, q0, [dstin, 64]
+1:	stp	q0, q0, [dstend, -32]

+	tbz	count, 5, 3f
+	stp	q0, q0, [dst, 64]
+3:	stp	q0, q0, [dstend, -48]

There is little point in branching over just 1 instruction - it's cheaper just
to execute it than to risk the misprediction (the store would then need to be
addressed relative to dstend rather than dstin).
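
The idea of replacing the conditional store with an unconditional one addressed from the end can be sketched in C as follows. The name fill_64_to_128 and the 64..128 size range are assumptions chosen for illustration. They do not reproduce the exact size classes in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill 64..128 bytes with no size-dependent
   branch.  */
static void fill_64_to_128(unsigned char *dst, int c, size_t n)
{
    unsigned char block[64];

    assert(n >= 64 && n <= 128);
    memset(block, c, sizeof block);

    /* First 64 bytes, addressed from the start (dstin).  */
    memcpy(dst, block, 64);

    /* Last 64 bytes, addressed from the end (dstend).  For n < 128
       the two stores overlap, which is harmless because both write
       the same value.  */
    memcpy(dst + n - 64, block, 64);
}
```

Every size in the range executes the same stores, so there is no branch for the predictor to get wrong.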

+1:	tbz	count, 5, 2f
+	str	q0, [dst, 32]
+	str	q0, [dst, 48]

Why not a single stp? The comment about branching over 1 instruction applies
here too.

+2:	stp	q0, q0, [dstend, -32]
+	ret

+L(set_long):
+	and	valw, valw, 255

valw is unused...

Cheers,
Wilco

References:
- [PATCH] aarch64: Optimized memset for Kunpeng processor.
  - From: Xuelei Zhang