- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Xuelei Zhang <zhangxuelei4 at huawei dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "siddhesh at gotplt dot org" <siddhesh at gotplt dot org>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>, "jiangyikun at huawei dot com" <jiangyikun at huawei dot com>, "yikunkero at gmail dot com" <yikunkero at gmail dot com>
- Cc: nd <nd at arm dot com>
- Date: Tue, 22 Oct 2019 18:06:29 +0000
- Subject: Re: [PATCH v2] aarch64: Optimized implementation of memcmp
- References: <20191022093827.9072-1-zhangxuelei4@huawei.com>
Hi Xuelei,
> The loop body is expanded from a 16-byte comparison to a 64-byte
> comparison, and the ldp post-index addressing mode is replaced by
> the base-plus-offset mode. Hence, memcmp is around 18% faster for
> sizes above 128 bytes.
This looks quite good - I can reproduce significant gains for large sizes
on various microarchitectures. It seems there are some regressions in
the 8-16 byte range, presumably due to handling these sizes differently.
A few comments inline:
+ /* Compare data bytes and set return value to 0, -1 or 1. */
+L(return64):
+ cmp data1, data2
bne L(return)
+L(return_pre):
mov data1, data1h
mov data2, data2h
- cmp data1, data2
L(return):
The label L(return_pre) is unused. Also, why not use 2x CSEL rather than a branch
across the moves? That will be faster since the branch is hard to predict.
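Roughly something like this (a sketch, assuming data1/data2 hold the first
pair and data1h/data2h the second, as in the patch):

	cmp	data1, data2
	csel	data1, data1, data1h, ne
	csel	data2, data2, data2h, ne
L(return):

If the first pair differs, the CSELs keep data1/data2; otherwise they select
data1h/data2h, so L(return) always sees the differing pair without a branch.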
L(less8):
adds limit, limit, 4
b.lo L(less4)
- ldr data1w, [src1], 4
- ldr data2w, [src2], 4
- cmp data1w, data2w
+ ldr data1w, [src1]
+ ldr data2w, [src2]
+ ccmp data1, data2, 0, ne
Using data1w and data2w would be better here.
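I.e. something like:

	ldr	data1w, [src1]
	ldr	data2w, [src2]
	ccmp	data1w, data2w, 0, ne

The 32-bit loads zero-extend, so comparing the X registers happens to work,
but comparing the W registers matches the width that was actually loaded.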
b.eq L(byte_loop)
- sub result, data1w, data2w
+ sub result, data1w, data2w
The formatting has gone wrong...
+ ret
+L(ret_0):
+ mov result, 0
ret