This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
- From: Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>
- To: Xuelei Zhang <zhangxuelei4 at huawei dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Tue, 15 Oct 2019 12:04:43 +0000
- Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
- References: <20191014034456.11548-1-zhangxuelei4@huawei.com>
On 14/10/2019 04:44, Xuelei Zhang wrote:
> This is an optimized implementation of memcpy and memmove for the
> Huawei Kunpeng processor.
>
> Based on the prefetch mechanism of the Kunpeng architecture, the
> branch handling 96 bytes to 2K in memcpy is written without the prfm
> instruction. As a result, memcpy improves for copies above 128 bytes:
> about 18% for copies above 2K bytes, and about 38% for much larger
> copies, e.g. around 32M bytes.
>
> And for memmove, there are two main changes: i) Q registers are used
> instead of X registers. ii) the dst address is aligned instead of the
> src address, to improve store performance. Hence, the memmove
> implementation also improves above 128 bytes: about 30% for 2K to 8M
> bytes, and about 50% for 32M bytes or more.
i'd like to work on a generic memcpy that's acceptable
instead of minor variations of memcpy per uarch, i'll
have to take a look at why this one is different from
all the others.
it would be nice to see the memcpy-random benchmarks too.
stopping prefetch at 2k is surprising.