This is the mail archive of the libc-alpha at sourceware dot org
mailing list for the glibc project.
Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: "Zhangxuelei (Derek)" <zhangxuelei4 at huawei dot com>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "yikunkero at gmail dot com" <yikunkero at gmail dot com>, jiangyikun <jiangyikun at huawei dot com>
- Cc: nd <nd at arm dot com>
- Date: Tue, 29 Oct 2019 15:34:34 +0000
- Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
- References: <8DC571DDDE171B4094D3D33E9685917BD854D1@DGGEMI529-MBX.china.huawei.com>
>> Well these results show a very significant 4% win for Falkor memcpy! It seems strange to only optimize
>> for large sizes when the vast majority of copies in real code are very small (note the distribution of the
>> sizes and alignment for the random benchmark comes from SPEC).
> Sure, we agree the Falkor memcpy has a 4% win on small sizes. However, when we started optimizing memcpy
> for Kunpeng, one of the most important cases was the database case, which really needs more improvement
> on large sizes.
Being fast on small copies is not mutually exclusive with being fast on large copies - there are different
code paths for these cases. I'd find it hard to believe databases do huge memcopies, that would be stupid!
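
The point about separate code paths can be illustrated with a minimal sketch. This is not the glibc, Falkor, or Kunpeng implementation; `toy_copy`, `SMALL_COPY_MAX`, and the 64-byte block size are hypothetical names and values chosen only for illustration. It shows that the small-size path and the large-size path are selected by one early branch and can then be tuned independently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold, chosen only for illustration. */
#define SMALL_COPY_MAX 16

enum copy_path { PATH_SMALL, PATH_LARGE };

/* Toy copy routine: reports which path handled the request. */
static enum copy_path toy_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n <= SMALL_COPY_MAX) {
        /* Small path: short byte loop, no setup cost. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        return PATH_SMALL;
    }

    /* Large path: bulk loop over 64-byte blocks, then the tail. */
    size_t i = 0;
    for (; i + 64 <= n; i += 64)
        memcpy(d + i, s + i, 64);
    memcpy(d + i, s + i, n - i);
    return PATH_LARGE;
}
```

Changing the bulk loop in this sketch leaves the small path untouched, which is the sense in which the two optimizations do not conflict.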
> What confuses us now is that we removed the dst_unaligned code in memcpy according to the previous comments,
> and testing showed this did not affect performance in the memcpy cases. But when the memmove function is used and
> enters the memcpy part, the unaligned cases are significantly slower than the aligned cases, according to the results of the first
> half of memmove-walk shown at the bottom. So do you think we should still remove the dst_unaligned code?
Well it seems to me the issue is related to prefetching/caching. memcpy-walk walks both src and dst backwards,
memmove-walk is pretty much identical for the non-overlap case but it does a forward walk on dst and a backward
one on src. There shouldn't be any performance difference between the two cases. On most microarchitectures I see
no difference between these walks, memmove-walk and memcpy-walk basically have identical performance both for
aligned and unaligned cases (alignment doesn't matter for large copies at all).
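
The two access patterns described above can be sketched as follows. This is not the actual glibc benchtests code; the function names and the single `buf_size` parameter are assumptions made for illustration, and no timing is done. The first helper walks both src and dst backwards, the second walks dst forwards while src walks backwards.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len-byte chunks, walking both src and dst
   backwards from the end of their buffers. Returns the number of copies. */
static size_t walk_both_backward(char *dst, const char *src,
                                 size_t buf_size, size_t len)
{
    size_t copies = 0;
    if (len == 0)
        return 0;
    for (size_t off = buf_size; off >= len; off -= len) {
        memcpy(dst + off - len, src + off - len, len);
        copies++;
    }
    return copies;
}

/* Hypothetical helper: dst walks forwards from the start of its buffer
   while src walks backwards from the end. The buffers do not overlap. */
static size_t walk_dst_forward_src_backward(char *dst, const char *src,
                                            size_t buf_size, size_t len)
{
    size_t copies = 0;
    size_t dst_off = 0;
    if (len == 0)
        return 0;
    for (size_t src_off = buf_size; src_off >= len;
         src_off -= len, dst_off += len) {
        memmove(dst + dst_off, src + src_off - len, len);
        copies++;
    }
    return copies;
}
```

Both helpers issue the same number of copies of the same size over the same amount of memory; only the direction of the dst walk differs, so any performance gap between them comes from how the memory system handles the access pattern, not from the amount of work.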
> Our analysis is that there are more checks at the beginning of memmove, which may weaken the processor's ability to handle
> this case, and so dst_unaligned makes a difference.
The extra instructions at the start couldn't possibly make a difference apart from slowing down the small memmoves.