This is the mail archive of the libc-alpha at sourceware dot org
mailing list for the glibc project.
Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: 'GNU C Library' <libc-alpha at sourceware dot org>, "zhangxuelei4 at huawei dot com" <zhangxuelei4 at huawei dot com>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>
- Cc: nd <nd at arm dot com>
- Date: Tue, 15 Oct 2019 13:26:31 +0000
- Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
> i'd like to work on a generic memcpy that's acceptable
> instead of minor variations of memcpy per uarch, i'll
> have to take a look why this one is different from all
> the others.
Yes, it seems the key thing we need is a generic Q-register memcpy.
> it would be nice to see the memcpy-random benchmarks too.
> stopping prefetch at 2k is surprising.
Also, it's odd that it's based on the ThunderX2 variant rather than the Falkor
one, particularly since misaligned accesses are cheap for large copies.
Looking briefly at the memcpy data, the Falkor results are typically
faster for large copy sizes, e.g. from 512 KB to 4 MB.
On the other hand, the memmove results for Kunpeng look genuinely faster than
the existing implementations, and that is without prefetching or special code to
handle unaligned cases. That suggests to me that prefetching and special
unaligned handling don't help much, and all we need is code that does
Q-register copies/moves (clearly using LDP/STP, as that is where the memmove
seems to win).
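
To make the idea concrete, here is a rough portable C sketch of a move built only from 32-byte pair loads and stores, with no prefetching and no special alignment handling. It is not the Kunpeng patch or any glibc implementation. The names q16, load_pair, store_pair and sketch_memmove are hypothetical, and the 16-byte struct merely stands in for a Q register. Whether a compiler turns the pairs into LDP/STP of Q registers is an assumption that depends on the target and optimization level.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block standing in for one Q register. */
typedef struct { unsigned char b[16]; } q16;

/* Load 32 bytes as a pair of blocks, analogous to LDP of two Q registers. */
static inline void load_pair(const unsigned char *p, q16 *a, q16 *b)
{
    memcpy(a, p, 16);
    memcpy(b, p + 16, 16);
}

/* Store 32 bytes as a pair of blocks, analogous to STP of two Q registers. */
static inline void store_pair(unsigned char *p, const q16 *a, const q16 *b)
{
    memcpy(p, a, 16);
    memcpy(p + 16, b, 16);
}

/* Overlap-safe move: each 32-byte pair is fully loaded before it is stored,
   and the copy direction is chosen so unread source bytes are never
   overwritten. No prefetch, no alignment fix-up. */
void *sketch_memmove(void *dstv, const void *srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    q16 a, b;

    if (dst == src || n == 0)
        return dstv;

    /* Unsigned difference >= n means dst is below src or past its end,
       so a forward copy is safe. */
    if ((uintptr_t)dst - (uintptr_t)src >= n) {
        while (n >= 32) {
            load_pair(src, &a, &b);
            store_pair(dst, &a, &b);
            src += 32;
            dst += 32;
            n -= 32;
        }
        while (n--)
            *dst++ = *src++;
    } else {
        /* dst overlaps the upper part of src: copy from the end backwards. */
        while (n >= 32) {
            n -= 32;
            load_pair(src + n, &a, &b);
            store_pair(dst + n, &a, &b);
        }
        while (n--)
            dst[n] = src[n];
    }
    return dstv;
}
```

A real implementation would handle small sizes and the tail with overlapping pair accesses rather than byte loops. The byte loops are used here only to keep the sketch short.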