This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 25/27] S390: Optimize wmemset.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 29 Jun 2015 14:05:38 +0200
- Subject: Re: [PATCH 25/27] S390: Optimize wmemset.
- Authentication-results: sourceware.org; auth=none
- References: <1435319512-22245-1-git-send-email-stli at linux dot vnet dot ibm dot com> <1435319512-22245-26-git-send-email-stli at linux dot vnet dot ibm dot com> <20150626135358 dot GE20165 at domone> <mmr2r8$rf2$2 at ger dot gmane dot org>
On Mon, Jun 29, 2015 at 11:23:52AM +0200, Stefan Liebler wrote:
> On 06/26/2015 03:53 PM, Ondřej Bílka wrote:
> >On Fri, Jun 26, 2015 at 01:51:50PM +0200, Stefan Liebler wrote:
> >>This patch provides optimized version of wmemset with the z13 vector
> >>instructions.
> >>
> >Why do you optimize wmemset but not memset?
> >
> The current memset implementation uses mvc instruction.
> It is optimized for setting one byte, which is still the preferred way
> for memset. But setting four bytes with mvc is not optimized in this
> way and thus only wmemset is optimized with vector instructions.
Why? I ran dryrun to see how often that happens. The results were a bit of
a surprise to me, as they mean I could optimize memset a bit more.
I knew you could assume that the destination is aligned to 8 bytes. Now I
also see that the size is quite likely a multiple of 4/8. See below for
raw data.
Another characteristic is that the average size is in the hundreds of
bytes, so vectorization pays off.
It looks like the best approach for memset would be to first write a few
bytes to align the destination and make the size a multiple of 4. From
there the logic would be identical to wmemset, so you could just jump to
the appropriate wmemset code.
I would use control flow like the following; the recursion is there only
because I cannot change the memset return value in C.
#include <stdint.h>
#include <wchar.h>

void *
memset (void *_x, int _c, size_t n)
{
  char *x = (char *) _x;
  unsigned char c = (unsigned char) _c;

  if (n == 0)
    return _x;

  if (__glibc_unlikely ((((uintptr_t) x) | n) % 4 != 0))
    {
      if (((uintptr_t) x) & 3)
        {
          /* Write at most three bytes to reach 4-byte alignment,
             but never more than N.  */
          while ((((uintptr_t) x) & 3) && n > 0)
            {
              *x++ = c;
              n--;
            }
          memset (x, c, n);
          return _x;
        }
      /* Set the odd byte first so the two-byte store below is
         2-byte aligned.  */
      if (n & 1)
        x[--n] = c;
      if (n & 2)
        {
          *((uint16_t *) (x + n - 2)) = (uint16_t) (c * 0x0101);
          n -= 2;
        }
      return (void *) wmemset ((wchar_t *) x, c * 0x01010101, n / 4);
    }
  else
    return (void *) wmemset ((wchar_t *) x, c * 0x01010101, n / 4);
}
replaying bash
calls 1268
average capacity 161.6
suceed: 0.0%
size % 4 == 0: 56.1%
size % 8 == 0: 49.1% success probability 0.0%
average n: 161.6 n <= 0: 0.1% n <= 4: 0.1% n <= 8: 0.1% n <= 16: 0.1% n <= 24: 0.1% n <= 32: 3.6% n <= 48: 26.0% n <= 64: 48.7%
s aligned to 4 bytes: 100.0% 8 bytes: 100.0% 16 bytes: 0.1%
average *s access cache latency 0.9 l <= 8: 94.8% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%
replaying awk
calls 59
average capacity 40.8
suceed: 0.0%
size % 4 == 0: 67.8%
size % 8 == 0: 67.8% success probability 0.0%
average n: 40.8 n <= 0: 1.7% n <= 4: 1.7% n <= 8: 5.1% n <= 16: 23.7% n <= 24: 40.7% n <= 32: 89.8% n <= 48: 89.8% n <= 64: 91.5%
s aligned to 4 bytes: 100.0% 8 bytes: 100.0% 16 bytes: 100.0%
average *s access cache latency 1.1 l <= 8: 98.3% l <= 16: 98.3% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%
replaying ssh-add
calls 149
average capacity 139.0
suceed: 0.0%
size % 4 == 0: 37.6%
size % 8 == 0: 17.4% success probability 0.0%
average n: 139.0 n <= 0: 0.7% n <= 4: 26.2% n <= 8: 35.6% n <= 16: 72.5% n <= 24: 77.9% n <= 32: 83.2% n <= 48: 87.2% n <= 64: 89.9%
s aligned to 4 bytes: 89.3% 8 bytes: 87.9% 16 bytes: 87.2%
average *s access cache latency 0.4 l <= 8: 99.3% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%
replaying ssh-keygen
calls 157
average capacity 144.8
suceed: 0.0%
size % 4 == 0: 37.6%
size % 8 == 0: 17.8% success probability 0.0%
average n: 144.8 n <= 0: 0.6% n <= 4: 25.5% n <= 8: 34.4% n <= 16: 69.4% n <= 24: 74.5% n <= 32: 79.6% n <= 48: 84.1% n <= 64: 87.3%
s aligned to 4 bytes: 89.2% 8 bytes: 87.9% 16 bytes: 87.3%
average *s access cache latency 0.3 l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%
replaying /usr/lib/gcc/x86_64-linux-gnu/5/cc1
calls 48817
average capacity 142.5
suceed: 0.0%
size % 4 == 0: 99.5%
size % 8 == 0: 87.2% success probability 0.0%
average n: 142.5 n <= 0: 2.2% n <= 4: 6.0% n <= 8: 25.7% n <= 16: 29.2% n <= 24: 32.8% n <= 32: 35.8% n <= 48: 47.6% n <= 64: 53.6%
s aligned to 4 bytes: 100.0% 8 bytes: 99.7% 16 bytes: 73.6%
average *s access cache latency 55.4 l <= 8: 78.7% l <= 16: 86.0% l <= 32: 90.8% l <= 64: 95.8% l <= 128: 97.0%
replaying as
calls 424
average capacity 8829.0
suceed: 0.0%
size % 4 == 0: 100.0%
size % 8 == 0: 89.4% success probability 0.0%
average n: 8829.0 n <= 0: 0.2% n <= 4: 10.8% n <= 8: 10.8% n <= 16: 10.8% n <= 24: 10.8% n <= 32: 10.8% n <= 48: 10.8% n <= 64: 10.8%
s aligned to 4 bytes: 100.0% 8 bytes: 94.8% 16 bytes: 79.5%
average *s access cache latency 12.1 l <= 8: 77.1% l <= 16: 92.0% l <= 32: 97.2% l <= 64: 98.1% l <= 128: 98.1%
replaying ar
calls 387
average capacity 347.8
suceed: 0.0%
size % 4 == 0: 91.5%
size % 8 == 0: 87.6% success probability 0.0%
average n: 347.8 n <= 0: 0.3% n <= 4: 0.3% n <= 8: 8.5% n <= 16: 8.8% n <= 24: 8.8% n <= 32: 8.8% n <= 48: 8.8% n <= 64: 8.8%
s aligned to 4 bytes: 91.5% 8 bytes: 91.5% 16 bytes: 57.1%
average *s access cache latency 4.6 l <= 8: 87.6% l <= 16: 92.8% l <= 32: 96.9% l <= 64: 99.0% l <= 128: 99.5%
replaying ranlib
calls 372
average capacity 368.9
suceed: 0.0%
size % 4 == 0: 95.4%
size % 8 == 0: 94.4% success probability 0.0%
average n: 368.9 n <= 0: 0.3% n <= 4: 0.3% n <= 8: 0.8% n <= 16: 5.1% n <= 24: 5.1% n <= 32: 5.1% n <= 48: 5.1% n <= 64: 5.1%
s aligned to 4 bytes: 99.2% 8 bytes: 99.2% 16 bytes: 59.4%
average *s access cache latency 5.3 l <= 8: 78.5% l <= 16: 91.4% l <= 32: 97.8% l <= 64: 99.5% l <= 128: 99.5%
replaying /usr/bin/ld
calls 1815
average capacity 754.1
suceed: 0.0%
size % 4 == 0: 91.1%
size % 8 == 0: 89.5% success probability 0.0%
average n: 754.1 n <= 0: 0.1% n <= 4: 4.6% n <= 8: 8.4% n <= 16: 9.5% n <= 24: 10.0% n <= 32: 10.8% n <= 48: 19.4% n <= 64: 19.9%
s aligned to 4 bytes: 93.6% 8 bytes: 92.3% 16 bytes: 64.4%
average *s access cache latency 9.1 l <= 8: 92.0% l <= 16: 94.5% l <= 32: 96.0% l <= 64: 96.5% l <= 128: 96.7%
replaying mutt
calls 89
average capacity 1175.9
suceed: 0.0%
size % 4 == 0: 100.0%
size % 8 == 0: 100.0% success probability 0.0%
average n: 1175.9 n <= 0: 1.1% n <= 4: 1.1% n <= 8: 1.1% n <= 16: 1.1% n <= 24: 1.1% n <= 32: 1.1% n <= 48: 1.1% n <= 64: 1.1%
s aligned to 4 bytes: 100.0% 8 bytes: 100.0% 16 bytes: 100.0%
average *s access cache latency 61.0 l <= 8: 12.4% l <= 16: 42.7% l <= 32: 76.4% l <= 64: 77.5% l <= 128: 78.7%
replaying mc
calls 2650
average capacity 175.1
suceed: 0.0%
size % 4 == 0: 87.3%
size % 8 == 0: 83.2% success probability 0.0%
average n: 175.1 n <= 0: 0.0% n <= 4: 6.5% n <= 8: 9.9% n <= 16: 13.5% n <= 24: 13.5% n <= 32: 39.8% n <= 48: 59.8% n <= 64: 86.8%
s aligned to 4 bytes: 94.7% 8 bytes: 93.7% 16 bytes: 93.4%
average *s access cache latency 6.9 l <= 8: 90.5% l <= 16: 94.9% l <= 32: 97.5% l <= 64: 97.6% l <= 128: 97.6%
replaying gawk
calls 409
average capacity 36.9
suceed: 0.0%
size % 4 == 0: 98.0%
size % 8 == 0: 98.0% success probability 0.0%
average n: 36.9 n <= 0: 0.2% n <= 4: 0.2% n <= 8: 0.5% n <= 16: 2.7% n <= 24: 2.7% n <= 32: 95.8% n <= 48: 95.8% n <= 64: 96.1%
s aligned to 4 bytes: 100.0% 8 bytes: 100.0% 16 bytes: 100.0%
average *s access cache latency 0.4 l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%