This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
RE: [PATCH][AArch64] Enable _STRING_ARCH_unaligned
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: "'Andrew Pinski'" <pinskia at gmail dot com>
- Cc: "GNU C Library" <libc-alpha at sourceware dot org>
- Date: Thu, 20 Aug 2015 17:29:18 +0100
- Subject: RE: [PATCH][AArch64] Enable _STRING_ARCH_unaligned
- Authentication-results: sourceware.org; auth=none
- References: <000101d0db53$e96233c0$bc269b40$ at com> <CA+=Sn1mEcUHtP+1JkOy+7JU6LvbDrbZDLuSEcD89GQF2OpuKDQ at mail dot gmail dot com>
> Andrew Pinski wrote:
> On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> > +
> > +/* AArch64 implementations support efficient unaligned access. */
> > +#define _STRING_ARCH_unaligned 1
>
> I don't think this is 100% true. On ThunderX, an unaligned store or
> load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> load/stores have to be replayed.
> I think we should also benchmark there to find out if this is a win
> because I doubt it is a win but I could be proved wrong.
That's bad indeed, but it would still be better than doing everything
one byte at a time. Eg. resolv/arpa/nameser.h does:
#define NS_GET32(l, cp) do { \
        const u_char *t_cp = (const u_char *)(cp); \
        (l) = ((u_int32_t)t_cp[0] << 24) \
            | ((u_int32_t)t_cp[1] << 16) \
            | ((u_int32_t)t_cp[2] << 8) \
            | ((u_int32_t)t_cp[3]) \
            ; \
        (cp) += NS_INT32SZ; \
} while (0)
With _STRING_ARCH_unaligned this becomes a single unaligned load plus a
byte swap, which should be faster even on ThunderX.
> Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
> so I can do the benchmarking on ThunderX?
I don't believe there are.
> Also I don't see any benchmark results even for any of the other
> AARCH64 processors.
It's obvious it is a big win on most of the uses of _STRING_ARCH_unaligned.
Eg. consider the hash code in crypt/md5.c:
#if !_STRING_ARCH_unaligned
      if (UNALIGNED_P (buffer))
        while (len > 64)
          {
            __md5_process_block (memcpy (ctx->buffer, buffer, 64), 64, ctx);
            buffer = (const char *) buffer + 64;
            len -= 64;
          }
      else
#endif
So basically you end up doing an extra memcpy if unaligned access is not
supported. This means you not only do the unaligned loads anyway (inside
memcpy), but you also do an extra aligned load and store to the buffer.
GLIBC's use of _STRING_ARCH_unaligned is quite messy and would benefit from
a major cleanup; however, it's quite clear that enabling it is a win overall.
Wilco