[PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
naohirot@fujitsu.com
Thu Apr 15 12:20:58 GMT 2021
Hi Wilco-san,
Thank you for the detailed technical review!
We now have several topics to discuss,
so let me focus on BTI in this mail. I'll answer the other topics in a later mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>
> Thanks for the comprehensive reply, especially the graphs are quite useful!
> (I'd avoid adding generic_memcpy/memmove though since those are unoptimized
> C implementations).
OK, I'll withdraw the patch from the A64FX patch V2.
> > For small/medium copies, I needed to remove BTI macro from ASM ENTRY
> > in order to see the distinct performance difference between ASIMD and SVE.
> > I'll post the patch [14] with the A64FX second patch.
>
> I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless.
> We always emit it so that it works seamlessly when BTI is enabled.
Yes, I observed that just "hint #0x22" is inserted.
The benchtest results show that, for sizes less than 100B, A64FX is slower than
ASIMD with BTI but faster than ASIMD without BTI.
Also, at 512B, A64FX is 4Gbps/sec slower with BTI than without BTI.
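For reference, "hint #0x22" is the instruction that assemblers print as "bti c". The sketch below only illustrates how the hint immediate maps to the 32-bit instruction word. The helper name a64_hint_encoding and the macro A64_HINT_BASE are made up for this example and are not part of glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Encoding of "hint #0", i.e. a plain NOP, on AArch64. */
#define A64_HINT_BASE 0xd503201fu

/* Hypothetical helper: encode "hint #imm".
   The 7-bit immediate (CRm:op2) occupies bits [11:5] of the word. */
static uint32_t a64_hint_encoding(unsigned imm)
{
    assert(imm < 128);
    return A64_HINT_BASE | ((uint32_t)imm << 5);
}
```

With this, a64_hint_encoding(0x22) gives the word for "bti c". A core without BTI support executes that word as a NOP, but it still occupies an instruction slot at the function entry.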
With BTI, source code [4]
[1] https://drive.google.com/file/d/1LlyQOq7qT4d0-54uVzUtYMMMDgIiddEj/view
[2] https://drive.google.com/file/d/1C2pl-Iz_-18mkpuQTk1PhEHKsd5x0wWo/view
[3] https://drive.google.com/file/d/1eg_p1_b619KN7XLmOpxqcoI3c9o4WXd-/view
[4] https://github.com/NaohiroTamura/glibc/commit/0f45fff654d7a31b58e5d6f4dbfa31d6586f8cc2
Without BTI, source code [8]
[5] https://drive.google.com/file/d/1Mf7wxwgGb5yYBJo1eUxqvjrkp9O4EVVJ/view
[6] https://drive.google.com/file/d/1rgfFmWsM4Q3oDK8aYa_GjEQWttS0pOBF/view
[7] https://drive.google.com/file/d/1hF7oevP-MERrQ04yajtEUY8CSWe8V2EX/view
[8] https://github.com/NaohiroTamura/glibc/commit/c204a74971b3d34680964bc52ac59264b14527e3
I ran the same test on ThunderX2, and the results showed very little difference
with and without BTI, as you mentioned.
So if the distinct degradation happens only on A64FX, I'd like to add another
ENTRY macro to sysdeps/aarch64/sysdep.h, such as:
#define ENTRY_ALIGN_NO_BTI(name, align) \
  .globl C_SYMBOL_NAME(name); \
  .type C_SYMBOL_NAME(name),%function; \
  .p2align align; \
  C_LABEL(name) \
  cfi_startproc; \
  CALL_MCOUNT
Or I'd like to change memcpy_a64fx.S and memset_a64fx.S so that they don't use the ENTRY macro, such as:
	.globl __memcpy_a64fx
	.type __memcpy_a64fx, %function
	.p2align 6
__memcpy_a64fx:
	cfi_startproc
	CALL_MCOUNT
What do you think?
> > And also somehow on A64FX as well as on ThunderX2 machine,
> > memcpy-random didn't start due to mprotect error.
>
> Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra
> space, so changing +4096 into getpagesize () will work.
OK, I've already applied it [8].
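To illustrate why the fixed +4096 matters, here is a minimal sketch, not the benchtest code itself. mprotect requires a page-aligned address, so an offset built from a hard-coded 4096 is only aligned when the page size is 4096. The helper names and the 512KiB base size in the note below are assumptions for illustration. sysconf (_SC_PAGESIZE) is used as an equivalent of getpagesize ().

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Page size of the running system (same value as getpagesize ()). */
static size_t system_page_size(void)
{
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Nonzero if offset is a multiple of page; page must be a power of two. */
static int is_page_aligned(size_t offset, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (offset & (page - 1)) == 0;
}

/* Hypothetical "before" layout: base size plus a hard-coded 4096. */
static size_t offset_with_fixed_pad(size_t base)
{
    return base + 4096;
}

/* Hypothetical "after" layout: base size plus one real page. */
static size_t offset_with_page_pad(size_t base, size_t page)
{
    return base + page;
}
```

For example, with a 64KiB page and a 512KiB base, offset_with_fixed_pad gives an offset that is not page aligned, while offset_with_page_pad stays aligned. On a 4KiB-page system both are aligned, which would explain why the problem does not show up there.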
Thanks!
Naohiro