This is the mail archive of the
mailing list for the glibc project.
RE: framebuffer corruption due to overlapping stp instructions on arm64
- From: David Laight <David dot Laight at ACULAB dot COM>
- To: 'Mikulas Patocka' <mpatocka at redhat dot com>
- Cc: 'Ard Biesheuvel' <ard dot biesheuvel at linaro dot org>, Ramana Radhakrishnan <ramana dot gcc at googlemail dot com>, Florian Weimer <fweimer at redhat dot com>, "Thomas Petazzoni" <thomas dot petazzoni at free-electrons dot com>, GNU C Library <libc-alpha at sourceware dot org>, Andrew Pinski <pinskia at gmail dot com>, "Catalin Marinas" <catalin dot marinas at arm dot com>, Will Deacon <will dot deacon at arm dot com>, "Russell King" <linux at armlinux dot org dot uk>, LKML <linux-kernel at vger dot kernel dot org>, linux-arm-kernel <linux-arm-kernel at lists dot infradead dot org>
- Date: Mon, 6 Aug 2018 10:18:33 +0000
- Subject: RE: framebuffer corruption due to overlapping stp instructions on arm64
From: Mikulas Patocka
> Sent: 05 August 2018 15:36
> To: David Laight
> There's an instruction movntdqa (and vmovntdqa) that can actually do
> prefetch on write-combining memory type. It's the only instruction that
> can do it.
> If this instruction is used on non-write-combining memory type, it behaves
> like movdqa.
> I benchmarked it on a processor with ERMS - for writes to the framebuffer,
> there's no difference between memcpy, 8-byte writes, rep stosb, rep stosq,
> mmx, sse, avx - all these methods achieve 16-17 GB/s
The combination of write-combining, posted writes and a fast PCIe slave
is probably why there is little difference.
> For reading from the framebuffer:
> 323 MB/s - memcpy (using avx2)
> 91 MB/s - explicit 8-byte reads
> 249 MB/s - rep movsq
> 307 MB/s - rep movsb
You must be getting the ERMS hardware optimised 'rep movsb'.
> 90 MB/s - mmx
> 176 MB/s - sse
> 4750 MB/s - sse movntdqa
> 330 MB/s - avx
avx512 is probably faster still.
> 5369 MB/s - avx vmovntdqa
> So - it may make sense to introduce a function memcpy_from_framebuffer()
> that uses movntdqa or vmovntdqa on CPUs that support it.
For kernel space it ought to be just memcpy_fromio().
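For user space, a rough sketch of what such a function might look like is
below. Only the name memcpy_from_framebuffer() comes from the suggestion
above; the helper copy_streaming_sse41(), the bytewise handling of an
unaligned head and tail, and the fallback to plain memcpy() are my own
assumptions, and any fencing that a real write-combining mapping may need
is left out. It is an illustration, not a tested framebuffer routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: copy using MOVNTDQA (SSE4.1 streaming load). */
__attribute__((target("sse4.1")))
static void copy_streaming_sse41(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* MOVNTDQA needs a 16-byte aligned source, so copy the head bytewise. */
    while (len && ((uintptr_t)s & 15)) {
        *d++ = *s++;
        len--;
    }

    /* Aligned streaming loads, unaligned stores to ordinary memory. */
    for (; len >= 16; len -= 16, s += 16, d += 16) {
        __m128i v = _mm_stream_load_si128((__m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
    }

    /* Copy whatever is left of the tail bytewise. */
    while (len--)
        *d++ = *s++;
}
#endif

/* Use the streaming path when the CPU reports SSE4.1, else plain memcpy(). */
void *memcpy_from_framebuffer(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.1")) {
        copy_streaming_sse41(dst, src, len);
        return dst;
    }
#endif
    return memcpy(dst, src, len);
}
```

On ordinary write-back memory this should behave like a normal copy, since
movntdqa then acts like movdqa; any speedup would only show up on a
write-combining mapping.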
Can you easily repeat the tests using a non-write-combining map of the
same PCIe slave?
I can probably run the same measurements against our rather leisurely
FPGA based PCIe slave.
IIRC PCIe reads happen every 128 clocks of the card's 62.5MHz clock, so
increasing the size of the registers makes a significant difference.
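As a back-of-envelope check, assuming my recollection of one read per 128
clocks is right and that each read returns exactly one register's worth of
data, the throughput scales linearly with the register size. The function
name and parameters below are purely illustrative, and the figures are
derived, not measured.

```c
/*
 * Bytes per second when one read of `bytes_per_read` bytes completes
 * every `clocks_per_read` cycles of a `clock_hz` clock.
 */
double pcie_read_bytes_per_sec(double clock_hz, unsigned clocks_per_read,
                               unsigned bytes_per_read)
{
    return clock_hz / clocks_per_read * bytes_per_read;
}
```

With a 62.5MHz clock and 128 clocks per read this gives about 3.9 MB/s for
8-byte reads, 7.8 MB/s for 16-byte reads and 15.6 MB/s for 32-byte reads.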
I've not tried mapping write-combining and using (v)movntdqa.
I'm not sure what effect write-combining would have if the whole BAR
were mapped that way - so I'll either have to map the physical addresses
twice or add in another BAR.