This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


RE: framebuffer corruption due to overlapping stp instructions on arm64

From: Mikulas Patocka <mpatocka at redhat dot com>
To: David Laight <David dot Laight at ACULAB dot COM>
Cc: "'Ard Biesheuvel'" <ard dot biesheuvel at linaro dot org>, Ramana Radhakrishnan <ramana dot gcc at googlemail dot com>, Florian Weimer <fweimer at redhat dot com>, Thomas Petazzoni <thomas dot petazzoni at free-electrons dot com>, GNU C Library <libc-alpha at sourceware dot org>, Andrew Pinski <pinskia at gmail dot com>, Catalin Marinas <catalin dot marinas at arm dot com>, Will Deacon <will dot deacon at arm dot com>, Russell King <linux at armlinux dot org dot uk>, LKML <linux-kernel at vger dot kernel dot org>, linux-arm-kernel <linux-arm-kernel at lists dot infradead dot org>
Date: Tue, 7 Aug 2018 10:07:28 -0400 (EDT)
Subject: RE: framebuffer corruption due to overlapping stp instructions on arm64
References: <alpine.LRH.2.02.1808021242320.31834@file01.intranet.prod.int.rdu2.redhat.com> <CA+=Sn1mWkjuwVnjw6OWWUM=UcP76bdFa680FebCseewHfx3NpA@mail.gmail.com> <9acdacdb-3bd5-b71a-3003-e48132ee1371@redhat.com> <CAJA7tRZbmnZq7RfvQeYEy_a1ZObWqpFpVdvgsXgsioQ3RyPMuA@mail.gmail.com> <CAKv+Gu97QvwoLLK_zueiA_gjg_4Q5cqU4YVUyHUVFFfffdyJaw@mail.gmail.com> <f696ebe8605840e3bb04bb78b60a6cfa@AcuMS.aculab.com> <alpine.LRH.2.02.1808030759480.12341@file01.intranet.prod.int.rdu2.redhat.com> <a1564e8d091648bcad9b5ec58ab6cc95@AcuMS.aculab.com> <alpine.LRH.2.02.1808051018360.23136@file01.intranet.prod.int.rdu2.redhat.com> <51a6c4e102ad4193b3f42498f0ff11a4@AcuMS.aculab.com>


On Mon, 6 Aug 2018, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 05 August 2018 15:36
> > To: David Laight
> ...
> > There's an instruction movntdqa (and vmovntdqa) that can actually do
> > prefetch on write-combining memory type. It's the only instruction that
> > can do it.
> > 
> > If this instruction is used on a non-write-combining memory type, it behaves
> > like movdqa.
> > 
> ...
> > I benchmarked it on a processor with ERMS - for writes to the framebuffer,
> > there's no difference between memcpy, 8-byte writes, rep stosb, rep stosq,
> > mmx, sse, avx - all these methods achieve 16-17 GB/s
> 
> The combination of write-combining, posted writes and a fast PCIe slave
> are probably why there is little difference.
> 
> > For reading from the framebuffer:
> >  323 MB/s - memcpy (using avx2)
> >   91 MB/s - explicit 8-byte reads
> >  249 MB/s - rep movsq
> >  307 MB/s - rep movsb
> 
> You must be getting the ERMS hardware optimised 'rep movsb'.
> 
> >   90 MB/s - mmx
> >  176 MB/s - sse
> > 4750 MB/s - sse movntdqa
> >  330 MB/s - avx
> 
> avx512 is probably faster still.
> 
> > 5369 MB/s - avx vmovntdqa
> > 
> > So - it may make sense to introduce a function memcpy_from_framebuffer()
> > that uses movntdqa or vmovntdqa on CPUs that support it.
> 
> For kernel space it ought to be just memcpy_fromio().

I meant for userspace. Unaccelerated scrolling is still painfully slow 
even on modern computers because framebuffer reads are slow. If glibc 
provided a function memcpy_from_framebuffer() that used movntdqa and the 
fbdev Xorg driver used it, it would help users who run unaccelerated 
drivers for some reason.

> Can you easily repeat the tests using a non-write-combining map of the
> same PCIe slave?

I mapped the framebuffer as uncached and these are the results:

reading from the framebuffer:
318 MB/s - memcpy
 74 MB/s - explicit 8-byte reads
 73 MB/s - rep movsq
 11 MB/s - rep movsb
 87 MB/s - mmx
173 MB/s - sse
173 MB/s - sse movntdqa
323 MB/s - avx
284 MB/s - avx vmovntdqa

zeroing the framebuffer:
 19 MB/s - memset
154 MB/s - explicit 8-byte writes
152 MB/s - rep stosq
 19 MB/s - rep stosb
152 MB/s - mmx
306 MB/s - sse
621 MB/s - avx

copying data to the framebuffer:
618 MB/s - memcpy (using avx2)
152 MB/s - explicit 8-byte writes
139 MB/s - rep movsq
 17 MB/s - rep movsb
154 MB/s - mmx
305 MB/s - sse
306 MB/s - sse movntdqa
619 MB/s - avx
619 MB/s - avx vmovntdqa

> I can probably run the same measurements against our rather leisurely
> FPGA based PCIe slave.
> IIRC PCIe reads happen every 128 clocks of the card's 62.5MHz clock,
> increasing the size of the registers makes a significant difference.
> I've not tried mapping write-combining and using (v)movntdqa.
> I'm not sure what effect write-combining would have if the whole BAR
> were mapped that way - so I'll either have to map the physical addresses
> twice or add in another BAR.
> 
> 	David
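
As a back-of-the-envelope reading of the figures quoted above: if one read 
completes every 128 cycles of a 62.5 MHz clock, the read rate is fixed and 
throughput scales linearly with the register width. The sketch below only 
does that arithmetic. The helper name is made up, and the assumption that 
each read transfers exactly one register's worth of data with no other 
overhead is mine, not a measurement.

```c
#include <stdint.h>

/* Hypothetical helper: bytes per second when one read of bytes_per_read
   bytes completes every clocks_per_read cycles of a clock_hz clock. */
static uint64_t read_throughput_bytes_per_sec(uint64_t clock_hz,
                                              uint64_t clocks_per_read,
                                              uint64_t bytes_per_read)
{
    return clock_hz * bytes_per_read / clocks_per_read;
}
```

Under that assumption, 8-byte reads would give about 3.9 MB/s, 16-byte reads 
about 7.8 MB/s, and 32-byte reads about 15.6 MB/s.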

Mikulas

