Bug 28432 - Aarch64 memcpy used on device-memory
Summary: Aarch64 memcpy used on device-memory
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: libc
Version: 2.32
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
 
Reported: 2021-10-07 12:39 UTC by Jon Nettleton
Modified: 2021-11-29 16:18 UTC
CC List: 6 users



Attachments
patch to memcpy memmove for better device-memory compatibility (769 bytes, patch)
2021-10-07 12:39 UTC, Jon Nettleton

Description Jon Nettleton 2021-10-07 12:39:28 UTC
Created attachment 13713 [details]
patch to memcpy memmove for better device-memory compatibility

This was first reported 4 years ago on the Macchiatobin when writing to the memory-mapped framebuffer of a PCIe device.  The error was narrowed down to overlapping stp's causing device memory to be zeroed out or not written at all.  There was much discussion about whether it is valid to use mem* functions on device memory mapped as uncached / write-combined.  Recently I tracked down a rendering problem on the HoneyComb LX2K to a similar failure.  Since the only similarity between the 3 SoCs is the Cortex-A72 cores (they all have different combinations of CCNs and PCIe IP), I started looking a bit more into possible causes.  I came across this documentation on how the Cortex-A72 performs ACE transfers: https://developer.arm.com/documentation/100095/0001/way1381846851421

Because I had already narrowed the failure down to unaligned memcpy's of sizes 97-110, I realized it was always the last 2 stp's of the copy96 routine that failed.  Since the ordering should not matter, I instead moved the backwards copy to happen first, which, from my understanding of the document above, allows the 4 forward-progressing stp's to be sent as a single 4x128-bit WRAP write.

This does fix the issue I was trying to solve, both in my specific test case and in a few real-world rendering failures.  Since we are only reordering stp's, there should be no functional or performance regressions, and all the glibc tests pass.
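For reference, the store pattern at issue can be sketched in C. This is a hedged approximation of what the copy96 assembly does, not glibc's actual code: the function names are invented, the 64/32 split mirrors the "4 forward stp's plus 2 stp's from the end" description above, and the exact size range handled by the real routine is simplified.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the original copy96 pattern: four forward 16-byte stores
   cover the first 64 bytes, then two stores anchored at the end of the
   buffer cover the tail, overlapping the middle (valid for 64 < n <= 96
   in this simplified model). */
static void copy96_before(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, 64);                    /* the 4 forward stp's */
    memcpy(dst + n - 32, src + n - 32, 32);  /* 2 stp's from the end, overlapping */
}

/* Sketch of the patched ordering: the overlapping tail stores are issued
   first, so the four forward stores form one contiguous, merge-friendly
   run of writes. */
static void copy96_after(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst + n - 32, src + n - 32, 32);  /* backwards copy happens first */
    memcpy(dst, src, 64);                    /* then the 4 forward stp's */
}
```

Both orders are equivalent for normal memory; the reordering only matters for the device-memory write-merging behaviour described above.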
Comment 1 Adhemerval Zanella 2021-10-07 12:59:05 UTC
Could you please send the patch to libc-alpha? You may want to check the contribution checklist [1].  Patch discussion happens on the mailing list, and patches attached to bug reports usually do not get followed up on without a proper submission to the list.

[1] https://sourceware.org/glibc/wiki/Contribution%20checklist.
Comment 2 Wilco 2021-10-13 12:36:35 UTC
(In reply to Jon Nettleton from comment #0)

> This does fix the issue I was trying to solve in both my specific test case
> as well as a few real world rendering failures.  Since we are only
> re-ordering stp's there should be no functional or performance regressions,
> and all the glibc test's do pass.

It's not clear what the underlying issue is, but I don't believe reordering stores would be sufficient - it might depend on alignment, the state of the write buffer, cache misses/hits and many other factors that software can't control.

So my first question is whether this works around the issue 100%, i.e. for all possible combinations of src and dst alignment, overlap (for memmove) and sizes?
Comment 3 Jon Nettleton 2021-10-13 16:36:57 UTC
(In reply to Wilco from comment #2)
> (In reply to Jon Nettleton from comment #0)
> 
> > This does fix the issue I was trying to solve in both my specific test case
> > as well as a few real world rendering failures.  Since we are only
> > re-ordering stp's there should be no functional or performance regressions,
> > and all the glibc test's do pass.
> 
> It's not clear what the underlying issue is, but I don't believe reordering
> stores would be sufficient - it might depend on alignment, state of the
> writebuffer, cache miss/hit and many other factors that software can't
> control.
> 
> So my first question is whether this works around the issue 100%, ie. for
> all possible combinations of src and dst alignment, overlap (for memmove)
> and sizes?

This fixes a single issue that we were seeing, reproducible with this modified version of a test program written when the problem was first detected on the Macchiatobin: https://gist.github.com/jnettlet/80f8d09d01c0dc0ffc0122f36ed78de6

The issue addressed here is purely that overlapping writes to device memory of sizes 97-110 would leave the bytes at or immediately after offset 64 either zeroed or unchanged.

This behaviour could be successfully worked around by simply breaking up memcpy's of that size range into smaller copies: https://gist.github.com/jnettlet/80f8d09d01c0dc0ffc0122f36ed78de6#file-fb_test-c-L98
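The chunking workaround can be sketched as follows. This is a hypothetical helper, not the actual code from the gist; the 64-byte chunk threshold is illustrative, chosen only to keep each copy below the problematic 97-110 byte range.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Workaround sketch: split a copy into chunks small enough that the
   library never takes the copy96 path with overlapping stores. */
static void *memcpy_chunked(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    while (n > 0) {
        size_t chunk = n > 64 ? 64 : n;  /* illustrative threshold */
        memcpy(d, s, chunk);
        d += chunk;
        s += chunk;
        n -= chunk;
    }
    return dst;
}
```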

This does not solve all GPU rendering issues (I have another patch for that), but it does resolve the issues in my test program, as well as rendering errors seen in glmark2 -b terrain and in the game minetest, both of which still occurred with my mesa patch applied.  Because these issues persisted with my patched mesa, I believe this is a separate bug.

Here is another developer who reported the issues to me, testing my glibc with this patch included: https://twitter.com/sahajsarup/status/1445813643744985093
Comment 4 Szabolcs Nagy 2021-10-13 18:28:25 UTC
the issue may be how the memory was mapped: it is
invalid to map memory with normal memory attributes
if the hw does not support overlapping writes the
way normal memory is expected to work.
(i'm unfortunately not an expert on this and i don't know
what's going on in the drivers, but any further information
would be helpful so others can understand the issue better)
Comment 5 Jon Nettleton 2021-10-14 04:15:57 UTC
(In reply to Szabolcs Nagy from comment #4)
> the issue may be how the memory was mapped: it is invalid
> to map with normal memory attributes if overlapping writes
> are not supported the way hw expects normal memory to work.
> (i'm unfortunately not an expert on this and i don't know
> what's going on in the drivers, but any further information
> would be helpful so others can understand the issue better)

This is not a new issue.  It was first discussed in 2018 in this very long kernel mailing list thread: https://lore.kernel.org/lkml/87h8k7h8q9.fsf@linux.ibm.com/T/

It was originally written off as a PCIe implementation problem.  However, I have now personally seen the issue on three different combinations of Cortex-A72, CCNs, and PCIe IP, from multiple manufacturers.  The only common part of all the implementations is the Arm Cortex-A72 core.  I will note, though I cannot personally verify it, that the Kunpeng SoC most likely also exhibits this behaviour, because they carry patches similar to the ones I have written.

Someone could spend another 4 years trying to find the exact cause of this behaviour, but since it has sat around ignored for the last 4 years, my guess is that this is not going to happen.  Half of the engineers claim that mapping device memory as uncached on Arm is the same as write-combine, and the other half claim that device memory can never be treated as normal memory.  Truth be told, unless we are going to rework the entire Linux graphics stack, we need some sort of compatibility with how the software currently works.

If we don't want to patch the main aarch64 memcpy, then I will be happy to submit an ifunc implementation.
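For readers unfamiliar with the mechanism being offered here: glibc's real ifuncs use the ELF IFUNC relocation, where a resolver function runs at load time and picks an implementation per CPU. A portable C sketch of the equivalent dispatch looks roughly like this; the function names and the CPU check are hypothetical placeholders, not anything from the proposed patch.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Baseline implementation (stands in for the current aarch64 memcpy). */
static void *memcpy_default(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Would hold the reordered-stp implementation from the attached patch;
   here it just forwards to memcpy so the sketch is self-contained. */
static void *memcpy_a72_reordered(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Placeholder CPU check: real code would inspect MIDR/HWCAP data. */
static int running_on_cortex_a72(void)
{
    return 0;
}

static memcpy_fn resolved_memcpy;

/* Resolve once, then dispatch through the cached function pointer --
   the same effect an ELF IFUNC resolver achieves at relocation time. */
static void *memcpy_ifunc_sketch(void *d, const void *s, size_t n)
{
    if (!resolved_memcpy)
        resolved_memcpy = running_on_cortex_a72()
                              ? memcpy_a72_reordered
                              : memcpy_default;
    return resolved_memcpy(d, s, n);
}
```

The advantage of the ifunc route is that only affected CPUs would pay for (or benefit from) the reordered store sequence, leaving other aarch64 implementations on the existing code path.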
Comment 6 David Nichols 2021-10-14 08:53:53 UTC
This patch fixes real issues on the platforms listed above; without it, users need to run a custom glibc, which is a pain.

The patch has no performance penalty and no impact on other platforms.  Please take into account the benefit to the impacted user base and the risk of regressions (none) when considering this patch for merging.

In the absolute worst case it does no harm, and at this point it looks like finding the root cause in the HW is not going to happen in the near future.
Comment 7 Szabolcs Nagy 2021-10-25 16:46:59 UTC
for the patch to be accepted,
- it has to be posted to libc-alpha
- we need reasonable confidence that it fixes an issue
- there should be no relevant performance regression

looking at
https://lore.kernel.org/lkml/CAK8P3a1CdL5ZVvWPMriOknTCgkfbR4F-a6dRZ7ap9+NiD8H5cg@mail.gmail.com/
and followup mail i'd expect this to be a hw issue
somewhere between the core and the pci framebuffer.

i think we need to document what hw this bug
was observed on (all components involved).

(ideally it would be demonstrated which hw is at fault
e.g. via traces of intermediate protocols or trying
different combinations of hw components.)

(note that glibc is not the only runtime library with
this memcpy code, and the compiler may also generate
similar code patterns)
Comment 8 Jon Nettleton 2021-10-26 07:01:14 UTC
(In reply to Szabolcs Nagy from comment #7)
> for the patch to be accepted,
> - it has to be posted to libc-alpha

Yes we will get there.

> - we need reasonable confidence that it fixes an issue

Well, we have one test program that grew out of the issue originally reported 4 years ago, and 2 real-world programs with multiple confirmations that the patch fixes the issue.

> - there should be no relevant performance regression

Yes, I agree, but I don't see how changing the store order is going to affect performance in this case.  Was the ordering questioned when this implementation of memcpy was submitted?

> 
> looking at
> https://lore.kernel.org/lkml/CAK8P3a1CdL5ZVvWPMriOknTCgkfbR4F-
> a6dRZ7ap9+NiD8H5cg@mail.gmail.com/
> and followup mail i'd expect this to be a hw issue
> somewhere between the core and the pci framebuffer.

> 
> i think we need to document what hw was this bug
> observed on (all components involved).

That has been the assumption, but there are really only a handful of systems that can be used to test and validate this.  It was originally blamed on the Synopsys PCIe IP; however, with the LX2160A V1 silicon we have reproduced the issue on a completely different PCIe IP.  Additionally, the Macchiatobin with the Armada 8040, and the CN913x-based systems, have completely different CCN implementations, with PCIe IP similar to the LX2160's but manufactured differently.  Really the only similarities are the Cortex-A72 cores and how ACE talks to the coherency fabric, which in turn connects to the PCIe over AXI.

> 
> (ideally it would be demonstrated which hw is at fault
> e.g. via traces of intermediate protocols or trying
> different combinations of hw components.)

The only combination I have not tried is different Cortex-A* cores, simply because there aren't many other hardware options to test against.

> 
> (note that glibc is not the only runtime library with
> this memcpy code, and the compiler may also generate
> similar code patterns)

Yes, I understand this, and I already have a patch for the kernel.  The kernel is slightly different, since it does have a specific memcpy_io implementation.  I understand the compiler can also generate similar code patterns, but just changing glibc's memcpy has resolved all the known issues and bugs reported to me on our platforms.