[PATCH] more problems with newlib/libc/machine/m68k/memcpy.S
Wed Feb 10 10:30:00 GMT 2010
On Tue, Feb 09, 2010 at 05:10:38PM -0800, Aaron J. Grier wrote:
> On Tue, Feb 09, 2010 at 03:15:11PM +0100, Josef Wolf wrote:
> > 3. -mcpu32 seems to imply -mc68020. So the check for alignment capabilities
> > gives a wrong result for cpu32. BTW: alignment capabilities depend not
> > only on the CPU. It is also dependent on bus width and how the memory is
> > connected.
> it's been this way for years and is arguably incorrect, as neither
> instruction set is a superset of the other. (this is a big can of worms
> with the way 68k is currently set up under gcc. I'd love to fix it, but
> I'm not holding my breath for the necessary funding.)
I don't think there is a proper way to fix this, since it depends not only
on the CPU, but also on how memory is connected. IMHO, the best option is to
be conservative. It's better to have suboptimal code on some CPUs than to get
address errors on others.
BTW: Are there any CPUs out there that would give address errors for
long-word accesses to word-aligned (even) addresses? If not, the ugly
ALIGN_BITS dependency could be dropped.
> even if 8-bit memory is connected to cpu32, I believe the SIM can handle
> 16- and 32-bit transfers automatically, and with lower overhead since it
> can do the transfers back-to-back without intervening instruction fetch.
Yeah, but it still can't do word/long access to odd addresses.
> > IMHO, the correct algorithm would be like this:
> > 1. Align dest in any case, no matter what CPU we have. This will do no
> > harm to any CPU, since all CPUs can write fast to long-word addresses.
> > 2. After dest is aligned, check whether src is aligned also. If it is aligned,
> > we can use optimized algorithm. If not, fall back to bytewise copy. This
> > should have been the response to the error reported in the thread mentioned
> > above.
> > 3. Some hardware (like cpu32 with 16bit bus) can do long-word access to word
> > addresses without speed penalty. With such hardware, having src on an even
> > address is enough to use the optimized algorithm.
> > BTW: I think this depends not only on the CPU core, but also on how memory
> > is connected. I have included 16bit-alignment into the patch anyway.
> > We can drop it if it turns out to be true that dependence on the CPU
> > is the wrong thing to do here.
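For reference, the three steps quoted above could be sketched in C roughly
like this. It is only an illustration, not the patch itself: the real code is
m68k assembly in memcpy.S, and the names sketch_memcpy and SRC_ALIGN are made
up for this example. SRC_ALIGN stands in for the per-CPU alignment requirement
(4 = src must be long-word aligned, 2 = an even src address is enough).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical knob, not taken from memcpy.S:
 * 4 = src must be long-word aligned, 2 = word (even) alignment suffices. */
#ifndef SRC_ALIGN
#define SRC_ALIGN 4
#endif
#if SRC_ALIGN != 2 && SRC_ALIGN != 4
#error "SRC_ALIGN must be 2 or 4"
#endif

void *sketch_memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Step 1: always align dest to a long-word boundary, byte by byte. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Steps 2 and 3: take the long-word path only if src is aligned
     * well enough for this hardware. */
    if (((uintptr_t)s & (SRC_ALIGN - 1)) == 0) {
        while (n >= 4) {
            uint32_t w;
            /* Stands in for "move.l (a0)+,(a1)+"; the 4-byte memcpy calls
             * keep the sketch portable on hosts with stricter alignment. */
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
    }

    /* Tail bytes, or the whole remainder if src was not aligned:
     * fall back to the bytewise copy. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dest;
}
```

Building with -DSRC_ALIGN=2 would model the "even src address is enough" case
from step 3; the default models the conservative case.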
> if you're going to optimize for cpu32, see if you can optimize the copy
> loops into a single word instruction followed by a dbxx instruction.
> this avoids instruction fetches during the loop and increases bus
> throughput substantially.
In the current code, this is already done for the byte-sized copy on
non-ColdFire targets. But the unrolled long-word copy loop will still do the
fetches, because, umm, it is unrolled. Which CPUs actually benefit from
the loop unrolling? You're right: for cpu32 it would actually be better _not_
to unroll the loop.
Can we make a list of which CPUs would benefit from unrolling and which would
be better left with the compact loop?
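To make the two loop shapes under discussion concrete, here is a rough C
rendering of both. The function names are invented for this example, and a C
compiler is of course free to emit something else entirely; the point is only
the shape: one move plus a decrement-and-branch versus four moves per
iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact form: a single move followed by a decrement-and-branch,
 * i.e. the shape of "move.l (a0)+,(a1)+ ; dbra d0,loop". */
void copy_longs_compact(uint32_t *d, const uint32_t *s, size_t count)
{
    while (count-- > 0)
        *d++ = *s++;
}

/* Unrolled form: four moves per iteration, so fewer branches but a
 * larger loop body that has to be fetched on every pass. */
void copy_longs_unrolled(uint32_t *d, const uint32_t *s, size_t count)
{
    while (count >= 4) {
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
        d += 4;
        s += 4;
        count -= 4;
    }
    /* Remaining 0..3 long words. */
    while (count-- > 0)
        *d++ = *s++;
}
```

Both copy the same data; which one is faster is exactly the per-CPU question
above.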
> I have also noticed that there is a point of diminishing returns for
> jumping through alignment hoops. depending on the CPU speed, it may be
> faster to do a zero-overhead byte copy for small transfers rather than
> go through alignment setups.
Does this really depend on CPU speed? I think the CPU type and the presence
of caches are the distinguishing factors here.
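Whatever the deciding factor turns out to be, the idea of skipping the
alignment setup for small transfers might look like the following sketch. The
name copy_with_cutoff and the SMALL_COPY_LIMIT value of 16 are pure
assumptions for illustration; a real threshold would have to be measured per
CPU and memory setup.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff; the right value would need measuring per target. */
#define SMALL_COPY_LIMIT 16

void *copy_with_cutoff(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Only pay for the alignment setup when the transfer is large enough
     * and src and dest can be long-word aligned at the same time. */
    if (n >= SMALL_COPY_LIMIT &&
        (((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
        /* At most 3 bytes here, and n >= SMALL_COPY_LIMIT, so n stays > 0. */
        while (((uintptr_t)d & 3) != 0) {
            *d++ = *s++;
            n--;
        }
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);   /* stands in for a long-word move */
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
    }

    /* Small transfers, unaligned pairs, and tail bytes: plain byte loop. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dest;
}
```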