This is the mail archive of the mailing list for the newlib project.


Re: MMU Off / Strict Alignment

On Wed, Dec 18, 2013 at 7:04 AM, Joel Sherrill wrote:

> I thought it was a fairly common assumption in newlib to avoid
> unaligned accesses in mem/str methods. Unaligned accesses may
> be less efficient or generate a trap which puts more requirements
> on the underlying environment.

The C library specification for memcpy() accepts arguments of type
void *. As such, there is no such thing as an unaligned argument. The
issue here has to do with an optimization in which 16-bit and/or
32-bit load/stores (or in extreme cases, vector unit loads/stores) are
used to optimize the copy.

The operative word there is *optimization*. Conceptually, memcpy()
moves byte sequences. There is no such thing as an unaligned byte

The current implementation uses load/store operations without first
checking the alignment of the incoming pointers. The problem with the
current *discussion* is an unquantified assertion that this is more
efficient than using a test and (possibly mispredicted) branch. Since
that assertion has been made on a number of earlier processors and
invariably turned out to be wrong, I'd really like to see numbers. At
best, even if the MMU supports unaligned loads, they have the effect
of doubling the number of L1 data cache references on most machines.
That will be a win for the first word or two, and a loss for every
word thereafter.
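
To make the test+branch concrete, here is a minimal sketch of a copy that checks alignment once and only then uses 32-bit loads and stores. The name checked_memcpy() and the word_t typedef are hypothetical, this is not newlib's code, and the may_alias attribute is a GCC/clang extension assumed here so that the word accesses are legal under strict aliasing. It says nothing about which variant is faster; that still needs measurement.

```c
#include <stddef.h>
#include <stdint.h>

/* Word type permitted to alias other types (GCC/clang extension; an
   assumption of this sketch). */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Hypothetical name; an illustration, not newlib's memcpy(). */
void *checked_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* One test and branch: take the word path only when both pointers
       are already word aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(word_t) - 1)) == 0) {
        while (n >= sizeof(word_t)) {
            *(word_t *)d = *(const word_t *)s;
            d += sizeof(word_t);
            s += sizeof(word_t);
            n -= sizeof(word_t);
        }
    }

    /* Bytewise copy: the tail of the aligned case, or the whole copy
       when either pointer is not word aligned. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The aligned case falls straight through the test into the word loop, which is the ordering one would want if aligned calls are the common case.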

Several things, I think, are wrong with the current technical argument:

1. There is an assertion that there will be a branch mispredict in the
common case. That's a silly argument. If it is true, then re-order the
assembly code to fix that!

2. Short memcpy/bcopy calls almost invariably involve constant
lengths, and those can be (and usually are) statically "cherry picked"
by GCC, clang, and other modern compilers to use specialized
implementations: for example, to call an unoptimized bytewise copy
implementation of memcpy() in cases where the expected cost of the
test+branch is unwarranted, or even to inline the copy altogether
using an intrinsic, whereupon the compiler can optimize it. The library
routine is then only called (in practice) for a mid-sized to long move,
which is exactly the case for which that test is well justified.

3. All current and anticipated implementations of AARCH perform
multi-issue and some degree of speculative issue. Re-ordering the code
to favor the correct prediction should be sufficient in practice to
let the reservation stations within the processor hide the
(unmeasured) delay of the test+branch. Especially so since there are
other setup instructions at the front of memcpy() and the initial
instructions in a properly optimized memcpy() are all non-dependent
load instructions (which are very speculation-friendly). This
obviously needs measurement.
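
As an illustration of point 2, these are the kinds of short, constant-length calls that an optimizing compiler is free to expand inline instead of calling the library routine. The helper names and the struct are hypothetical. Whether the call is actually inlined depends on the compiler, the optimization level, and options such as -fno-builtin or -ffreestanding, so treat this as a sketch rather than a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
   address. The length is a compile-time constant, so the compiler may
   replace the memcpy() call with whatever load sequence suits the
   target. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical fixed-size record, copied with a constant length. */
struct header {
    uint16_t type;
    uint16_t length;
    uint32_t sequence;
};

static inline void copy_header(struct header *dst, const void *wire)
{
    memcpy(dst, wire, sizeof *dst);
}
```

Neither helper makes any assumption about the alignment of its input pointer, which is the point: the decision about how to do the copy is left to the compiler.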

In the end, you *may* pay an extra cycle or two to implement memcpy()
correctly without reliance on the MMU, but I'd have to see
measurements to be convinced of even that in light of the optimizer
substituting intrinsics.

The real problem with the MMU-reliant implementation is that it is
likely to penalize *correctly* aligned copies in certain cases that
have nothing to do with the bootstrap issue. Example: copying an array
of 16-bit quantities from a (addr % 4 == 2) boundary to another (addr
% 4 == 2) boundary will incur the MMU alignment fixup overhead at both
source and destination if 32-bit or 64-bit load/store operations are
used. On a machine that can merge loads and stores in the LSU (as I
suspect this one can, but I haven't measured it), and which performs
multi-issue (as this one does), a simple loop using 16-bit loads and
stores might well turn out to be faster. An approach using 64-bit
load/store operations and shifts may or may not turn out to be faster.
Performance on superscalar machines is a very entertaining and
perplexing thing; in an algorithm like memcpy(), it's entirely a
function of the number of I/Os off of the processor and the
sophistication of the LSU implementation, not the cleverness of the
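
For the (addr % 4 == 2) to (addr % 4 == 2) example above, one alternative to relying on hardware fixup is to notice that source and destination are misaligned by the same amount, copy a short head bytewise, and then use aligned word accesses. The sketch below assumes 32-bit words and a GCC/clang may_alias typedef; mutual_align_memcpy() is a hypothetical name and the code is untuned. Whether it beats a plain 16-bit loop on a given core is exactly the kind of thing that has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Word type permitted to alias other types (GCC/clang extension; an
   assumption of this sketch). */
typedef uint32_t __attribute__((__may_alias__)) word32_t;

/* Hypothetical name; an illustration only. */
void *mutual_align_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const uintptr_t mask = sizeof(word32_t) - 1;

    /* Word accesses are usable only when source and destination are
       misaligned by the same amount, e.g. both at addr % 4 == 2. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & mask) == 0) {
        /* Head: copy bytes until both pointers reach a word boundary. */
        while (n > 0 && ((uintptr_t)d & mask) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Body: aligned 32-bit loads and stores. */
        while (n >= sizeof(word32_t)) {
            *(word32_t *)d = *(const word32_t *)s;
            d += sizeof(word32_t);
            s += sizeof(word32_t);
            n -= sizeof(word32_t);
        }
    }

    /* Tail, or the whole copy when the misalignments differ. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```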

Corinna asks why we shouldn't just use two library versions. That
could be done. Another possibility is just to have two memcpy()
implementations in the one library, perhaps called memcpy() and
__noMMU_memcpy(). The main issue there is that we will lose all of the
benefit of compiler optimizations, because memcpy() is recognized as a
special case by optimizing compilers. A more serious answer is that we
probably shouldn't assume that bootstrap code and "main" code on an
embedded system are compiled in two different library environments.
Just as an illustration, I've been working on a controller design
lately where we are strongly tempted to move from Cortex M to Cortex A
because of the MMU. Right now the code is compiled as a single blob,
and I wouldn't expect to change that merely because we use an MMU. The
point is that the presence of a required MMU doesn't imply that we're
dealing with a compilation environment in the style of separate
processes. This is especially true for ARM, where MMU-enabled parts
are getting used in traditionally embedded applications with
surprising frequency.

I also think there is a different issue to consider here. Does the
reliance on unaligned loads/stores impact the behavior of code outside
of memcpy()? That is: is the processor placed in a state where *any*
unaligned load/store will be silently honored? If so, that's a very
bad thing, because it removes one of the hardware checks that helps to
identify code errors and certain mistakes made by security
penetrations. It also impedes GC support. So if alignment checks are
disabled in general, I personally feel that that is a priority zero
issue requiring correction. Once corrected, the current implementation
of memcpy() will be a side issue. I'm inclined to think that we
shouldn't exploit a likely-transient security flaw in the memcpy()

All of this aside, my guess is that memcpy() will actually be faster
overall under typical optimization assumptions if use of the MMU is

