[PATCH 2/5] benchtests: Add new random cases to bench-memcpy-random.c

Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Thu Aug 26 17:06:29 GMT 2021


Hi Noah,

> Since on x86_64 at least we use memmove to implement memcpy, there are a
> few points where the logic revolves around src < dst. Particularly in L(more_8x_vec),
> which branches on the condition without checking for overlap. I don't think it's
> particularly fair for the random tests to essentially ignore that branch, and IMO
> testing it is a better reflection of real-world usage. An implementation could add
> some really branch-heavy code in the case ignored by the current implementation
> that looked good on the fixed benchmarks but would perform worse in the randomized tests.

Yes, such branches are a bad idea in general, particularly for small sizes. However,
I don't understand why you think that branch is ignored, given that the current random
tests already exercise it (so you should see it being expensive in profiles, and you
can easily check whether doing an overlap test ends up faster).
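
For reference, the overlap test I have in mind is a single unsigned compare;
a minimal sketch (byte loops for brevity, not the glibc code):

  #include <stddef.h>
  #include <stdint.h>

  /* Only branch on direction when the regions actually overlap: a
     forward copy is safe unless dst lands strictly inside
     [src, src + n), which one unsigned subtract detects.  */
  void *
  sketch_memmove (void *dstv, const void *srcv, size_t n)
  {
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;

    if ((uintptr_t) dst - (uintptr_t) src >= n)
      {
        /* No overlap, or dst <= src: forward copy is safe.  */
        for (size_t i = 0; i < n; i++)
          dst[i] = src[i];
      }
    else
      {
        /* dst falls inside the source region: copy backward.  */
        for (size_t i = n; i > 0; i--)
          dst[i - 1] = src[i - 1];
      }
    return dstv;
  }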

So why would there need to be any changes to the benchmark?

> For example, in cases where src/dst 4k-alias, forward vs. backward looping can make a big difference.
> On fixed tests it may be worth adding branches that check for that, but a randomized
> environment where those branches can miss could say otherwise.

Again the existing benchmark should already test those branches - if they
execute frequently and are hard to predict, the results will be worse.
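
The sort of branch being discussed could be sketched like this (purely
illustrative; the 64-byte threshold is made up, this is not the x86_64 code):

  #include <stdint.h>

  /* Illustrative predicate: true when dst trails src by a small
     distance modulo 4 KiB - the pattern where a forward copy keeps
     stalling on 4k aliasing / store-forwarding.  */
  static int
  aliases_4k (const void *dst, const void *src)
  {
    return (((uintptr_t) dst - (uintptr_t) src) & 4095) < 64;
  }

If such a branch mispredicts often under randomized inputs, it costs more
than it saves - which is exactly what the random benchmark measures.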


> I generally agree. But on the flip side, there is essentially no way to tie-break different
> alignment methods. For example, on x86_64, aligning to 1x vs 2x VECs will improve about
> 50% of the fixed tests and harm the other 50%. I think this can be useful for determining
> "which method is expected to yield better results". There are other, more branch-heavy
> alignment configurations, which I don't think fixed tests would be able to accurately
> benchmark, as they would be 100% predicted.
 
Yes, the fixed tests are not very good for tuning at all, since they completely ignore the
effects of branch prediction. I agree adding fixed sizes might be useful for alignment
tuning, but then it's best to keep to smaller sizes, since that's where the alignment
overhead is largest (say from 128 bytes to 4KB rather than 4KB to 128KB).
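
E.g. something like this (do_fixed_size_test is a hypothetical helper,
just to show the range I mean):

  /* Hypothetical sweep of the small fixed sizes where alignment
     handling dominates the cost.  */
  for (size_t size = 128; size <= 4096; size *= 2)
    do_fixed_size_test (&json_ctx, size);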

+  uint64_t src : 26;
+  uint64_t dst : 26;

>> This doesn't make any sense, where do we need offsets larger than 2^24?
>
> We need the extra bits to store max_size offsets to alternate the direction.

You don't need the extra bits, 24 bits is more than enough.
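
For reference, my reading of the current descriptor in bench-memcpy-random.c
is that it packs into a single 64-bit word, and 2^24 already covers 16 MiB of
offset; widening both offsets to 26 bits (26 + 26 + 16 = 68) would no longer
fit:

  typedef struct
  {
    uint64_t src : 24;   /* Offsets up to 16 MiB.  */
    uint64_t dst : 24;
    uint64_t len : 16;   /* Copy lengths up to 64 KiB.  */
  } copy_t;              /* 24 + 24 + 16 = 64 bits: one word.  */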

>> This doesn't make sense since this is now copying outside of the region,
>> effectively enlarging the region but also always reading from one part and
>> writing to another. The data outside the region is uninitialized which means
>> it is using the zero page which gives unrepresentative results.
>
> I see. I didn't quite grasp the purpose of the max_size. (As seen by the
> fact that I renamed "length" to "max-alignment", now "region-size"). 
>
> Changed it to use max_size instead of MAX_TEST_SIZE and initialize
> 3 * max_size.

That still doesn't solve the issue. Basically the original test runs within
a block of max_size. You're effectively doubling or tripling it which means
the results are not comparable with the original code, and the results
cannot be compared between the variants in the new version either.
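
For comparison, the original generator keeps both offsets inside one max_size
block, roughly like this (paraphrased from memory, not the exact source):

  /* max_size is a power of two; masking keeps src and dst in the
     same max_size region, so every run has the same working set.  */
  copy[i].dst = rand () & (max_size - 1);
  copy[i].src = rand () & (max_size - 1);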

> I think it's better to expand the region size to 3x max_size as opposed to 2x
> max_size and have noise for store-forwarding (on x86 at least).

I don't see how that helps. The point of the benchmark is to be more
real-world than the fixed benchmarks. This means you will see lots of
mispredictions, cache misses and other penalties. Yes, it will tell you if you
have a bad memcpy implementation. And that is a good thing rather than
something to be avoided.

+    /* Create a random set of copies with the given size and alignment
      distributions.  */
   for (i = 0; i < MAX_COPIES; i++)
     {
+      dst_offset  = dst_gt_src == -1
+                        ? (rand() & 1) ? MAX_TEST_SIZE + getpagesize() : 0
+                        : dst_offset;

> I don't quite understand what you mean? I changed it to be
> 2 * max_size in v2, and src_offset always has a value of
> max_size if src > dst or randomized. This should prevent
> overlaps, as essentially src has its own max_size buffer
> and dst has two possible max_size buffers, one below src
> and one above.

My point is that this doesn't make sense even if we ignore issues in
the implementation. Basically it doubles the region size for 2 of
the 3 variants and triples it for the other, so they're not comparable.
When you can't even come to any conclusion about that src < dst
branch, what use does all this have?

The other thing is that with splitting the buffer into read-only and
write-only we go a step back towards fixed tests. A memcpy often
reads data that has recently been written to, so doing both reads
and writes from the same buffer is closer to real world scenarios.
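
Concretely, replaying the copies with both pointers into the same buffer
looks roughly like this (a sketch of the existing loop, details from memory):

  /* dst and src alias the same region, so a copy can read bytes an
     earlier copy wrote - store-to-load forwarding and cache state
     behave as they would in a real workload.  */
  for (int i = 0; i < MAX_COPIES; i++)
    CALL (impl, buf1 + copy[i].dst, buf1 + copy[i].src, copy[i].len);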

> Fixed in v2

Is that this version (which doesn't appear to have fixed it) or are you
planning to post another one?

https://sourceware.org/pipermail/libc-alpha/2021-August/130493.html

 void
+do_test (json_ctx_t *json_ctx, size_t max_size, int dst_gt_src)
+{
+  size_t n;
+  memset (buf1, 1, max_size);

>> Note this doesn't initialize the extra data in the buffer which creates
>> odd performance behaviour due to the zero page.
>
> Fixed in v2.

The memset looks fixed indeed. However, there are now more bugs in the
fixed-size case, where it does:

+  n = init_copy(3 * max_size, dst_gt_src);
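
To be clear, if the copies are generated over 3 * max_size then the whole
3 * max_size region also has to be initialized, otherwise part of the working
set is still backed by the zero page - e.g. (a sketch of the consistency
requirement, not an endorsement of the 3x region):

  /* Whatever region init_copy draws offsets from must be the region
     that gets initialized.  */
  memset (buf1, 1, 3 * max_size);
  n = init_copy (3 * max_size, dst_gt_src);

Though as said above, growing the region at all makes the variants
incomparable.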

> I misunderstood the purpose of "length", which is essentially the total
> region size. IMO "length" in the context of memcpy generally refers to
> copy size. Renamed it to "region-size" and set it to the full region size
> used, which previously was 2 * max_size, now 3 * max_size.

"region-size" sounds reasonable, but you might need to update the
XML template to allow all these changes too.

>> It's much easier to change the for loop to i = 4096 to MAX_TEST_SIZE
>> and remove the various i * 1024. Then changing MAX_TEST_SIZE just works.
> Fixed in v2.
 
+  for (int i = 4096; i < MAX_TEST_SIZE; i = i * 2)

That looks much better indeed, but please use i <= MAX_TEST_SIZE.
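
I.e., assuming the v2 do_test signature quoted above:

  /* Inclusive bound so MAX_TEST_SIZE itself gets tested.  */
  for (int i = 4096; i <= MAX_TEST_SIZE; i = i * 2)
    do_test (&json_ctx, i, dst_gt_src);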

Cheers,
Wilco

