This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function



On 11/10/2017 18:20, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>> LGTM, for the pointer brought by Szabolcs:
>>
>>  - I still prefer to have the macros on one file instead of multiple ones.
>>    It makes checks for all the possible code patch easier and one can
>>    theoretical build the desirable memset by just define the required
>>    macros instead of pulling from source code from different files.
> 
> At the design level, I don't think that's a good idea. It's essential that this
> code is easy to read and understand. It must be 100% correct and is
> extremely performance sensitive to minor changes (eg. instruction order
> or alignment can have major impact). Inline macros and lots of complex
> #ifdefs in the code mean getting this right becomes almost impossible.
> I really don't want to create large blobs of unmaintainable assembly code
> like several other targets.

My idea is to avoid having different code paths in different files that
are 'injected' by macros or ifdefs.  Having all of them in one place is
better, imho, than spreading them over multiple files.  What you are
advocating is a different topic: whether the modifications to the generic
implementation are worthwhile.

> 
> Note we also should benchmark this on various other cores to see what
> the impact is. I wrote this memset code for specific reasons - changing that
> could have a big impact on some cores, so we need to show this doesn't
> cause regressions.

Ideally I would prefer a more concise selection, such as:

   1. A memset that reads dczid_el0 using mrs.  This would be selected
      as the default and thus should not incur any regression.

   2. A memset that reads the ZVA size from a global variable and is
      selected solely for Falkor (ideally it would handle both 64- and
      128-byte cache line sizes).

   3. A no-ZVA variant for the case where the ZVA size is 0.
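As background for the variants above: the ZVA block size comes from the
DCZID_EL0 system register.  The mrs read itself is aarch64-only, but the
decode can be sketched in portable C (field layout per the Arm ARM; the
function name is mine, for illustration only):

```c
#include <stdint.h>
#include <stddef.h>

/* DCZID_EL0 layout: bits 3:0 (BS) hold log2 of the DC ZVA block size
   in 4-byte words; bit 4 (DZP) set means DC ZVA is prohibited.  */
static size_t
zva_size_from_dczid (uint64_t dczid)
{
  if (dczid & 0x10)                    /* DZP: no ZVA (variant 3).  */
    return 0;
  return (size_t) 4 << (dczid & 0xf); /* e.g. BS=4 -> 64 bytes.  */
}
```

With this decode, a raw value of 4 yields 64-byte blocks and 5 yields
128-byte blocks, the two cache line sizes mentioned for variant 2.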

> 
> Also we need to decide on how to read the ZVA size. I don't think there is
> any point in supporting that many options (reading it, using a global, using
> an ifunc and not using ZVA at all). The very first memset had a few of these
> options, and I removed them precisely because it created way too many
> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
> need to use a global too. For the dynamic linker we could just use a basic
> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
> and no ZVA).

Siddhesh can help me out here, but the idea is *precisely* that Falkor is
faster using a global variable than accessing dczid_el0, as shown by its
benchmarks.  The discussion is indeed whether this microarchitectural
difference is worth adding ifunc variants for.

The loader already does not use ifuncs (and afaik there is no support
to do so).
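For illustration, the global-variable scheme discussed above might look
like the following sketch (all names are hypothetical, not glibc's
actual ones; on aarch64 the init step would issue mrs once, while here a
typical raw value is hard-coded so the decode is shown portably):

```c
#include <stdint.h>

/* Hypothetical cached ZVA size, filled in once at startup so the hot
   memset path loads a plain global instead of issuing mrs on every
   call (the read was shown to be comparatively slow on Falkor).  */
static uint64_t zva_size;

static void
init_cpu_features (void)
{
  /* On aarch64 this would be:
       __asm__ ("mrs %0, dczid_el0" : "=r" (dczid));
     A BS field of 4 (log2 of 16 words) means 64-byte ZVA blocks.  */
  uint64_t dczid = 4;   /* assumed value for illustration */
  zva_size = (dczid & 0x10) ? 0 : ((uint64_t) 4 << (dczid & 0xf));
}
```

The trade-off being debated is exactly this: the global costs one extra
load per call on every core, in exchange for avoiding the system
register read that hurts on some microarchitectures.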

> 
> Finally we'll need to look into more detail at the new memcpy benchmarks -
> while looping through memory seems like a good idea, it appears like it
> only increments by 1. So at first sight it's still testing the exact same thing as
> the existing benchmarks - all data is always cached in L1. For memset I guess
> we're still missing a randomized benchmark based on real trace data.

I think we can move this discussion to the benchmark thread itself.

