This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
- From: Andrew Pinski <pinskia at gmail dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "siddhesh at sourceware dot org" <siddhesh at sourceware dot org>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>
- Date: Wed, 11 Oct 2017 14:44:48 -0700
- Subject: Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
- Authentication-results: sourceware.org; auth=none
- References: <DB6PR0801MB20531D1099A99E5A4DF1E042834A0@DB6PR0801MB2053.eurprd08.prod.outlook.com> <ac874303-6330-1a84-6970-7768ea408e96@linaro.org>
On Wed, Oct 11, 2017 at 2:39 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 11/10/2017 18:20, Wilco Dijkstra wrote:
>> Adhemerval Zanella wrote:
>>> LGTM, for the pointer brought by Szabolcs:
>>>
>>> - I still prefer to have the macros in one file instead of multiple ones.
>>> It makes checking all the possible code paths easier, and one could
>>> theoretically build the desired memset just by defining the required
>>> macros instead of pulling source code from different files.
>>
>> At the design level, I don't think that's a good idea. It's essential that this
>> code is easy to read and understand. It must be 100% correct and is
>> extremely sensitive to minor changes (e.g. instruction order or alignment
>> can have a major impact on performance). Inline macros and lots of complex
>> #ifdefs in the code make getting this right almost impossible.
>> I really don't want to create large blobs of unmaintainable assembly code
>> like several other targets have.
>
> My idea is to avoid having different code paths in different files that
> are 'injected' by macros or ifdefs. Having all of them in one place is
> better imho than spreading them over multiple files. Now, what you are
> advocating is a different topic: whether the modifications to the generic
> implementation are valuable.
>
>>
>> Note we also should benchmark this on various other cores to see what
>> the impact is. I wrote this memset code for specific reasons - changing that
>> could have a big impact on some cores, so we need to show this doesn't
>> cause regressions.
>
> Ideally I would prefer a more concise selection, such as:
>
> 1. A memset that reads dczid_el0 using mrs (the default one).
> This will be selected by default and thus should not incur any
> regression.
>
> 2. A memset that reads the zva size from a global variable and is selected
> solely for falkor (and ideally it would handle both 64- and 128-byte
> cache line sizes).
>
> 3. A no-ZVA variant for the case where the ZVA size is 0.
>
>>
>> Also we need to decide on how to read the ZVA size. I don't think there is
>> any point in supporting that many options (reading it, using a global, using
>> an ifunc and not using ZVA at all). The very first memset had a few of these
>> options, and I removed them precisely because it created way too many
>> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
>> need to use a global too. For the dynamic linker we could just use a basic
>> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
>> and no ZVA).
>
> Siddhesh can help me out here, but the idea is *precisely* that Falkor is
> faster using a global variable than reading dczid_el0, as shown by its
> benchmarks. Now, the discussion is indeed whether this microarchitectural
> difference is worth adding ifunc variants for.
For at least the micro-archs I work with, reading dczid_el0 can be, and
most likely will be, faster than reading from global memory, especially
if the global is not in the L1 cache. This is one case where a
micro-benchmark can fall down: if the global is in the L1 cache the read
takes 3 cycles while reading dczid_el0 takes 4 cycles, but once the
global falls out of the L1 cache the read becomes 10x worse, and it
pollutes the L1 cache as well.
Thanks,
Andrew
>
> For the loader it already does not use ifuncs (and afaik there is no
> support for doing so).
>
>>
>> Finally we'll need to look in more detail at the new memcpy benchmarks -
>> while looping through memory seems like a good idea, it appears to
>> increment by only 1. So at first sight it's still testing exactly the same
>> thing as the existing benchmarks - all data is always cached in L1. For
>> memset I guess we're still missing a randomized benchmark based on real
>> trace data.
>
> I think we can move this discussion to the benchmark thread itself.