Re: [PATCH 0/2] Multiarch hooks for memcpy variants

On Mon, Aug 14, 2017 at 6:36 AM, Wilco Dijkstra wrote:
> Zack Weinberg wrote:
>> On Fri, Aug 11, 2017 at 2:53 PM, Siddhesh Poyarekar wrote:
>> > On Friday 11 August 2017 11:36 PM, Zack Weinberg wrote:
>>>>> maybe the generic __memcpy_chk should call the ifunced
>>>>> memcpy so it goes through an extra PLT indirection, but
>>>>> at least less target-specific code is needed.
>>>> I was thinking of making this suggestion myself.  I think that would
>>>> be a better maintainability/efficiency tradeoff.  (Of course, I also
>>>> think we shouldn't bypass ifuncs for intra-libc calls.)
>>> That was my initial approach, but I was under the impression that PLTs
>>> in internal calls were frowned upon, hence the ifuncs similar to what's
>>> done in x86.  If this is acceptable, I could do more tests to check
>>> gains within the library if we were to call memcpy via ifunc.
>> There's been a bunch of inconclusive arguments about this in the past.
>> If you have the time and the resources to do some thorough testing and
>> properly resolve the question, that would be really great.
> I don't believe you can resolve this generally; it's highly dependent on the details.
> If the generic implementation is very efficient, the possible gain of specialized
> ifuncs may be so low that it can never offset the overhead of an ifunc. Also note
> that you're always slowing down the generic case, so if that version is used in
> many cases, an ifunc wouldn't make sense.
> I haven't looked in detail at memcpy use in GLIBC; however, if the statistics are
> similar to the typical use I measured, then it makes no sense to use ifuncs. Large
> copies can benefit from special tweaks, and in that case the overhead of an ifunc
> would be much smaller (both relatively and absolutely due to lower frequency), so
> that's where an ifunc might be useful.

Last time we had this argument, someone (Ondrej?) claimed that the
overhead of going through an ifunc for intra-libc calls (specifically
to memcpy, IIRC) was dwarfed by the I-cache costs of having both the
generic and the targeted version of the function get used. I would
really like to see measurements addressing that specific point.
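
For concreteness, the "generic __memcpy_chk calls the ifunced memcpy"
option quoted above would look roughly like the sketch below. The name
sketch_memcpy_chk is hypothetical, and abort() only stands in for
whatever failure path the real fortify code takes; this is an
illustration of the shape, not the code in the tree.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a generic, target-independent
   __memcpy_chk.  DSTLEN is the size of the destination object as
   known to the compiler at the call site.  */
void *
sketch_memcpy_chk (void *dst, const void *src, size_t len, size_t dstlen)
{
  /* Fortify check: refuse to write past the end of the destination.
     abort () is an assumption standing in for the real failure path.  */
  if (dstlen < len)
    abort ();

  /* Ordinary call to memcpy.  In a build where memcpy is an ifunc and
     intra-libc calls are not bypassed, this is the call that would pay
     the extra PLT indirection and land in whichever variant the
     resolver selected.  */
  return memcpy (dst, src, len);
}
```

The cost under discussion is that single indirect branch per call,
weighed against not having a per-target __memcpy_chk for every memcpy
variant. Whether that branch or the I-cache footprint dominates is
exactly what would need measuring.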


