This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86-64: Implement strcpy family IFUNC selectors in C


On Sun, Jun 11, 2017 at 7:31 AM, Zack Weinberg <zackw@panix.com> wrote:
> On 06/11/2017 10:21 AM, H.J. Lu wrote:
>> On Sun, Jun 11, 2017 at 7:01 AM, Zack Weinberg <zackw@panix.com> wrote:
>>> On 06/11/2017 09:50 AM, H.J. Lu wrote:
>>>> Implement strcpy family IFUNC selectors in C.
>>>>
>>>> All internal calls within libc.so can use IFUNC on x86-64 since unlike
>>>> x86, x86-64 doesn't need to reserve a register to make a PLT call.  For
>>>> libc.a, we can't use IFUNC for functions which are called before IFUNC
>>>> has been initialized.  Using IFUNC internally reduces the icache footprint
>>>> since libc.so and other code in the process use the same implementations.
>>>> The patch uses IFUNC for strcpy family functions within libc.
>>>>
>>>> Any comments?
>>>
>>> I like the idea, but I don't understand ifuncs nearly well enough to
>>> comment on your code.  I recall _strenuous_ objections to this concept
>>> in the past; please post some performance numbers or something.
>>
>> I don't believe there is a benchmark where hot spots are within libc.so.
>> BTW, if there are such benchmarks, I'd love to know.
>>
>> The main benefit is to reduce icache footprint at a price of an indirect
>> branch via PLT.
>
> Could you maybe produce numbers on the reduced icache footprint?  perf
> should be able to capture that...
>

Use IFUNC for memcpy, memset, memcmp and strlen inside libc.so:

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             6,003      L1-icache-load-misses:u

       0.075482253 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             5,608      L1-icache-load-misses:u

       0.072053877 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             6,489      L1-icache-load-misses:u

       0.076816497 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$

Use SSE2 for memcpy, memset, memcmp and strlen inside libc.so:

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             6,763      L1-icache-load-misses:u

       0.077086408 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             5,920      L1-icache-load-misses:u

       0.078835639 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             9,884      L1-icache-load-misses:u

       0.075682297 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$
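
For a rough side-by-side view, the three L1-icache-load-misses counts from each configuration above can be averaged. The snippet below is only a sketch of that arithmetic; the helper name mean_misses and the two array names are made up for illustration, and the values are copied from the perf output above. With three runs per configuration and one large outlier (9,884), the result is noisy and should be read as an indication, not a measurement.

```c
#include <stddef.h>

/* Hypothetical helper: arithmetic mean of a set of miss counts.
   Returns 0.0 for an empty set.  */
static double
mean_misses (const unsigned long *runs, size_t n)
{
  unsigned long sum = 0;

  for (size_t i = 0; i < n; i++)
    sum += runs[i];
  return n != 0 ? (double) sum / (double) n : 0.0;
}

/* L1-icache-load-misses:u values copied from the perf runs above.  */
static const unsigned long ifunc_runs[] = { 6003, 5608, 6489 };
static const unsigned long sse2_runs[] = { 6763, 5920, 9884 };
```

Calling mean_misses (ifunc_runs, 3) gives about 6033 misses and mean_misses (sse2_runs, 3) gives about 7522.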

Another issue with not using IFUNC inside of libc.so:

void *
__memccpy (void *dest, const void *src, int c, size_t n)
{
  void *p = memchr (src, c, n);

  if (p != NULL)
    return __mempcpy (dest, src, p - src + 1);

  memcpy (dest, src, n);
  return NULL;
}

Because these internal calls bypass IFUNC, the SSE2 versions of mempcpy
and memcpy are always used, even when faster versions may be available.
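
For readers unfamiliar with the mechanism, here is a minimal sketch of an IFUNC selector written in C, using GCC's ifunc attribute on an ELF/GNU toolchain. It is not the code from the patch: my_strcpy, my_strcpy_baseline, my_strcpy_fast, and my_strcpy_resolver are hypothetical names, the "fast" variant is only a stand-in for a CPU-specific implementation, and the choice of AVX2 as the selection criterion is an assumption made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical baseline variant: a plain byte loop standing in for
   the SSE2 implementation.  */
static char *
my_strcpy_baseline (char *dest, const char *src)
{
  char *d = dest;

  while ((*d++ = *src++) != '\0')
    continue;
  return dest;
}

/* Hypothetical "fast" variant: a stand-in for a CPU-specific
   implementation; here it simply defers to strlen and memcpy.  */
static char *
my_strcpy_fast (char *dest, const char *src)
{
  return memcpy (dest, src, strlen (src) + 1);
}

typedef char *(*my_strcpy_fn) (char *, const char *);

/* The resolver runs once, while relocations are processed, and
   returns the implementation that my_strcpy will be bound to.  */
static my_strcpy_fn
my_strcpy_resolver (void)
{
#if defined (__x86_64__) || defined (__i386__)
  /* GCC requires this call before __builtin_cpu_supports when used
     inside an IFUNC resolver.  */
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    return my_strcpy_fast;
#endif
  return my_strcpy_baseline;
}

/* Every call to my_strcpy goes through the selected implementation.  */
char *my_strcpy (char *dest, const char *src)
  __attribute__ ((ifunc ("my_strcpy_resolver")));
```

A caller just writes my_strcpy (buf, "text"); the same selected variant is then shared by all callers that go through the symbol, which is the property the icache argument above relies on.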

-- 
H.J.

