This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] x86-64: Implement strcpy family IFUNC selectors in C
On Sun, Jun 11, 2017 at 7:31 AM, Zack Weinberg <zackw@panix.com> wrote:
> On 06/11/2017 10:21 AM, H.J. Lu wrote:
>> On Sun, Jun 11, 2017 at 7:01 AM, Zack Weinberg <zackw@panix.com> wrote:
>>> On 06/11/2017 09:50 AM, H.J. Lu wrote:
>>>> Implement strcpy family IFUNC selectors in C.
>>>>
>>>> All internal calls within libc.so can use IFUNC on x86-64 since unlike
>>>> x86, x86-64 doesn't need to reserve a register to make a PLT call. For
>>>> libc.a, we can't use IFUNC for functions which are called before IFUNC
>>>> has been initialized. Using IFUNC internally reduces the icache footprint
>>>> since libc.so and other code in the process use the same implementations.
>>>> The patch uses IFUNC for strcpy family functions within libc.
>>>>
>>>> Any comments?
>>>
>>> I like the idea, but I don't understand ifuncs nearly well enough to
>>> comment on your code. I recall _strenuous_ objections to this concept
>>> in the past; please post some performance numbers or something.
>>
>> I don't believe there is a benchmark where hot spots are within libc.so.
>> BTW, if there are such benchmarks, I'd love to know.
>>
>> The main benefit is to reduce icache footprint at a price of an indirect
>> branch via PLT.
>
> Could you maybe produce numbers on the reduced icache footprint? perf
> should be able to capture that...
>
Use IFUNC for memcpy, memset, memcmp and strlen inside libc.so:
[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             6,003      L1-icache-load-misses:u

       0.075482253 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             5,608      L1-icache-load-misses:u

       0.072053877 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             6,489      L1-icache-load-misses:u

       0.076816497 seconds time elapsed
[hjl@gnu-tools-1 build-x86_64-linux]$
Use SSE2 for memcpy, memset, memcmp and strlen inside libc.so:
[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             6,763      L1-icache-load-misses:u

       0.077086408 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             5,920      L1-icache-load-misses:u

       0.078835639 seconds time elapsed

[hjl@gnu-tools-1 build-x86_64-linux]$ perf stat -e L1-icache-load-misses ./string/inl-tester -- --direct
No errors.

 Performance counter stats for './string/inl-tester -- --direct':

             9,884      L1-icache-load-misses:u

       0.075682297 seconds time elapsed
[hjl@gnu-tools-1 build-x86_64-linux]$
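
For readers who want to compare the two sets of runs, here is a small sketch that summarizes the six L1-icache-load-misses counts quoted above. The helper names (mean_misses, median3) and the array names are made up for illustration; only the numbers come from the perf output. With these inputs the medians work out to 6,003 (IFUNC) and 6,763 (SSE2), and the means to roughly 6,033 and 7,522. Three runs per configuration is a small sample, and the 9,884 run weighs heavily on the SSE2 mean, so treat this as a rough reading rather than a measurement.

```c
#include <stddef.h>

/* Counts copied from the perf stat output above.  */
const long ifunc_runs[3] = { 6003, 5608, 6489 };
const long sse2_runs[3] = { 6763, 5920, 9884 };

/* Arithmetic mean of N samples (hypothetical helper).  */
double
mean_misses (const long *samples, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += (double) samples[i];
  return sum / (double) n;
}

/* Median of exactly three samples (hypothetical helper).  */
long
median3 (long a, long b, long c)
{
  if ((a <= b && b <= c) || (c <= b && b <= a))
    return b;
  if ((b <= a && a <= c) || (c <= a && a <= b))
    return a;
  return c;
}
```

Calling median3 (sse2_runs[0], sse2_runs[1], sse2_runs[2]) gives the SSE2 median; the median is less sensitive to the one outlying run than the mean is.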
Another issue with not using IFUNC inside of libc.so:
void *
__memccpy (void *dest, const void *src, int c, size_t n)
{
  void *p = memchr (src, c, n);

  if (p != NULL)
    return __mempcpy (dest, src, p - src + 1);

  memcpy (dest, src, n);
  return NULL;
}
Without IFUNC inside libc.so, these calls always bind to the SSE2
versions of mempcpy and memcpy, even when faster versions may be
available.
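
To illustrate what an IFUNC selector written in C looks like in general, here is a minimal sketch using GCC's ifunc attribute together with __builtin_cpu_supports. It is not the code from the patch: my_copy, resolve_my_copy, copy_baseline and copy_avx2 are hypothetical names, and the "AVX2" variant is only a stand-in with the same body as the baseline. The sketch assumes GCC or Clang targeting x86-64 ELF; on other targets it falls back to a plain wrapper.

```c
#include <stddef.h>

/* Baseline byte copy; stands in for the SSE2 version (hypothetical).  */
static void *
copy_baseline (void *dest, const void *src, size_t n)
{
  unsigned char *d = dest;
  const unsigned char *s = src;
  while (n-- > 0)
    *d++ = *s++;
  return dest;
}

/* Stand-in for a faster variant.  It does not use AVX2 instructions;
   it only marks where such an implementation would go.  */
static void *
copy_avx2 (void *dest, const void *src, size_t n)
{
  return copy_baseline (dest, src, n);
}

#if defined (__GNUC__) && defined (__x86_64__) && defined (__ELF__)
/* Resolver: runs once, when the symbol is resolved, and returns the
   implementation that every later call to my_copy will reach.  */
static void *(*resolve_my_copy (void)) (void *, const void *, size_t)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx2") ? copy_avx2 : copy_baseline;
}

void *my_copy (void *dest, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_my_copy")));
#else
/* Fallback for targets where the sketch's assumptions do not hold.  */
void *
my_copy (void *dest, const void *src, size_t n)
{
  return copy_baseline (dest, src, n);
}
#endif
```

A caller simply writes my_copy (dst, src, len); the choice of implementation is made by the resolver, not at the call site, which is why callers inside and outside a library can end up sharing one implementation.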
--
H.J.