This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Fwd: Re: [PATCH] Remove unnecessary IFUNC dispatch for __memset_chk.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Zack Weinberg <zackw at panix dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Tue, 11 Aug 2015 22:24:44 +0200
- Subject: Re: Fwd: Re: [PATCH] Remove unnecessary IFUNC dispatch for __memset_chk.
- Authentication-results: sourceware.org; auth=none
- References: <55CA4842 dot 5020402 at panix dot com> <55CA4932 dot 8040309 at panix dot com>
On Tue, Aug 11, 2015 at 03:12:50PM -0400, Zack Weinberg wrote:
> [I *was* going to attach a nice graph of this data to this message, but
> apparently the mailing list won't let me do that.]
>
> On 08/11/2015 05:32 AM, Ondřej Bílka wrote:
> > It's actually very easy to see the impact of PLT bypassing
> > there. For memset you have the problem that calloc(1024) could be
> > considerably faster; you just need to read the benchtests below. As
> > builtin_memset got compiled into jmp memset@plt, it shows that the overhead
> > isn't noticeable. Same with memcpy, which gets called in realloc with big
> > arguments. I could dig up more cases.
> >
> > builtin_memset simple_memset __memset_sse2 __memset_avx2
> >
> > Length 1, alignment 0, c -65: 15.4062 7.17188 10.9375 10.2031
> > Length 2, alignment 0, c -65: 13.5 8.89062 11.3125 9.64062
> > Length 4, alignment 0, c -65: 12.9844 12.0938 11.0312 8.84375
> > Length 8, alignment 0, c -65: 11.7344 15.5469 10.6094 7.64062
> > Length 16, alignment 0, c -65: 15.0781 23.9688 10.3281 10.4219
> > Length 32, alignment 0, c -65: 14.7031 37.4219 9.57812 10.7031
> > Length 64, alignment 0, c -65: 14 80.8438 9.6875 9.21875
> > Length 128, alignment 0, c -65: 18.1094 137.812 15.3125 19.0781
> > Length 256, alignment 0, c -65: 15.2656 272.141 21.2656 11.7812
> > Length 512, alignment 0, c -65: 19.2656 502.469 34.3594 18.6562
> > Length 1024, alignment 0, c -65: 32.7188 940.766 63.6719 31.2812
> > Length 2048, alignment 0, c -65: 61.7188 1880.83 121 60.7812
> > Length 4096, alignment 0, c -65: 118.172 3718.7 239.469 118.641
> > Length 8192, alignment 0, c -65: 255.141 7373.38 469.422 252.125
> > Length 16384, alignment 0, c -65: 484.359 15742.4 1478.39 481.812
> > Length 32768, alignment 0, c -65: 990.562 29551.1 1978.39 966.047
> > Length 65536, alignment 0, c -65: 6163.86 64354.7 5663.06 5779.97
> > Length 131072, alignment 0, c -65: 12244.4 129994 11414.3 11640.7
>
> If I understood you correctly, the difference between builtin_memset and
> __memset_avx2 should be exactly the PLT overhead, and the other two are
> just in there for comparison.
I wrote there that it's worse than just the PLT overhead: you have the extra
overhead of one jump. The assembly is:
0000000000401960 <builtin_memset>:
401960: e9 3b f8 ff ff jmpq 4011a0 <memset@plt>
401965: 90 nop
401966: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40196d: 00 00 00
> I would not call a difference of
> approximately four microseconds on short calls to memset
> "isn't noticeable". (Are these numbers microseconds?)
>
Yes, these are microseconds, but I was talking about memset in calloc(1024)
and larger, where benchtests start giving semi-accurate data. Mainly, as
these tests have been present for a long time, we couldn't make the excuse
that we didn't know.

What benchtest produces for sizes up to 256 bytes is complete garbage.
No application will have such a profile, as the test doesn't measure branch
misprediction and other factors. If you do more accurate measurements,
you will find that setting one byte is very expensive, as it's slower
than setting 128 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memset_profile_avx/results_gcc/result.html
> A microbenchmark cannot address the question of whether having both the
> SSE2 and AVX2 implementations of memset in the cache measurably harms
> *overall* performance.
>
Correct, that's why we need whole-system profiling. Just measuring
memset running time like my profiler does isn't enough, as both implementations
could always occupy the cache but the additional cache misses would land in the rest
of the program: if both occupied a kilobyte, there would be only 30 KB of instruction cache left instead of 31 KB.
> I observe that AVX only starts to be a consistent win vs SSE2 at about
> 256 bytes. Very small memsets should of course be being inlined, but I
> wonder if a unified implementation that doesn't bother with AVX2 for
> fewer than 256 bytes, and internally tests the CPU features for larger
> blocks, would wind up being better overall. (HJ just posted patches
> that would make testing the CPU features every single time quite cheap.)
> If that turns out to be the case for memcpy and memmove as well, maybe
> this entire IFUNC mess could just be junked.
>
No, that would again defeat the point of IFUNC, where you don't have to spend
any cycles on checks, as you do the PLT indirection anyway.