Re: Fwd: Re: [PATCH] Remove unnecessary IFUNC dispatch for __memset_chk.


On Tue, Aug 11, 2015 at 03:12:50PM -0400, Zack Weinberg wrote:
> [I *was* going to attach a nice graph of this data to this message, but
> apparently the mailing list won't let me do that.]
> 
> On 08/11/2015 05:32 AM, Ondřej Bílka wrote:
> > It's actually very easy to see the impact of PLT bypassing
> > there.  For memset you have the problem that calloc(1024) could be
> > considerably faster; you just need to read the benchtests below.  As
> > builtin_memset got compiled into a jmp to memset@plt, it shows that the
> > overhead isn't noticeable.  The same holds for memcpy, which gets called
> > in realloc with big arguments.  I could dig up more cases.
> > 
> >  builtin_memset  simple_memset __memset_sse2 __memset_avx2
> > 
> > Length    1, alignment  0, c -65: 15.4062 7.17188 10.9375 10.2031
> > Length    2, alignment  0, c -65: 13.5  8.89062 11.3125 9.64062
> > Length    4, alignment  0, c -65: 12.9844 12.0938 11.0312 8.84375
> > Length    8, alignment  0, c -65: 11.7344 15.5469 10.6094 7.64062
> > Length   16, alignment  0, c -65: 15.0781 23.9688 10.3281 10.4219
> > Length   32, alignment  0, c -65: 14.7031 37.4219 9.57812 10.7031
> > Length   64, alignment  0, c -65: 14  80.8438 9.6875  9.21875
> > Length  128, alignment  0, c -65: 18.1094 137.812 15.3125 19.0781
> > Length  256, alignment  0, c -65: 15.2656 272.141 21.2656 11.7812
> > Length  512, alignment  0, c -65: 19.2656 502.469 34.3594 18.6562
> > Length 1024, alignment  0, c -65: 32.7188 940.766 63.6719 31.2812
> > Length 2048, alignment  0, c -65: 61.7188 1880.83 121 60.7812
> > Length 4096, alignment  0, c -65: 118.172 3718.7  239.469 118.641
> > Length 8192, alignment  0, c -65: 255.141 7373.38 469.422 252.125
> > Length 16384, alignment  0, c -65:  484.359 15742.4 1478.39 481.812
> > Length 32768, alignment  0, c -65:  990.562 29551.1 1978.39 966.047
> > Length 65536, alignment  0, c -65:  6163.86 64354.7 5663.06 5779.97
> > Length 131072, alignment  0, c -65: 12244.4 129994  11414.3 11640.7
> 
> If I understood you correctly, the difference between builtin_memset and
> __memset_avx2 should be exactly the PLT overhead, and the other two are
> just in there for comparison. 

I wrote there that it's worse than the PLT overhead: you have the extra
overhead of one jump.  The assembly is

0000000000401960 <builtin_memset>:
  401960:       e9 3b f8 ff ff          jmpq   4011a0 <memset@plt>
  401965:       90                      nop
  401966:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40196d:       00 00 00 



> I would not call a difference of
> approximately four microseconds on short calls to memset
> "isn't noticeable".  (Are these numbers microseconds?)
> 
Yes, these are microseconds, but I was talking about memset in calloc(1024)
and larger, where the benchtests start giving semi-accurate data.  Mainly,
since these tests have been present for a long time, we couldn't make the
excuse that we didn't know.

What the benchtest produces for sizes up to 256 bytes is complete garbage.
No application will have such a profile, as the benchtest doesn't measure
branch misprediction and other factors.  If you do more accurate
measurements, you will find that setting one byte is very expensive, as it's
slower than setting 128 bytes.

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memset_profile_avx/results_gcc/result.html
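
To make the misprediction point concrete, here is a minimal sketch of the
kind of comparison I mean (an illustration, not the glibc benchtest; the
buffer size, iteration count, and size distribution are arbitrary): time
memset at one fixed size, then time it again with sizes drawn from a
pre-generated random mix, so the size branches inside memset cannot simply
be learned by the predictor:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ITERS 1000000

static double
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int
main (void)
{
  /* Call through a volatile pointer so the compiler cannot inline memset
     or fold the constant size away.  */
  void *(*volatile do_memset) (void *, int, size_t) = memset;
  static unsigned char buf[256];
  static size_t sizes[ITERS];
  double t0, fixed, mixed;
  int i;

  /* Fixed size: the branch predictor sees the same path every iteration,
     roughly what a fixed-size benchtest entry measures.  */
  t0 = now_ns ();
  for (i = 0; i < ITERS; i++)
    do_memset (buf, 0, 1);
  fixed = (now_ns () - t0) / ITERS;

  /* Pre-generated random sizes in [1, 128]: closer to a real call-site
     mix, so mispredicted size branches are part of the cost.  */
  srand (1);
  for (i = 0; i < ITERS; i++)
    sizes[i] = 1 + (size_t) (rand () % 128);
  t0 = now_ns ();
  for (i = 0; i < ITERS; i++)
    do_memset (buf, 0, sizes[i]);
  mixed = (now_ns () - t0) / ITERS;

  printf ("fixed 1-byte memset:       %.2f ns/call\n", fixed);
  printf ("random 1..128-byte memset: %.2f ns/call\n", mixed);
  return 0;
}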

> A microbenchmark cannot address the question of whether having both the
> SSE2 and AVX2 implementations of memset in the cache measurably harms
> *overall* performance.
> 
Correct, that's why we need whole-system profiling.  And just measuring
memset running time, as my profiler does, isn't enough: both implementations
could stay resident in the cache, but the additional cache misses would show
up in the rest of the program, since if both occupy a kilobyte there is only
30 kB of instruction cache left for it instead of 31 kB.


> I observe that AVX only starts to be a consistent win vs SSE2 at about
> 256 bytes.  Very small memsets should of course be being inlined, but I
> wonder if a unified implementation that doesn't bother with AVX2 for
> fewer than 256 bytes, and internally tests the CPU features for larger
> blocks, would wind up being better overall.  (HJ just posted patches
> that would make testing the CPU features every single time quite cheap.)
> If that turns out to be the case for memcpy and memmove as well, maybe
> this entire IFUNC mess could just be junked.
>
No, that would again defeat the point of IFUNC, which is that you don't have
to spend any cycles on checks, since you pay for the PLT indirection anyway.
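
To illustrate the difference, here is a minimal sketch (placeholder bodies,
not glibc's actual code) of IFUNC dispatch using GCC's ifunc attribute: the
CPU-feature check runs once in the resolver at relocation time, and every
later call goes straight through the PLT to the chosen variant, whereas a
unified implementation would redo some check on every call:

#include <stddef.h>

typedef void *(*memset_fn) (void *, int, size_t);

static void *my_memset_sse2 (void *s, int c, size_t n);
static void *my_memset_avx2 (void *s, int c, size_t n);

/* Resolver: run once by the dynamic linker, not on every call.
   __builtin_cpu_init is called explicitly because the resolver may run
   before ordinary constructors.  */
static memset_fn
resolve_my_memset (void)
{
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    return my_memset_avx2;
  return my_memset_sse2;
}

/* Callers see one symbol; after relocation each call is just the PLT jump.  */
void *my_memset (void *s, int c, size_t n)
     __attribute__ ((ifunc ("resolve_my_memset")));

/* Placeholder bodies; the real variants would be the tuned assembly.  */
static void *
my_memset_sse2 (void *s, int c, size_t n)
{
  unsigned char *p = s;
  while (n-- > 0)
    *p++ = (unsigned char) c;
  return s;
}

static void *
my_memset_avx2 (void *s, int c, size_t n)
{
  return my_memset_sse2 (s, c, n);
}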


