This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Remove unnecessary IFUNC dispatch for __memset_chk.


On Tue, Aug 11, 2015 at 03:06:53AM -0400, Mike Frysinger wrote:
> On 11 Aug 2015 07:55, Ondřej Bílka wrote:
> > On Mon, Aug 10, 2015 at 10:48:03PM -0400, Mike Frysinger wrote:
> > > On 10 Aug 2015 23:12, Ondřej Bílka wrote:
> > > > On Sun, Aug 09, 2015 at 11:09:20PM -0400, Mike Frysinger wrote:
> > > > > On 09 Aug 2015 14:24, Zack Weinberg wrote:
> > > > > > Is an IFUNC's variant-selecting function called only once per process,
> > > > > > or every time?
> > > > > 
> > > > > it's once-per-process.  if it were every time, it'd defeat the point of the
> > > > > optimization.
> > > >
> > > > No, its once per each shared library.
> > > 
> > > it depends on how you're counting.  i'm talking about each ifunc resolver -- it 
> > > only executes once per process.  yes, the overall lookup of ifunc relocs happens
> > > on a per-object basis, but it doesn't mean each resolver runs more than once.
> >
> > No, you are wrong.
> 
> i have no idea what your problem is, but you need to take it down a notch
>
Well, the problem is that your claim that an ifunc resolver is called only
once per process is completely and utterly wrong.
 
> > You should verify, not make guesses. It's clearly
> > called twice in the following program.
> 
> yes, once per PLT slot.  once the slot is resolved, it isn't executed again, on 
> a per-process basis.

While that is correct, it doesn't change the fact that the ifunc resolver
is called once per shared library that calls the function, since each
library has its own PLT slot.

The example that I sent in my last mail clearly shows that the resolver was
called twice, and if you add ten other libraries that call foo it would print
ifunc eleven times, so it's clear that the resolver was called more than once.
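
To make the counting argument concrete, here is a toy model in plain C. It
is only a sketch and not the real dynamic loader: struct object,
foo_resolver, call_foo and count_resolver_runs are hypothetical names, and
lazy binding is simulated with a NULL function pointer. Each "object" (the
executable or a shared library) owns one slot for foo, and the resolver
runs the first time a call goes through that slot.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 64

typedef int (*foo_fn)(void);

/* How many times the simulated resolver has run. */
int resolver_calls;

/* The implementation the resolver selects. */
int foo_impl(void)
{
    return 42;
}

/* Stands in for the IFUNC resolver: picks an implementation. */
foo_fn foo_resolver(void)
{
    resolver_calls++;
    return foo_impl;
}

/* One loaded object with its own slot for foo (NULL means unresolved). */
struct object {
    foo_fn foo_slot;
};

/* Call foo through one object's slot, resolving on first use only. */
int call_foo(struct object *obj)
{
    if (obj->foo_slot == NULL)
        obj->foo_slot = foo_resolver();
    return obj->foo_slot();
}

/* Call foo calls_each times from each of nobjects objects and return
   how many times the resolver ran. */
int count_resolver_runs(size_t nobjects, size_t calls_each)
{
    struct object objs[MAX_OBJECTS] = {{NULL}};

    assert(nobjects <= MAX_OBJECTS);
    resolver_calls = 0;
    for (size_t i = 0; i < nobjects; i++)
        for (size_t j = 0; j < calls_each; j++)
            (void)call_foo(&objs[i]);
    return resolver_calls;
}
```

In this model count_resolver_runs(1, 5) returns 1 and
count_resolver_runs(11, 3) returns 11: repeated calls from one object reuse
the resolved slot, while every additional object triggers one more resolver
run. Checking the real behaviour needs an ifunc symbol called from separately
built shared objects.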

> 
> > > > > > If we sent libc.so-internal calls to 'memset' through the PLT (as is
> > > > > > currently done for 'malloc') would that mean they were subject to IFUNC
> > > > > > dispatch?
> > > > > 
> > > > > it's a double edge sword.  we specifically want to avoid the PLT for two 
> > > > > reasons:
> > > > > (1) speed (PLT is slow)
> > > > > (2) interposition (we don't want someone exporting a memset symbol and then 
> > > > > internal glibc code calling that instead of our own version)
> > > >
> > > > No, as I wrote in 
> > > > 
> > > > [PATCH] x86-64: Remove plt bypassing of ifuncs. 
> > > > 
> > > > thats completely flawed analysis. In best case you could save few
> > > > cycles. As I looked on functions you would for most functions lose
> > > > at leat twenty cycles as differences between implementations are that
> > > > big.
> > > 
> > > it isn't a flawed analysis as i covered this explicitly in the part of my
> > > e-mail that you snipped
> > 
> > No, as you snipped essential part its flawed.
> 
> i have no idea what you thought you read, but cleary it isn't what i wrote.
> i'm not going to bother responding here anymore since it's completely pointless 

Could you at least clarify for others where you addressed my
criticism?

You started by making the following claim:

> > > > > it's a double edge sword.  we specifically want to avoid the PLT for two 
> > > > > reasons:
> > > > > (1) speed (PLT is slow)

If that is false then the rest of your argument falls apart. And I am
claiming it's false most of the time, so the rest of your discussion about
(1) isn't relevant and we just need to handle (2) by adding a new symbol.

Clearly, if you use PLT bypassing on a rarely used ifunc like strpbrk, which
I fixed, then you would save a few cycles on the call but may lose more than
a hundred because the function you call was not in cache while the one you
would get through the PLT was. So I just asked where your evidence is that
this doesn't happen.

I didn't respond to the rest of the analysis as it doesn't take benchmark
results into account. It's actually very easy to see the impact of PLT
bypassing there. For memset you have the problem that calloc(1024) could be
considerably faster; you just need to read the benchtests below. As
builtin_memset got compiled into jmp memset@plt, it shows that the overhead
isn't noticeable. The same goes for memcpy, which gets called in realloc
with big arguments. I could dig up more cases.

                                     builtin_memset  simple_memset  __memset_sse2  __memset_avx2

Length      1, alignment  0, c -65:         15.4062        7.17188        10.9375        10.2031
Length      2, alignment  0, c -65:            13.5        8.89062        11.3125        9.64062
Length      4, alignment  0, c -65:         12.9844        12.0938        11.0312        8.84375
Length      8, alignment  0, c -65:         11.7344        15.5469        10.6094        7.64062
Length     16, alignment  0, c -65:         15.0781        23.9688        10.3281        10.4219
Length     32, alignment  0, c -65:         14.7031        37.4219        9.57812        10.7031
Length     64, alignment  0, c -65:              14        80.8438         9.6875        9.21875
Length    128, alignment  0, c -65:         18.1094        137.812        15.3125        19.0781
Length    256, alignment  0, c -65:         15.2656        272.141        21.2656        11.7812
Length    512, alignment  0, c -65:         19.2656        502.469        34.3594        18.6562
Length   1024, alignment  0, c -65:         32.7188        940.766        63.6719        31.2812
Length   2048, alignment  0, c -65:         61.7188        1880.83            121        60.7812
Length   4096, alignment  0, c -65:         118.172         3718.7        239.469        118.641
Length   8192, alignment  0, c -65:         255.141        7373.38        469.422        252.125
Length  16384, alignment  0, c -65:         484.359        15742.4        1478.39        481.812
Length  32768, alignment  0, c -65:         990.562        29551.1        1978.39        966.047
Length  65536, alignment  0, c -65:         6163.86        64354.7        5663.06        5779.97
Length 131072, alignment  0, c -65:         12244.4         129994        11414.3        11640.7
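
One way to read the table is to subtract the direct-call timing from the
PLT-routed one. The sketch below does that for a few rows copied from the
table. It assumes that the memset@plt call made by builtin_memset resolved to
__memset_avx2 on the test machine, which the numbers suggest but the mail
does not state. struct row and plt_delta are hypothetical names, and the
values are in whatever units the benchmark reports.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table: timing for the call routed through memset@plt
   (builtin_memset) and for the direct call to __memset_avx2. */
struct row {
    int length;
    double builtin_memset;
    double memset_avx2;
};

/* Values copied from the table above, lengths 1024 to 8192. */
const struct row rows[] = {
    {1024, 32.7188, 31.2812},
    {2048, 61.7188, 60.7812},
    {4096, 118.172, 118.641},
    {8192, 255.141, 252.125},
};

/* PLT-routed timing minus direct timing, in the benchmark's own units.
   A negative value means the PLT-routed call measured faster. */
double plt_delta(const struct row *r)
{
    assert(r != NULL);
    return r->builtin_memset - r->memset_avx2;
}
```

For the 1024-byte row plt_delta gives about 1.4 against a total of about 31,
and for the 4096-byte row it is slightly negative, so at these sizes the
difference is within what looks like measurement noise.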


