This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Document use of IFUNC support outside of libc.


On 04/03/16 21:48, Carlos O'Donell wrote:
> On 03/04/2016 12:54 PM, Szabolcs Nagy wrote:
>> On 03/03/16 21:10, Carlos O'Donell wrote:
>>> I attempted to distill some of your notes here:
>>> https://sourceware.org/glibc/wiki/GNU_IFUNC
>>>
>>
>> thanks, i was meaning to write something about it on the wiki,
>> but it is a bit hard to separate the bugs from the features.
> 
> I think we should make this work sensibly for a sensible set
> of use cases. In particular we are probably going to have to
> explicitly document what is and is not supported, and what functions
> you can and can't call. I'm happy for IFUNC to exist for user
> code if we impose limits like: only access local variables,
> only call local functions, only use POD data types, only call
> the following glibc functions, etc. etc.
> 
>> i identified some issues:
>>
>> * the first point about bind now is not entirely correct,
>> lazy binding does not change that much.
> 
> Clarified. I agree the ordering doesn't change, all I wanted to
> do was provide some background about *why* on certain machines
> this fails.
> 
>> the reloc processing order at load time is:
>>
>> 1) DT_REL(A) relocs
>> 2) DT_REL(A) relocs that call ifunc resolvers
>> 3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
>> 4) DT_JMPREL relocs that call ifunc resolvers
> 
> This is the ordering per elf_dynamic_do_Rel right? Where
> we force IRELATIVE to be resolved after in every given
> group (but not across the groups e.g. 1) 3) 2) 4)).
> 

_ELF_DYNAMIC_DO_RELOC in elf/dynamic-link.h orders 1,2
before 3,4 and elf_dynamic_do_Rel in elf/do-rel.h orders
3 before 4.

3 before 4 is also guaranteed by binutils ld since
https://sourceware.org/bugzilla/show_bug.cgi?id=13302

i think 1 is ordered before 2 only in recent binutils ld
https://sourceware.org/bugzilla/show_bug.cgi?id=18841
(and it seems it was only fixed for x86, ppc and s390)

i think JUMP_SLOT relocs within 3 are also sorted by ld
such that STT_GNU_IFUNC symbols come last.
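
a rough sketch of the load time ordering discussed above, as a toy
model in c. the struct, its fields and the function names are made up
for illustration (this is not glibc or binutils code), and it only
models the four groups, not the ordering inside a group:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified view of one dynamic relocation. */
struct reloc {
    int in_jmprel;   /* nonzero: listed under DT_JMPREL, zero: DT_REL(A) */
    int calls_ifunc; /* nonzero: processing it runs an ifunc resolver */
};

/* Map a relocation to the group number 1..4 used in the list above. */
int reloc_phase(const struct reloc *r)
{
    return (r->in_jmprel ? 3 : 1) + (r->calls_ifunc ? 1 : 0);
}

/* qsort comparator: lower group numbers are processed first. */
static int compare_phase(const void *a, const void *b)
{
    int pa = reloc_phase(a);
    int pb = reloc_phase(b);
    return (pa > pb) - (pa < pb);
}

/* Reorder relocs so that groups are processed as 1, 2, 3, 4. */
void sort_into_load_order(struct reloc *relocs, size_t count)
{
    assert(relocs != NULL || count == 0);
    qsort(relocs, count, sizeof relocs[0], compare_phase);
}
```

in this model a resolver that runs in group 2 can only rely on
relocations from group 1 having been applied.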

>> (for example 1) can be data access through GOT, 2) is ifunc
>> resolved function address access through GOT, 3) is extern
>> function call, 4) is ifunc resolved function call that binds
>> locally e.g. static function with _IRELATIVE reloc.)
>>
>> the only difference between lazy binding and bind now is at
>> step 3): run time vs load time ifunc resolution.
> 
> Agreed.
> 
>> of course the ordering in 3) can break resolvers with bind
>> now that work with lazy binding, but the real problem is 2):
>> a resolver called there must only depend on relocs in 1).
> 
> I was thinking about this.
> 
> Would it be possible on ARM and PPC64 whose R_*_IRELATIVE
> relocs are in DT_REL* to reorder the processing in the dynamic
> loader? Resolve DT_JMPREL first then DT_REL*
> 
> That would give those machines feature parity with x86_64
> without needing to rewrite the relocations in binutils to
> handle this case?
> 

i haven't looked at non-x86 targets yet.

i think glibc dynlinker can do the relocs in arbitrary order
(the order is only observable through ifunc resolvers), but
the code might become ugly if there is arch dependent ordering.

>> it is still possible to call extern functions from an ifunc
>> resolver, but only if it is forced to use relocs in 1) (e.g.
>> call through a volatile funcptr or -fno-plt).  i'm not sure
>> if glibc wants to document this to work, because the user
>> needs to know about relocations (which is compiler/linker
>> internals).  the nasty part is that the compiler is free to
>> add extern calls (into libc or compiler runtime) which can
>> break the resolver so it cannot be written in c or c++ in
>> principle :(
> 
> Correct.
> 
> On x86 with multiversioning the compiler emits multiple clones
> of a function with different optimizations and selects based
> on cpuid results. To get the cpuid results the ifunc resolver
> emitted by the compiler calls into libgcc. As it is 
> implemented this multiversioning only works on x86 because of
> the relocation ordering.
> 
>> the dynamic linker could do the reloc ordering a bit better
>> (so e.g. 2) happens after 3) in case of lazy binding), but
>> i'm not sure how much that would help if potentially all
>> functions may be ifunc resolved in a module.
> 
> Could you expand on this a bit more? What would be the problem
> in having the dynamic loader do relocation processing in this
> order: 1) 3) 2) 4).
>  

the ordering does not fix the case when ifunc resolvers
reference ifunc resolved functions in the same module.
(because the relocs are not ordered according to ifunc
dependency)

otherwise i think it would make the most common cases work.
(both lazy and non-lazy binding, although lazy binding would
work in more cases)
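
to make the difference between the two orderings concrete, here is a
small toy check in c. it assumes bind now (every group is processed at
load time) and the enum and function names are invented for this
sketch, they do not exist in glibc:

```c
#include <assert.h>
#include <stdbool.h>

/* The four relocation groups from the list earlier in the thread. */
enum reloc_phase {
    REL_PLAIN = 1, REL_IFUNC = 2, JMPREL_PLAIN = 3, JMPREL_IFUNC = 4
};

/* Index of phase p within a 4-entry processing order. */
static int position_of(const enum reloc_phase order[4], enum reloc_phase p)
{
    for (int i = 0; i < 4; i++)
        if (order[i] == p)
            return i;
    assert(!"phase missing from order");
    return -1;
}

/* A resolver running in resolver_phase may rely on a relocation from
   dep_phase only if that whole group is processed strictly earlier.
   Two relocations in the same group are never considered safe, since
   relocs are not sorted by ifunc dependency within a group. */
bool dependency_is_safe(const enum reloc_phase order[4],
                        enum reloc_phase resolver_phase,
                        enum reloc_phase dep_phase)
{
    return position_of(order, dep_phase) < position_of(order, resolver_phase);
}
```

with the order 1, 2, 3, 4 a group 2 resolver cannot use a group 3
reloc, with 1, 3, 2, 4 it can, and in both cases a resolver that needs
another ifunc resolved symbol from its own group stays unsafe.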

>> * an omission from that wiki page is static linking:
>> ifunc resolvers run very early then (so memcpy etc work
>> during libc initialization), and that breaks stack-protection
>> etc instrumentation: the thread pointer is not yet set up.
> 
> I mentioned that?
> 
> "The resolver must not be compiled with -fstack-protector-all
> or any similar protections e.g. asan, since they may require
> early setup which has not yet completed."
> 
> I just didn't talk about static vs. dynamic, I just forbid it
> in general.
> 

sorry, indeed it is documented, but i wanted to note that it only
fails with static linking, and i think that is undesirable.
(with static linking the resolver runs before the thread pointer
is set up, so accessing errno or other tls would crash).

>> the vdso is not yet set up either and the vsyscall mechanism
>> uses ifunc now, so vdso does not work with static linking at
>> all (!) clock_gettime goes through a syscall (i think this is
>> a bug that can result in surprising perf regression for users
>> who expect speedup from static linking so i opened BZ 19767 ).
> 
> Agreed.
> 
>> i suspect there might be other limitations on resolvers
>> because ptr mangling is not set up either..
> 
> Maybe.
> 
>> probably static linking can be fixed by having two sets of
>> ifunc resolvers: one that only the libc uses and runs early
>> and another set that runs after some c runtime init is done
>> similar to the dynamic linked case.
> 
> Right.
> 
>> i actually would like to use vdso from ifunc resolvers
>> to do the ifunc dispatch based on information that is only
>> available in the kernel and cannot be easily communicated
>> through other means (e.g. sysfs stuff).
> 
> Sure. Examples needed.
>  

there seems to be interest in optimizations/dispatch based
on the micro architecture which is not easily available in
userspace currently (on aarch64).

linux exports various cpu info in /sys but that is not
a stable abi and users probably don't want a large number of
syscalls traversing the /sys tree at process startup just
to get a slightly better tuned memcpy or similar.

one idea by Adhemerval Zanella was to use vdso for this.
(the kernel can provide a versioned function symbol there
to return a pointer to some cpu info struct, which can be
read only thus shared across processes).
there is no proposed design for this yet either on kernel
or libc side, but it would make sense if ifunc could use it.

currently the only reliable mechanisms for ifunc dispatch
are hwcap feature bits (if passed as an argument) or a
cpuid-like instruction. (on aarch64 cpuid-like instructions
are not available to userspace, but can be emulated by the
kernel or provided as a syscall. in either case that means a
context switch into the kernel, which can be bad if a large
number of ifunc resolvers do it, e.g. because function
multi-versioning is implemented that way, unless there is
some caching mechanism, which is also not easy to do in
ifunc...)
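
for illustration, a hwcap based selection could look roughly like the
c sketch below. the feature bit, the two copy routines and the
function names are hypothetical, and the selector is written as a
plain function so it can be tried anywhere; hooking it up as a real
ifunc resolver is target specific and not shown:

```c
#include <stddef.h>

/* Hypothetical feature bit; real bits come from the kernel's hwcap
   definitions for the target. */
#define EXAMPLE_HWCAP_WIDE_COPY (1UL << 0)

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline implementation: plain byte loop. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a tuned variant; a real one would use wider accesses. */
static void *copy_wide(void *dst, const void *src, size_t n)
{
    return copy_bytes(dst, src, n);
}

/* Pick an implementation from the hwcap value alone: no globals, no
   extern calls, no tls, so nothing here depends on relocations that
   may not have been processed yet. */
copy_fn resolve_copy(unsigned long hwcap)
{
    return (hwcap & EXAMPLE_HWCAP_WIDE_COPY) ? copy_wide : copy_bytes;
}
```

the point of the shape is that the decision only uses the argument,
so no trip into the kernel and no caching is needed.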

>> * yet another issue is that the ifunc resolver type
>> signature is different on different targets.
> 
> This is really lame.
> 
>> (and if the user defined resolver takes no argument, but the
>> dynamic linker calls it with arguments that is not strictly
>> correct in c even if it happens to work for most call abis:
>> there were hardening proposals based on type signature checks
>> for indirect calls which the dynamic linker would violate).
> 
> Agreed, we need to fix this.
> 

i think it's not easy to fix: binutils and gcc already
have ifunc test cases (where resolvers take no argument)

most non-x86 archs take a hwcap argument, but in the
mips ifunc patch the resolver has 3 arguments.

>>> That way I can point users at this.
>>>
>>> In gperftools tcmalloc added an IFUNC use [1] which
>>> violates some of the requirements under -Wl,-z,now,
>>> so I have a need to document this support and discuss
>>> with tcmalloc developers what we might do. Right now
>>> they call way too much code for this to work.
>>>
>>> Cheers,
>>> Carlos.
>>>
>>> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
>>>
>> +static bool sized_delete_enabled(void) {
>> +  if (tcmalloc_sized_delete_enabled != 0) {
>> +    return !!tcmalloc_sized_delete_enabled();
>> +  }
>>
>> i think this call happens to work because the func address
>> check for the weak ref forces the reloc to happen at step 1).
> 
> OK.
> 
>> +  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
>> +  return tcmalloc::commandlineflags::StringToBool(flag, false);
>>
>> i think this will crash if the address of delete is used
>> (so ifunc resolver runs at step 2 while PLTGOT entries are
>> uninitialized) independently of binding lazy vs now.
>> with binding now it may crash without taking the address
>> of delete.
> 
> Right.
>  
>> i'll try to update the wiki, but will wait for some
>> feedback here for a while.
> 
> Thanks! Feel free to update the page!
> 
> Cheers,
> Carlos.
> 

