This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Removing longjmp error handling from the dynamic loader



On 06/03/2019 12:40, Rich Felker wrote:
> On Tue, Mar 05, 2019 at 05:37:09PM +0100, Florian Weimer wrote:
>> Currently, dynamic loader operations triggered by explicit function
>> calls (dlopen, dlsym, dlcose) wrapped in exception handlers, installed
>> using _dl_catch_error.  This function calls setjmp, and when an error is
>> raised the dynamic linker calls longjmp, resetting the call stack.
>>
>> This leads to strange bugs, such as undefined symbols in init/fini calls
>> being treated as non-fatal soft errors reported through dlerror
>> (bug 24304).  It also leads to a design where most dlopen failures are
>> not handled locally.  Instead they bubble up to the top-level dlopen
>> error handler installed using _dl_catch_error, which then attempts to
>> undo the changes of the  partially completed state to the global dynamic
>> linker state (see bug 20839 for a couple of problems with that).
>>
>> In the current scheme, more localized error handling is problematic
>> because it has high syntactic overhead: You need to define a struct for
>> argument data and a separate function that receives the data, and pass
>> both to _dl_catch_error.  There also could be a performance overhead if
>> individual malloc calls were protected in this way because each call to
>> _dl_catch_error incurs a call to setjmp.
>>
>> Does anyone think we should retain the current error handling scheme?
>> I personally do not have a problem with exceptions and stack unwinding,
>> but if this is what we want, we should use a DWARF-based unwinder and
>> GCC's exception handling features (the limited support in the C front
>> end is probably sufficient).
>>
>> The alternative to unwinding is an explicit struct dl_exception *
>> argument for functions which can fail, and use the return value to
>> indicate whether there was a fatal error.  This sometimes causes issues
>> where the return value is already used to signal both success and
>> non-fatal error (e.g., -1 for failure from open_verify, or NULL for
>> RTLD_NOLOAD failure from _dl_map_object_from_fd).
>>
>> There is some impact on <dl-machine.h> because the relocation processing
>> needs to change.  We can convert the relocation processing first to the
>> new scheme and continue to signal any errors using longjmp in the
>> generic code.  But supporting twice the number of relocation APIs for
>> incremental conversion of targets will still be difficult.  I think we
>> are still looking at one fairly large patch, given the number of
>> architectures we support, although the changes should just be a few
>> dozen lines per architecture.
>>
>> We cannot convert the generic code first because that would mean calling
>> setjmp each time before calling into architecture-specific code, which I
>> expect will be too problematic from a performance point of view.
>>
>> A third option is not use an explicit struct dl_exception * argument,
>> but a per-thread variable.  This will require changes to support TLS
>> (presumably the initial-exec variant) in the dynamic linker itself,
>> which is currently missing.  Since the exception pointer is only needed
>> in case of an error, using a TLS variable for it will avoid the overhead
>> of maintaining the explicit exception pointer argument and passing it
>> around.  Adding TLS support to the dynamic linker could be implemented
>> incrementally across architectures, but the conversion itself faces the
>> same flag day challenge as the explicit argument solution.  The explict
>> argument also ensures that places stick out where encoding fatal errors
>> in the return argument is difficult.  (A fourth option would compile the
>> dynamic linker twice and use the TLS-less version for the initial
>> loading of the executables and its dependencies.)
>>
>> From a source code and binary size point of view, there is not much
>> difference between using longjmp and the explicit function argument
>> approach on x86-64.
>>
>> Does anyone want to keep the longjmp approach?  Should I polish my patch
>> and extend it to cover all architectures?
> 
> I don't have a strong opinion (and maybe not enough information to
> have any opinion) on whether you keep longjmp or go with something
> else, but I don't think there's any fundamental reason you need to
> change this to fix current bugs. In musl, longjmp is used partly by
> historical accident (I don't recall fully, but I think it was a matter
> of adding dlopen to code that was originally written just for initial
> dynamic linking at program entry), and doesn't have problems like the
> ones you describe.
> 
> For the init/fini "soft errors" problem, it sounds to me like the code
> that runs the ctors should just be outside of the scope of the
> _dl_catch_error. If you've started running ctors, you're past the
> point where the operation can be backed out in any meaningful sense.
> I wonder if it's worse with ifunc, but I think not -- without
> bind_now, the ifunc resolvers don't even need to run before you pass
> the point of no return, and with bind_now, you'd be executing them in
> a context where resolver errors are still "hard" and can/should cause
> dlopen to fail.
> 
> Are there things I'm missing?
> 
> Rich
> 

Florian, it was not clear to me which approach you are actually working
on, explicit struct dl_exception *argument or per-thread variable. In
any case, can't we use Rich suggestion and move the logic starting
at dl_open_worker starting at _dl_init out of the function, and thus
out of _dl_catch_exception?


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]