This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Removing longjmp error handling from the dynamic loader

From: Florian Weimer <fweimer at redhat dot com>
To: libc-alpha at sourceware dot org
Date: Tue, 05 Mar 2019 17:37:09 +0100
Subject: Removing longjmp error handling from the dynamic loader

Currently, dynamic loader operations triggered by explicit function
calls (dlopen, dlsym, dlcose) wrapped in exception handlers, installed
using _dl_catch_error.  This function calls setjmp, and when an error is
raised the dynamic linker calls longjmp, resetting the call stack.

This leads to strange bugs, such as undefined symbols in init/fini calls
being treated as non-fatal soft errors reported through dlerror
(bug 24304).  It also leads to a design where most dlopen failures are
not handled locally.  Instead they bubble up to the top-level dlopen
error handler installed using _dl_catch_error, which then attempts to
undo the changes of the  partially completed state to the global dynamic
linker state (see bug 20839 for a couple of problems with that).

In the current scheme, more localized error handling is problematic
because it has high syntactic overhead: You need to define a struct for
argument data and a separate function that receives the data, and pass
both to _dl_catch_error.  There also could be a performance overhead if
individual malloc calls were protected in this way because each call to
_dl_catch_error incurs a call to setjmp.

Does anyone think we should retain the current error handling scheme?
I personally do not have a problem with exceptions and stack unwinding,
but if this is what we want, we should use a DWARF-based unwinder and
GCC's exception handling features (the limited support in the C front
end is probably sufficient).

The alternative to unwinding is an explicit struct dl_exception *
argument for functions which can fail, and use the return value to
indicate whether there was a fatal error.  This sometimes causes issues
where the return value is already used to signal both success and
non-fatal error (e.g., -1 for failure from open_verify, or NULL for
RTLD_NOLOAD failure from _dl_map_object_from_fd).

There is some impact on <dl-machine.h> because the relocation processing
needs to change.  We can convert the relocation processing first to the
new scheme and continue to signal any errors using longjmp in the
generic code.  But supporting twice the number of relocation APIs for
incremental conversion of targets will still be difficult.  I think we
are still looking at one fairly large patch, given the number of
architectures we support, although the changes should just be a few
dozen lines per architecture.

We cannot convert the generic code first because that would mean calling
setjmp each time before calling into architecture-specific code, which I
expect will be too problematic from a performance point of view.

A third option is not use an explicit struct dl_exception * argument,
but a per-thread variable.  This will require changes to support TLS
(presumably the initial-exec variant) in the dynamic linker itself,
which is currently missing.  Since the exception pointer is only needed
in case of an error, using a TLS variable for it will avoid the overhead
of maintaining the explicit exception pointer argument and passing it
around.  Adding TLS support to the dynamic linker could be implemented
incrementally across architectures, but the conversion itself faces the
same flag day challenge as the explicit argument solution.  The explict
argument also ensures that places stick out where encoding fatal errors
in the return argument is difficult.  (A fourth option would compile the
dynamic linker twice and use the TLS-less version for the initial
loading of the executables and its dependencies.)

>From a source code and binary size point of view, there is not much
difference between using longjmp and the explicit function argument
approach on x86-64.

Does anyone want to keep the longjmp approach?  Should I polish my patch
and extend it to cover all architectures?

Thanks,
Florian

Follow-Ups:
- Re: Removing longjmp error handling from the dynamic loader
  - From: Rich Felker
- Re: Removing longjmp error handling from the dynamic loader
  - From: Carlos O'Donell

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]