Bug 1319

Summary: dlsym/RTLD_NEXT is broken when more than 1 lib has the symbol.
Product: glibc Reporter: yann langlais <langlais>
Component: dynamic-link Assignee: Carlos O'Donell <carlos>
Status: REOPENED ---    
Severity: normal CC: carlos, drepper.fsp, fweimer, glibc-bugs, janne.karhunen, langlais
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
See Also: https://sourceware.org/bugzilla/show_bug.cgi?id=19509
Host: Target:
Build: Last reconfirmed:

Description yann langlais 2005-09-09 16:08:31 UTC
dlsym(RTLD_NEXT, "foo") is broken when more than one library contains the symbol "foo".

Here are 2 small programs and 4 libraries that demonstrate the faulty behaviour:
# the libraries:
for i in 1 2 3 4
do
cat > lib$i.c <<EOF
#include <stdio.h>
#include <dlfcn.h>
void foo() {
     void (*next_foo)(void);
     printf("lib$i.foo()\n");
     if (next_foo = (void (*)(void)) dlsym(RTLD_NEXT, "foo")) next_foo();
}

EOF
gcc -shared -fPIC lib$i.c -o lib$i.so -D_GNU_SOURCE
done
# the test program linking the libs at compile time:
cat > chain.c <<EOF
#include <dlfcn.h>
extern void foo();
int main() {
     foo();
     return 0;
}

EOF
gcc chain.c -o chain -L. -l1 -l2 -l3 -l4 -ldl
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./chain

#And the result :
#
# lib1.foo()
# lib2.foo()
# lib3.foo()
# lib4.foo()
#
# And now a runtime linking version:
#
cat > chain2.c <<EOF
#include <dlfcn.h>
int main() {
         void *l1, *l2, *l3, *l4;
         void (*bar)();
         l1 = dlopen("lib1.so", RTLD_NOW | RTLD_GLOBAL);
         l2 = dlopen("lib2.so", RTLD_NOW | RTLD_GLOBAL);
         l3 = dlopen("lib3.so", RTLD_NOW | RTLD_GLOBAL);
         l4 = dlopen("lib4.so", RTLD_NOW | RTLD_GLOBAL);
         bar = (void (*)()) dlsym(RTLD_DEFAULT, "foo");
         bar();
         dlclose(l4);
         dlclose(l3);
         dlclose(l2);
         dlclose(l1);
         return 0;
}

EOF
gcc chain2.c -o chain2 -ldl -D_GNU_SOURCE
#
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./chain2
#
# And the result :
#
# lib1.foo()
#

On other platforms (bsd/libtld, Sun/libdl) the result is:

lib1.foo()
lib2.foo()
lib3.foo()
lib4.foo()



Here is a more verbose version of chain2:


cat > chain5.c <<EOF
#include <stdio.h>
#include <dlfcn.h>
int main() {
         void *l1, *l2, *l3, *l4;
         void (*bar)();

         l1 = dlopen("lib1.so", RTLD_NOW | RTLD_GLOBAL);
         bar = (void (*)()) dlsym(l1, "foo");
                printf("l1 is %x l1.foo is %x\n", l1, bar);

         l2 = dlopen("lib2.so", RTLD_NOW | RTLD_GLOBAL);
         bar = (void (*)()) dlsym(l2, "foo");
                printf("l2 is %x l2.foo is %x\n", l2, bar);

         l3 = dlopen("lib3.so", RTLD_NOW | RTLD_GLOBAL);
         bar = (void (*)()) dlsym(l3, "foo");
                printf("l3 is %x l3.foo is %x\n", l3, bar);

         l4 = dlopen("lib4.so", RTLD_NOW | RTLD_GLOBAL);
         bar = (void (*)()) dlsym(l4, "foo");
                printf("l4 is %x l4.foo is %x\n", l4, bar);

         bar = (void (*)()) dlsym(l1, "foo");
                printf("l1.foo is %x \n", bar);
         bar = (void (*)()) dlsym(l2, "foo");
                printf("l2.foo is %x \n", bar);
         bar = (void (*)()) dlsym(l3, "foo");
                printf("l3.foo is %x \n", bar);
         bar = (void (*)()) dlsym(l4, "foo");
                printf("l4.foo is %x \n", bar);

         bar = (void (*)()) dlsym(RTLD_DEFAULT, "foo");
                printf("default foo is %x \n", bar);

         bar();
         dlclose(l4);
         dlclose(l3);
         dlclose(l2);
         dlclose(l1);
         return 0;
}

EOF
gcc chain5.c -o chain5 -ldl -D_GNU_SOURCE

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./chain5

And the result looks like this :

l1 is 804a018 l1.foo is 5556d584
l2 is 804a3b0 l2.foo is 55570584
l3 is 804a710 l3.foo is 55572584
l4 is 804aa70 l4.foo is 55574584
l1.foo is 5556d584
l2.foo is 55570584
l3.foo is 55572584
l4.foo is 55574584
default foo is 5556d584
lib1.foo()

Running the chain2 program with LD_DEBUG="files symbols" gives :

      14129:     symbol=dlsym;  lookup in file=./chain2
      14129:     symbol=dlsym;  lookup in file=/lib/tls/libdl.so.2
      14129:     symbol=_dl_sym;  lookup in file=./chain2
      14129:     symbol=_dl_sym;  lookup in file=/lib/tls/libdl.so.2
      14129:     symbol=_dl_sym;  lookup in file=/lib/tls/libc.so.6
      14129:     symbol=foo;  lookup in file=./chain2
      14129:     symbol=foo;  lookup in file=/lib/tls/libdl.so.2
      14129:     symbol=foo;  lookup in file=/lib/tls/libc.so.6
      14129:     symbol=foo;  lookup in file=/lib/ld-linux.so.2
      14129:     symbol=foo;  lookup in file=lib1.so

--> foo is found in lib1.so

lib1.foo()
      14129:     symbol=foo;  lookup in file=/lib/tls/libc.so.6
      14129:     symbol=foo;  lookup in file=/lib/ld-linux.so.2

--> Then all other lib$i.so are ignored :

      14129:     symbol=dlclose;  lookup in file=./chain2
      14129:     symbol=dlclose;  lookup in file=/lib/tls/libdl.so.2
      14129:     symbol=_dl_close;  lookup in file=./chain2
      14129:     symbol=_dl_close;  lookup in file=/lib/tls/libdl.so.2
      14129:     symbol=_dl_close;  lookup in file=/lib/tls/libc.so.6
      14129:

According to the dlopen and dlsym entries of the "Open Group Base Specifications
Issue 6, IEEE Std 1003.1, 2004 Edition", this should be a bug.

Feel free to contact me for any info.

Regards.
Yann LANGLAIS.
Comment 1 Jakub Jelinek 2005-09-09 16:58:51 UTC
POSIX just reserves RTLD_NEXT for future use, nothing more.
Comment 2 yann langlais 2005-09-09 17:41:00 UTC
According to http://www.opengroup.org/onlinepubs/009695399/toc.htm it is indeed
reserved for future use : 
>>>>>>>>>>>>>>>>>>
APPLICATION USAGE
        Special purpose values for handle are reserved for future use. These values and their meanings are:

        RTLD_DEFAULT
                The symbol lookup happens in the normal global scope; that is, a search for a symbol using this handle would find the same definition as a direct use of this symbol in the program code.

        RTLD_NEXT
                Specifies the next object after this one that defines name. This one refers to the object containing the invocation of dlsym(). The next object is the one found upon the application of a load order symbol resolution algorithm (see dlopen()). The next object is either one of global scope (because it was introduced as part of the original process image or because it was added with a dlopen() operation including the RTLD_GLOBAL flag), or is an object that was included in the same dlopen() operation that loaded this one.

                The RTLD_NEXT flag is useful to navigate an intentionally created hierarchy of multiply-defined symbols created through interposition. For example, if a program wished to create an implementation of malloc() that embedded some statistics gathering about memory allocations, such an implementation could use the real malloc() definition to perform the memory allocation, and itself only embed the necessary logic to implement the statistics gathering function.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<                        
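The interposition pattern described in the quoted text can be sketched as follows. This is only an illustration under assumptions: it wraps getpid() rather than malloc(), since an interposed malloc() needs extra care if dlsym() itself allocates, and the counter name getpid_calls is made up for this example.

```c
#define _GNU_SOURCE          /* must precede every #include to expose RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical statistic for illustration: how often the wrapper ran. */
static unsigned long getpid_calls;

/* Interposed definition. Because it lives in this object, callers in
   this object bind to it instead of the C library's getpid(). */
pid_t getpid(void)
{
    static pid_t (*real_getpid)(void);

    getpid_calls++;
    if (real_getpid == NULL) {
        /* Ask for the next definition of "getpid" after this object. */
        real_getpid = (pid_t (*)(void)) dlsym(RTLD_NEXT, "getpid");
        if (real_getpid == NULL)
            return (pid_t) -1;   /* no further definition was found */
    }
    return real_getpid();
}
```

Depending on the glibc version, linking may need -ldl. Whether a chain of several such wrappers in different libraries is walked completely is exactly what this report is about.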
        
But it *IS* documented in the man page, with no mention of being "reserved
for future use".
So the bug is either in the code or in the documentation.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Extract of "man dlsym" :
   dlsym
       The function dlsym() takes a "handle" of a dynamic library returned by dlopen and the NUL-terminated symbol name, returning the address where that symbol is loaded into memory. If the symbol is not found, in the specified library or any of the libraries that were automatically loaded by dlopen() when that library was loaded, dlsym() returns NULL. (The search performed by dlsym() is breadth first through the dependency tree of these libraries.) Since the value of the symbol could actually be NULL (so that a NULL return from dlsym() need not indicate an error), the correct way to test for an error is to call dlerror() to clear any old error conditions, then call dlsym(), and then call dlerror() again, saving its return value into a variable, and check whether this saved value is not NULL.

       There are two special pseudo-handles, RTLD_DEFAULT and RTLD_NEXT. The former will find the first occurrence of the desired symbol using the default library search order. The latter will find the next occurrence of a function in the search order after the current library. This allows one to provide a wrapper around a function in another shared library.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
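The error-checking sequence described in the man page extract (clear with dlerror(), call dlsym(), check dlerror() again) could look roughly like this. The helper name lookup_checked and its signature are made up for this sketch; only dlsym() and dlerror() are real library calls.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: look up `name` through `handle`.
   On return, *err is NULL if the lookup succeeded, otherwise it points
   to the message reported by dlerror(). */
void *lookup_checked(void *handle, const char *name, const char **err)
{
    void *addr;

    (void) dlerror();            /* clear any old error condition */
    addr = dlsym(handle, name);  /* a NULL result alone is not an error */
    *err = dlerror();            /* non-NULL only if dlsym() failed */
    return addr;
}
```

A caller would test *err rather than the returned address, for example after lookup_checked(dlopen(NULL, RTLD_NOW), "foo", &err).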
Comment 3 Jakub Jelinek 2005-09-13 13:51:04 UTC
That's not normative.
Comment 4 yann langlais 2005-09-20 08:43:56 UTC
The New version of the LSB (3.0.0) makes the following statement about dlsym
RTLD_NEXT :
"The value RTLD_NEXT, which is reserved for future use shall be available, with
the behavior as described in ISO POSIX (2003)."

http://refspecs.freestandards.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/baselib-dlsym-1.html

The current behaviour is incompatible with the LSB 3.0.0 statement (according
to the informative part of the POSIX description of dlsym()).
Comment 5 Ulrich Drepper 2005-09-23 20:29:21 UTC
If the LSB specifies something different, they are wrong.  File a bug with them.
Comment 6 yann langlais 2005-09-23 22:09:32 UTC
LSB specifies the same thing that POSIX says in its descriptive part.
Implementation of libc differs from what LSB and POSIX say.
Then what ? YOU are right supporting something that doesn't behave as specified
??? Did you take 3 minutes to understand what the problem is all about ?

What is the problem with you guys ?

Your political points of view are far away from most of unix programmers Ulrich!

Clearly lsb board is NOT perfect. But are you the perfection ? The point is that
IT has at the very least the good point of existing. And by the way, your own
company has its word to say since it IS part of lsb effort as a Gold Member.

That's kind of what we call "to spit in the soup". 
Comment 7 yann langlais 2005-09-23 22:30:50 UTC
Ok. Since this place is supposed to be a bug filing place (BUG-ZILLA) and not a
dumb troll about politically correct way to call a bee a bee, let me rephrase
the BUG description:

According to :
1/ Posix dlsym RTLD_NEXT descriptive part.
2/ LSB (even prior 2.0) description of dlsym RTLD_NEXT that agrees with 1/
3/ dlsym man page RTLD_NEXT section

The behavior of the following stuff is *INCORRECT* :

for i in 1 2 3 4
do
cat > lib$i.c <<EOF
#include <stdio.h>
#include <dlfcn.h>
void foo() {
     void (*next_foo)(void);
     printf("lib$i.foo()\n");
     if (next_foo = (void (*)(void)) dlsym(RTLD_NEXT, "foo")) next_foo();
}

EOF
gcc -shared -fPIC lib$i.c -o lib$i.so -D_GNU_SOURCE
done
cat > chain2.c <<EOF
#include <dlfcn.h>
int main() {
         void *l1, *l2, *l3, *l4;
         void (*bar)();
         l1 = dlopen("lib1.so", RTLD_NOW | RTLD_GLOBAL);
         l2 = dlopen("lib2.so", RTLD_NOW | RTLD_GLOBAL);
         l3 = dlopen("lib3.so", RTLD_NOW | RTLD_GLOBAL);
         l4 = dlopen("lib4.so", RTLD_NOW | RTLD_GLOBAL);
         bar = (void (*)()) dlsym(RTLD_DEFAULT, "foo");
         bar();
         dlclose(l4);
         dlclose(l3);
         dlclose(l2);
         dlclose(l1);
         return 0;
}

EOF
gcc chain2.c -o chain2 -ldl -D_GNU_SOURCE
#
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./chain2


Could you please:
1/ Correct the bug in dlsym
OR 
2/ Correct :
   - dlsym MANPAGE 
   - LSB from at least 1.2
   - POSIX dlsym RTLD_NEXT descriptive section

Sorry to bother you with this, and many thanks in advance.
Comment 8 Ulrich Drepper 2005-09-24 00:41:10 UTC
There is no bug since there is *nowhere* a description what RTLD_NEXT is
supposed to do.
Comment 9 yann langlais 2005-09-24 07:07:53 UTC
(In reply to comment #8)
> There is no bug since there is *nowhere* a description what RTLD_NEXT is
> supposed to do.

THERE IS A BUG SINCE THE MAN PAGE DOES NOT CONFORM TO THE BEHAVIOUR OF LIBDL.SO

PLEASE FIX AT LEAST THE MAN PAGE OF DLSYM BY REMOVING THE DESCRIPTION OF RTLD_NEXT
Comment 10 Ulrich Drepper 2005-09-24 14:21:39 UTC
The man pages are not part of glibc.  If you reopen this bug again or file a new
one for the same problem I have no choice but to block your access.
Comment 11 Florian Weimer 2016-01-21 18:00:56 UTC
It's puzzling why this isn't a bug.  I suspect RTLD_NEXT implements dependency order instead of load order.  Even back in 2005, we could no longer change that, and the only way to deal with this is to document the discrepancy with other systems.
Comment 12 Carlos O'Donell 2016-01-24 04:18:44 UTC
(In reply to Florian Weimer from comment #11)
> It's puzzling why this isn't a bug.  I suspect RTLD_NEXT implements
> dependency order instead of load order.  Even back in 2005, we could no
> longer change that, and the only way to deal with this is to document the
> discrepancy with other systems.

POSIX requires dlsym to implement dependency ordering; it's written into the standard. See the text in dlopen:
~~~
...
With the exception of the global symbol object obtained via a dlopen() operation on a file of 0, dependency ordering is used by the dlsym() function. Load ordering is used in dlsym() operations upon the global symbol object.
...
~~~

I'm reopening this because regardless of the fact that POSIX only reserves RTLD_NEXT we are basically implementing a Solaris feature and need to reconsider exactly what RTLD_NEXT does when we have more than 1 library with the symbol.

It seems entirely reasonable to be able to walk the entire hierarchy of defined symbols instead of stopping at the first definition.
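
When comparing orderings, it can help to see which objects the loader currently knows about. A minimal sketch, assuming glibc's dl_iterate_phdr() is available; the helper names print_object and list_loaded_objects are made up here, and the list shown is the loader's own list of objects, not necessarily the scope that dlsym() searches:

```c
#define _GNU_SOURCE          /* must precede every #include for dl_iterate_phdr on glibc */
#include <link.h>
#include <stddef.h>
#include <stdio.h>

/* Callback invoked once per loaded object; prints its name and counts it. */
static int print_object(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;

    (void) size;
    /* The main program is reported with an empty name. */
    printf("%d: %s\n", *count,
           info->dlpi_name[0] != '\0' ? info->dlpi_name : "(main program)");
    (*count)++;
    return 0;                /* a non-zero return would stop the iteration */
}

/* Hypothetical helper: print every loaded object, return how many there are. */
int list_loaded_objects(void)
{
    int count = 0;

    dl_iterate_phdr(print_object, &count);
    return count;
}
```

Calling list_loaded_objects() after each dlopen() in a program like chain2 would show the libraries being added, which can then be compared against the LD_DEBUG lookup trace.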
Comment 13 Florian Weimer 2016-01-25 16:24:43 UTC
I'm still worried that if we change the search results now, we break applications.

This is less of a concern if after the bug fix, we only find *additional* symbols, but if we return *different* symbols, that seems quite risky.
Comment 14 Carlos O'Donell 2016-01-25 16:28:09 UTC
(In reply to Florian Weimer from comment #13)
> I'm still worried that if we change the search results now, we break
> applications.
> 
> This is less of a concern if after the bug fix, we only find *additional*
> symbols, but if we return *different* symbols, that seems quite risky.

Agreed, we'll need regression test cases for every minute change we make here. It's certainly a project that will need a lot more testing. I'm considering automation to create complete DAGs for all DSO deps, recording the ordering, and then using that to drive some comparison while I change the sort algorithms.
Comment 15 Janne Karhunen 2016-02-24 06:57:06 UTC
Regarding ordering, I have a small testcase that shows that the libc symbol found by RTLD_NEXT is not always the default one (pthread_cond_* differs). In other words, anything interposing pthread_cond_* would always have to use dlvsym to get the correct symbols.
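
A rough sketch of what such a lookup might look like, assuming glibc's dlvsym(). The helper name next_symbol is invented for this example, and any version string passed to it is an assumption that has to match what the target libc actually exports:

```c
#define _GNU_SOURCE          /* must precede every #include for dlvsym/RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: look up the next definition of `name`, preferring
   the given symbol version. If `version` is NULL or that version is not
   found, fall back to the plain RTLD_NEXT lookup. */
void *next_symbol(const char *name, const char *version)
{
    void *addr = NULL;

    if (version != NULL)
        addr = dlvsym(RTLD_NEXT, name, version);
    if (addr == NULL)
        addr = dlsym(RTLD_NEXT, name);
    return addr;
}
```

An interposer for pthread_cond_signal would pass the version its callers were linked against; the exact string differs between architectures and should be checked against the installed libc before relying on it.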