Bug 16805 - dynamic library not getting reinitialized on multiple calls to dlopen()
Summary: dynamic library not getting reinitialized on multiple calls to dlopen()
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-03 13:39 UTC by Tim Moloney
Modified: 2017-06-16 15:41 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Example showing failure to initialize a dynamic library after multiple calls to dlopen(). (756 bytes, application/x-compressed-tar)
2014-04-03 13:39 UTC, Tim Moloney
Details
bison grammars which reproduce the bug (1.03 KB, application/x-tar)
2015-02-28 17:30 UTC, Askar Safin
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tim Moloney 2014-04-03 13:39:41 UTC
Created attachment 7525 [details]
Example showing failure to initialize a dynamic library after multiple calls to dlopen().

If a dynamic library is loaded multiple times via dlopen(), subsequent loads do not correctly initialize static variables under the following conditions:
1) A class/struct with a constructor
2) with an inlined function
3) containing a static variable.
Please run the attached example.
  # tar xf gcc_static_issue.tgz
  # cd gcc_static_issue
  # make
  # ./test_static
  Expected behavior (as on RHEL5, g++ 4.1.2):
  Type 'q' to exit or enter to reload/run the DLL
  count:1
  
  count:1
  
  count:1
  
  count:1
  
  count:1
  q
  #
Actual behavior (as on RHEL6 and RHEL7beta, g++ 4.4.7 and 4.8.2, respectively):
  Type 'q' to exit or enter to reload/run the DLL
  count:1
  
  count:2
  
  count:3
  
  count:4
  
  count:5
  q
  #

Please see gcc bug #60731 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60731).
Comment 1 Andreas Schwab 2014-04-07 08:44:12 UTC
dlclose only states an intent, there is no requirement that it removes the object.
Comment 2 Dave Johansen 2014-04-22 15:42:13 UTC
It's unclear what the plan of action is from that last comment. Is this issue going to be fixed?

I realize that dlclose() only states an intent, but the fact remains that glibc/gcc is generating code that is doing the wrong thing. The same code built with clang++, the Intel C++ compiler and older versions of g++ do the right thing and I don't see why more current versions of gcc shouldn't do the right thing as well.

From reading the comments on the gcc bugzilla and revision history of this change, it appears that it was made as an optimization to improve performance. Optimizations are great but they shouldn't be allowed to stick around if they introduce incorrect behavior and in this case it appears that this is a regression and that the optimization needs to be fixed or reverted.
Comment 3 Ondrej Bilka 2014-04-23 08:47:14 UTC
On Tue, Apr 22, 2014 at 03:42:13PM +0000, davejohansen at gmail dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=16805
> 
> --- Comment #2 from Dave Johansen <davejohansen at gmail dot com> ---
> It's unclear what the plan of action is from that last comment. Is this issue
> going to be fixed?
> 
> I realize that dlclose() only states an intent, but the fact remains that
> glibc/gcc is generating code that is doing the wrong thing. The same code built
> with clang++, the Intel C++ compiler and older versions of g++ do the right
> thing and I don't see why more current versions of gcc shouldn't do the right
> thing as well.
>
That is problem at gcc part, why it added unique symbol there.
 
> >From reading the comments on the gcc bugzilla and revision history of this
> change, it appears that it was made as an optimization to improve performance.
> Optimizations are great but they shouldn't be allowed to stick around if they
> introduce incorrect behavior and in this case it appears that this is a
> regression and that the optimization needs to be fixed or reverted.
>
Problem is that this is not optimization but by lack of optimization, so
it could not be easily reverted.

What Jason proposed needs change in linker and migth not work, I will
ask in gcc bug for clarification.
Comment 4 Jason Merrill 2014-06-13 16:38:14 UTC
(In reply to Dave Johansen from comment #2)
> From reading the comments on the gcc bugzilla and revision history of this
> change, it appears that it was made as an optimization to improve
> performance.

It's not an optimization, it's necessary for correctness in the situation where you have

plugin A
  weak foo
plugin B
  weak foo
library C
  weak foo

and plugins A and B depend on library C.  When the program loads A with RTLD_LOCAL, references to foo in A and C bind to the definition in A.  When the program then loads B, references to foo in B bind to the definition in B, but references to foo in C are still bound to the definition in A.  So if foo is a variable, B and C are looking at different copies, leading to chaos.

Marking foo as STB_GNU_UNIQUE fixes this by making references in B resolve to the definition in A.

I had suggested (in https://www.sourceware.org/ml/libc-alpha/2002-05/msg00222.html) that references in A and B should bind to the definition in C instead, but Ulrich rejected that idea.
Comment 5 Ondrej Bilka 2014-06-20 10:38:00 UTC
On Fri, Jun 13, 2014 at 04:38:14PM +0000, jason at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=16805
> 
> Jason Merrill <jason at redhat dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jason at redhat dot com
> 
> --- Comment #4 from Jason Merrill <jason at redhat dot com> ---
> (In reply to Dave Johansen from comment #2)
> > From reading the comments on the gcc bugzilla and revision history of this
> > change, it appears that it was made as an optimization to improve
> > performance.
> 
> It's not an optimization, it's necessary for correctness in the situation where
> you have
> 
> plugin A
>   weak foo
> plugin B
>   weak foo
> library C
>   weak foo
> 
> and plugins A and B depend on library C.  When the program loads A with
> RTLD_LOCAL, references to foo in A and C bind to the definition in A.  When the
> program then loads B, references to foo in B bind to the definition in B, but
> references to foo in C are still bound to the definition in A.  So if foo is a
> variable, B and C are looking at different copies, leading to chaos.
> 
> Marking foo as STB_GNU_UNIQUE fixes this by making references in B resolve to
> the definition in A.
> 
> I had suggested (in
> https://www.sourceware.org/ml/libc-alpha/2002-05/msg00222.html) that references
> in A and B should bind to the definition in C instead, but Ulrich rejected that
> idea.
>
You could try to resend it, it may be accepted now.
Comment 6 Askar Safin 2015-02-28 17:30:10 UTC
Created attachment 8156 [details]
bison grammars which reproduce the bug
Comment 7 Askar Safin 2015-02-28 17:40:35 UTC
This is real bug. Here is my testcase where this bug can cause very very mysterious software failure.

I attached p.tar .

Steps to build and run:
flex --header-file=lex1.yy.hpp -o lex1.yy.cpp scanner1.lpp && flex --header-file=lex2.yy.hpp -o lex2.yy.cpp scanner2.lpp && bison -do parser1.tab.cpp parser1.ypp && bison -do parser2.tab.cpp parser2.ypp && g++ -shared -fPIC -o parser1.so -include lex1.yy.cpp parser1.tab.cpp && g++ -shared -fPIC -o parser2.so lex2.yy.cpp parser2.tab.cpp && g++ main.cpp -ldl && ./a.out

In this example I build bison/flex c++ parser as "parser1.so" and then build some another parser as "parser2.so"

The I load parser1 to main program and then unload it. The I load parser2 to main program and run it. But it reports syntax error (and there really no any syntax error). If you remove lines for loading parser1 you will see that parser2 stops to report syntax error.

So, loading parser1 someway affects parser2.

Moreover, Tim Moloney said that this bug reproduces only if class+inline+static. Okey, let's add some "sed" command to remove static var from some inline function:

flex --header-file=lex1.yy.hpp -o lex1.yy.cpp scanner1.lpp && flex --header-file=lex2.yy.hpp -o lex2.yy.cpp scanner2.lpp && bison -do parser1.tab.cpp parser1.ypp && bison -do parser2.tab.cpp parser2.ypp && sed -i 's/^ *static$//' parser1.tab.cpp parser2.tab.cpp && g++ -shared -fPIC -o parser1.so -include lex1.yy.cpp parser1.tab.cpp && g++ -shared -fPIC -o parser2.so lex2.yy.cpp parser2.tab.cpp && g++ main.cpp -ldl && ./a.out

Now there is no bug. (And this is absolutely impossible to guess that one need this "sed" hack.)

(I found this bug because I write some program which generates bison parsers in run-time and then loads them. I have to add this "sed -i" hack to my program because of this bug and this is the most hacky hack in my life)

My system is Debian GNU/Linux 8. glibc 2.19, gcc 4.9.1, flex 2.5.39, bison 3.0.2
Comment 8 Romain Geissler 2017-06-16 15:41:01 UTC
Hi,

Sorry for digging into this old bug report, but we have just noticed we were also affected after a glibc/gcc upgrade.

Jason Merrill has described the rational behind GNU unique symbols in https://www.sourceware.org/ml/libc-alpha/2002-05/msg00222.html, and indeed the introduction of this GNU extension makes sense.

Yet what I don't understand is the piece of code that does flag a dlopen'ed library as NODELETE if one relocation is resolve into this library. In current trunk it is here: https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-lookup.c;h=3d2369dbf2b7ca219eaf80a820e2a8e1329fbf50;hb=92bd70fb85bce57ac47ba5d8af008736832c955a#l332

In this old 2002 post, Jason mentions solution #1 and #2, both of them being needed to work well.

#1 is "Always prefer the last weak definition if no strong definition is seen.". From what I see here: https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-lookup.c;h=3d2369dbf2b7ca219eaf80a820e2a8e1329fbf50;hb=92bd70fb85bce57ac47ba5d8af008736832c955a#l526 this is already the case in the trunk for years (at least this is the comment says, I am not sure this is what is really implemented).

#2 is "If a DSO A has two unrelated dependencies B and C which both define (and
   use) the same weak symbol, add C to the dependency list of this loaded
   copy of B."

Isn't this the point of "add_dependency" called here https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-lookup.c;h=3d2369dbf2b7ca219eaf80a820e2a8e1329fbf50;hb=92bd70fb85bce57ac47ba5d8af008736832c955a#l929 after the call to "do_lookup_x" which handles unique symbols in a specific way ?

My point is, unique symbols do create some bindings between libraries that normally should not be bound together. I am fine with that. However if I do unload all these dlopen'ed library, I would expect all dynamically dependencies added between the two RTLD_LOCAL object to be removed, and eventually all the libraries to be fully unloaded (with static destructor called). I am fine with "dlclose" not closing the library if another dlopened library has some symbols resolved into the first one, but then a second call to "dlclose" on the second library should unload both libraries.

So am I right thinking that marking libraries where unique relocation are resolved to as NODELETE is wrong ? All the mechanism is already there to properly handle the closing of libraries at the right time thanks to "add_dependency", isn't it ?

Cheers,
Romain