Created attachment 9795 [details]
Minimal Example for the dlopen bug.

During the development of a plugin framework for LAPACK functions I ran into trouble with dlopen/dlsym, which seems to compute a wrong address or jump to a wrong address. The problem setup is the following (a short description of the attached code):

I open the LAPACK library (at least version 3.6.1) using dlopen and search for the symbols "dgetrf", "dgetrf2" and "dgetf2" at start-up of the application using the __attribute__((constructor)) mechanism. For each of the three symbols I have a wrapping function with exactly the same name and the same binary interface. Running the code on x86-64, everything works fine. If I now switch to ppc64le (OpenPower8 - triple: powerpc64le-linux-gnu) it crashes and I get the following backtrace in gdb:

#0  0x000000000074696c in ?? ()
#1  0x00003fffb76af954 in dgetrf2_ () from /home/k/lapacktest/lib64/liblapack.so
#2  0x00003fffb7f60c1c in dgetrf2_ (n=0x3fffffffec14, m=0x3fffffffe3f4, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffe3f0) at liblapack-calls.c:31
#3  0x00003fffb76af8c8 in dgetrf2_ () from /home/k/lapacktest/lib64/liblapack.so
#4  0x00003fffb7f60c1c in dgetrf2_ (n=0x3fffffffec14, m=0x3fffffffe594, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffe590) at liblapack-calls.c:31
#5  0x00003fffb76af8c8 in dgetrf2_ () from /home/k/lapacktest/lib64/liblapack.so
#6  0x00003fffb7f60c1c in dgetrf2_ (n=0x3fffffffec14, m=0x3fffffffe734, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffe730) at liblapack-calls.c:31
#7  0x00003fffb76af8c8 in dgetrf2_ () from /home/k/lapacktest/lib64/liblapack.so
#8  0x00003fffb7f60c1c in dgetrf2_ (n=0x3fffffffec14, m=0x3fffffffe8d4, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffe8d0) at liblapack-calls.c:31
#9  0x00003fffb76af8c8 in dgetrf2_ () from /home/k/lapacktest/lib64/liblapack.so
#10 0x00003fffb7f60c1c in dgetrf2_ (n=0x3fffffffec14, m=0x3fffffffea74, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffea70) at liblapack-calls.c:31
#11 0x00003fffb76af8c8 in dgetrf2_ () from /home/k/lapacktest/lib64/liblapack.so
#12 0x00003fffb7f60c1c in dgetrf2_ (n=0x3fffffffec14, m=0x3fffffffec0c, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffec04) at liblapack-calls.c:31
#13 0x00003fffb76aee54 in dgetrf_ () from /home/k/lapacktest/lib64/liblapack.so
#14 0x00003fffb7f60b8c in dgetrf_ (n=0x3fffffffeda0, m=0x3fffffffeda0, A=0x10046830, lda=0x3fffffffeda0, ipiv=0x1005a0c0, info=0x3fffffffeda4) at liblapack-calls.c:25
#15 0x0000000010000d70 in main (argc=2, argv=0x3ffffffff1d8) at lapack-test.c:58

where in frame #0 the address 0x000000000074696c is wrong. Correct would be 0x00003fffb7f60c1c as in frame #2. With some work I found out that, when the call is performed inside the PLT, a wrong address seems to be computed.

I tried the code with two different versions of glibc and gcc. The first one was CentOS 7.3 with glibc 2.17:

gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/ppc64le-redhat-linux/4.8.5/lto-wrapper
Target: ppc64le-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-ppc64le-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-ppc64le-redhat-linux/cloog-install --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --enable-targets=powerpcle-linux --disable-multilib --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)

ld -v:
GNU ld version 2.25.1-22.base.el7

The second one was Debian Stretch with glibc 2.24:

gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/6/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.3.0-5' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=powerpc64le-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-ppc64el/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-ppc64el --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-ppc64el --with-arch-directory=ppc64le --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc=auto --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 6.3.0 20170124 (Debian 6.3.0-5)

ld -v:
GNU ld (GNU Binutils for Debian) 2.27.90.20170124

Both systems are running Linux 4.8.6-300.el7.centos.ppc64le, but a second test confirmed that it also happens on 3.11, which originally shipped with CentOS.

The attached tarball contains the minimal non-working example from my systems. Running make compiles LAPACK 3.6.1 and the example; "make run" executes the example and "make gdb" starts it in gdb.
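
For readers without the attachment, the setup described above can be sketched roughly as follows. This is not the attached code, only a minimal illustration under assumptions: the library name "liblapack.so", the helper resolve_symbol, and the placeholder error value are all hypothetical, and only the dgetrf2_ wrapper is shown.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Fortran calling convention: every argument is passed by reference. */
typedef void (*dgetrf2_fn)(int *m, int *n, double *A, int *lda,
                           int *ipiv, int *info);

/* Pointer to the real routine; stays NULL if the library is missing. */
static dgetrf2_fn real_dgetrf2 = NULL;

/* Hypothetical helper: open `lib` and look up `name`; NULL on any failure. */
void *resolve_symbol(const char *lib, const char *name)
{
    assert(lib != NULL && name != NULL);
    void *handle = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return NULL;
    /* The handle is deliberately never closed in this sketch, so the
     * returned address stays valid for the lifetime of the process. */
    return dlsym(handle, name);
}

/* Runs before main(): resolve the real symbol once at start-up. */
__attribute__((constructor))
static void lapack_wrapper_init(void)
{
    /* "liblapack.so" is an assumed name; adjust to the local install. */
    real_dgetrf2 = (dgetrf2_fn) resolve_symbol("liblapack.so", "dgetrf2_");
}

/* Wrapper with the same name and binary interface as the LAPACK routine. */
void dgetrf2_(int *m, int *n, double *A, int *lda, int *ipiv, int *info)
{
    if (real_dgetrf2 == NULL) {
        *info = -1;  /* placeholder error value, an assumption of this sketch */
        return;
    }
    real_dgetrf2(m, n, A, lda, ipiv, info);
}
```

Because the wrapper carries the same name as the library routine, a recursive call made inside the library can resolve back to the wrapper, which would match the alternating frames in the backtrace. On older glibc versions the program has to be linked with -ldl.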
One additional note. The dgetrf2 function in LAPACK is a recursive function.
One correction to the description: the correct address in frame #0 is 0x3fffb77212d8, which points to the dlaswp function inside LAPACK.
I was able to run the test after setting LD_PRELOAD.
The LD_PRELOAD mechanism is exactly what I want to avoid, because I want to be able to exchange the pointers and the dlopen handle to switch LAPACK at runtime. Furthermore, on x86 and x86-64 it works without any problems.
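
To illustrate what "exchange the pointers and the dlopen handle" could look like, here is a minimal sketch. The struct lapack_backend and the functions lapack_switch and lapack_is_loaded are hypothetical names, not taken from the attachment, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

typedef void (*dgetrf_fn)(int *m, int *n, double *A, int *lda,
                          int *ipiv, int *info);

/* Hypothetical record of the active backend: handle plus resolved pointer. */
struct lapack_backend {
    void *handle;
    dgetrf_fn dgetrf;
};

static struct lapack_backend current = { NULL, NULL };

/* Load the library at `path` and make it the active backend.
 * Returns 0 on success and -1 on failure; on failure the previously
 * active backend is left untouched. */
int lapack_switch(const char *path)
{
    assert(path != NULL);

    struct lapack_backend next;
    next.handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (next.handle == NULL)
        return -1;

    next.dgetrf = (dgetrf_fn) dlsym(next.handle, "dgetrf_");
    if (next.dgetrf == NULL) {
        dlclose(next.handle);
        return -1;
    }

    /* Swap first, then release the old library. */
    void *old = current.handle;
    current = next;
    if (old != NULL)
        dlclose(old);
    return 0;
}

/* Returns 1 if a backend is currently loaded, 0 otherwise. */
int lapack_is_loaded(void)
{
    return current.handle != NULL;
}
```

Loading the new library before closing the old one means a failed switch leaves the running backend usable, which LD_PRELOAD cannot offer because it fixes the choice at process start.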
Thanks for mentioning the recursion, it was helpful for identifying the root cause. This is a GCC code generation issue, affecting the POWER ELFv2 ABI: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79439
It seems that you are right. I compiled the same code again with the PGI compiler and it works as expected. Then let's wait and see what happens on the GCC side. Thank you for your even shorter example.