Bug 27208 - abicompat doesn't consider references from the libraries to the main executable.
Summary: abicompat doesn't consider references from the libraries to the main executable.
Status: NEW
Alias: None
Product: libabigail
Classification: Unclassified
Component: default (show other bugs)
Version: unspecified
: P2 critical
Target Milestone: ---
Assignee: Dodji Seketeli
URL:
Keywords:
Depends on:
Blocks: 27019
  Show dependency treegraph
 
Reported: 2021-01-19 21:54 UTC by Ben Woodard
Modified: 2022-03-31 10:21 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
reproducer code. (520 bytes, application/x-gzip)
2021-01-19 21:54 UTC, Ben Woodard
Details
reproducer (1.50 KB, application/x-gzip)
2022-03-31 09:06 UTC, Ben Woodard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ben Woodard 2021-01-19 21:54:31 UTC
Created attachment 13135 [details]
reproducer code.

abicompat doesn't check calls from a library into functions provided by an application. It only checks that calls from the app into a library provide the same ABI. This is not sufficient to assert abi compatibility.

Say for example you have:

#include <stdio.h>

char *backcall( char *arg){
  return arg+1;
}

char *libcall( char *arg);

int main(int argc, char **argv){
  printf("%s\n", libcall(argv[0]));
  return 0;
}

and in a library you have:

// this would normally be in a header provided by the app
// it is just here to make the reproducer simple.
char *backcall( char *arg); 

char *libcall( char *arg){
  return backcall(arg);
}

However an older or a newer version of the app provided a different interface and so the library is:

// notice this is different
int backcall( char *arg);

char *libcall( char *arg){
  return arg+backcall(arg);
}

So abicompat should point out that the second version of the library is not compatible with the app but it doesn't.

$ abicompat app1 libt1.so libt2.so 
$ echo $?
0

The problem is that abicompat doesn't perform 2 way comparison with the library. it doesn't mark the call from the library to the application as needed then compare that to what the application provides.

This programming style is not uncommon. It happens anytime there is a defined plugin interface
Comment 1 Ben Woodard 2021-01-19 22:12:09 UTC
I think what needs to be done to fix this is in perform_compat_check_in_normal_mode once you have gone through the undefined symbol list from the app to the library. You need to go through all the undefined symbols in the library and if they are defined in the app, then mark them as needed in the library before the comparison.

(I haven't looked at the weak mode checking)
Comment 2 Ben Woodard 2021-03-04 18:12:32 UTC
After thinking about it more I believe that this bug is more critical than initially thought since it undermines the accuracy of the abicompat's analysis making it suspect.
Comment 3 Ben Woodard 2021-05-17 18:14:27 UTC
A much simpler manifestation of this problem is:

$ abicompat -u libt1.so 
__cxa_finalize@@GLIBC_2.2.5
$ 

The symbol in libt1.so which represents the backcall is not listed amongst the undefined symbols.

However nm does pick it up:
[ben@alien abi-2way]$ nm libt1.so | grep "U "
                 U _Z8backcallPc

The problem ends up being that libabigail doesn't consider ELF symbols that are underlinked. In the ELF the symbols do not take on a type recognized as a function:

$ eu-readelf -s libt1.so
<snip>

Symbol table [30] '.symtab' contains 55 entries:
 49 local symbols  String table: [31] '.strtab'
  Num:            Value   Size Type    Bind   Vis          Ndx Name
<snip>
   51: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF _Z8backcallPc
<snip>

Currently libabigail doesn't consider NOTYPE symbols so they are not included in the list of functions or variables returned by get_sorted_undefined_fun_symbols() and get_sorted_undefined_var_symbols()

It could be argued that best practices would be to link everything with "-z defs" and then this problem would be flagged. The problem with that approach is that there are large code bases that intentionally use underlinking for plugins. There is no real way to fully link a plugin's .so to an application with -l <application>.

To fix that problem one would have to refactor the code so that all the functions and variables used by the plugin live in another .so then both the application and the plugin would link with that. However, fixing the problem that way would make https://sourceware.org/bugzilla/show_bug.cgi?id=27514 much more acute.

We can begin to educate users to not underlink their code but libabigail still needs to be able to detect the cases where there is underlinked but ABI significant changes. 
This means that libabigail needs to consider ELF symbols of NOTYPE.

One way of determining if NOTYPE symbol may be a function or a variable would be to look at the kind of relocation it is given: 

$ eu-readelf -r libt1.so
<snip>

Relocation section [ 8] '.rela.plt' for section [20] '.got.plt' at offset 0x4b8 contains 2 entries:
  Offset              Type            Value               Addend Name
  0x0000000000004018  X86_64_JUMP_SLOT 000000000000000000      +0 __cxa_finalize
  0x0000000000004020  X86_64_JUMP_SLOT 000000000000000000      +0 _Z8backcallPc

However relocations are architecture specific and therefore all the various kinds of relocations would need to be handled for all the various architectures.

Another approach could be to use the DWARF to decide if a NOTYPE symbol is a variable or a function or something else. The problem is currently all the ELF processing is done before any DWARF is ever looked at. During that iteration through the ELF symbols NOTYPE symbols are discarded and so they are never considered when the DWARF is eventually read.

It was tentatively agreed upon that libabigail would need to hold onto the NOTYPE symbols during ELF processing. Then later on when the DWARF is parsed, the NOTYPE symbols could be sorted into functions and variables and others.

At least with current compilers there are DIEs for underlinked symbols:
]$ eu-readelf --debug-dump=info libt1.so 

DWARF section [25] '.debug_info' at offset 0x31a8:
 [Offset]
 Compilation unit at offset 0:
 Version: 5, Abbreviation section offset: 0, Address size: 8, Offset size: 4
 Unit type: compile (1)
 [     c]  compile_unit         abbrev: 1
           producer             (strp) "GNU C++17 11.1.1 20210428 (Red Hat 11.1.1-1) -mtune=generic -march=x86-64 -g"
           language             (data1) C_plus_plus_14 (33)
           name                 (line_strp) "lib1.C"
           comp_dir             (line_strp) "/home/ben/Shared/test/abi-2way"
           low_pc               (addr) +0x0000000000001109 <_Z7libcallPc>
           high_pc              (data8) 26 (+0x0000000000001123)
           stmt_list            (sec_offset) 0
 [    2e]    subprogram           abbrev: 2
             external             (flag_present) yes
             name                 (strp) "backcall"
             decl_file            (data1) lib1.C (1)
             decl_line            (data1) 1
             decl_column          (data1) 7
             linkage_name         (strp) "_Z8backcallPc"
             type                 (ref4) [    48]
             declaration          (flag_present) yes
             sibling              (ref4) [    48]
 [    42]      formal_parameter     abbrev: 3
               type                 (ref4) [    48]
 [    48]    pointer_type         abbrev: 4
             byte_size            (data1) 8
             type                 (ref4) [    4e]
 [    4e]    base_type            abbrev: 5
             byte_size            (data1) 1
             encoding             (data1) signed_char (6)
             name                 (strp) "char"
<snip>

The functions like get_sorted_undefined_fun_symbols() and get_sorted_undefined_var_symbols() would need to merge in the NOTYPE ELF symbols in their iterators.

This becomes an annoying problem but it needs to be solved.
Comment 4 Ben Woodard 2022-03-29 20:41:58 UTC
Splitting off 29008 the problem with NOTYPE ELF symbols from this issue. The original reproducer that I constructed demonstrated a different problem. I would like to limit this to references from the library to symbols back in the main executable.
Comment 5 Ben Woodard 2022-03-31 09:06:26 UTC
Created attachment 14043 [details]
reproducer

Replace the reproducer code which actually pointed out the NOTYPE problem with one that points out the actual problem with ABI references from the library to the main executable.
Comment 6 Ben Woodard 2022-03-31 10:08:57 UTC
Darn, the reproducer still isn't exactly right. The trick that I used to work around the NOTYPE problem was to use a weak symbol in the library. This wouldn't show up as undefined and so it wouldn't be considered when doing ABI checking. So instead of finding a way around the NOTYPE linking problem to show the problem with backcalls, I inadvertently uncovered yet another logical bug in abicompat. It doesn't take into account weak symbol replacement. I'll fork that off to another bug.
Comment 7 Ben Woodard 2022-03-31 10:21:06 UTC
Split off the weak symbol replacement problem to https://sourceware.org/bugzilla/show_bug.cgi?id=29013