Created attachment 13135 [details] reproducer code. abicompat doesn't check calls from a library into functions provided by an application. It only checks that calls from the app into a library provide the same ABI. This is not sufficient to assert abi compatibility. Say for example you have: #include <stdio.h> char *backcall( char *arg){ return arg+1; } char *libcall( char *arg); int main(int argc, char **argv){ printf("%s\n", libcall(argv[0])); return 0; } and in a library you have: // this would normally be in a header provided by the app // it is just here to make the reproducer simple. char *backcall( char *arg); char *libcall( char *arg){ return backcall(arg); } However an older or a newer version of the app provided a different interface and so the library is: // notice this is different int backcall( char *arg); char *libcall( char *arg){ return arg+backcall(arg); } So abicompat should point out that the second version of the library is not compatible with the app but it doesn't. $ abicompat app1 libt1.so libt2.so $ echo $? 0 The problem is that abicompat doesn't perform 2 way comparison with the library. it doesn't mark the call from the library to the application as needed then compare that to what the application provides. This programming style is not uncommon. It happens anytime there is a defined plugin interface
I think what needs to be done to fix this is in perform_compat_check_in_normal_mode once you have gone through the undefined symbol list from the app to the library. You need to go through all the undefined symbols in the library and if they are defined in the app, then mark them as needed in the library before the comparison. (I haven't looked at the weak mode checking)
After thinking about it more I believe that this bug is more critical than initially thought since it undermines the accuracy of the abicompat's analysis making it suspect.
A much simpler manifestation of this problem is: $ abicompat -u libt1.so __cxa_finalize@@GLIBC_2.2.5 $ The symbol in libt1.so which represents the backcall is not listed amongst the undefined symbols. However nm does pick it up: [ben@alien abi-2way]$ nm libt1.so | grep "U " U _Z8backcallPc The problem ends up being that libabigail doesn't consider ELF symbols that are underlinked. In the ELF the symbols do not take on a type recognized as a function: $ eu-readelf -s libt1.so <snip> Symbol table [30] '.symtab' contains 55 entries: 49 local symbols String table: [31] '.strtab' Num: Value Size Type Bind Vis Ndx Name <snip> 51: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UNDEF _Z8backcallPc <snip> Currently libabigail doesn't consider NOTYPE symbols so they are not included in the list of functions or variables returned by get_sorted_undefined_fun_symbols() and get_sorted_undefined_var_symbols() It could be argued that best practices would be to link everything with "-z defs" and then this problem would be flagged. The problem with that approach is that there are large code bases that intentionally use underlinking for plugins. There is no real way to fully link a plugin's .so to an application with -l <application>. To fix that problem one would have to refactor the code so that all the functions and variables used by the plugin live in another .so then both the application and the plugin would link with that. However, fixing the problem that way would make https://sourceware.org/bugzilla/show_bug.cgi?id=27514 much more acute. We can begin to educate users to not underlink their code but libabigail still needs to be able to detect the cases where there is underlinked but ABI significant changes. This means that libabigail needs to consider ELF symbols of NOTYPE. One way of determining if NOTYPE symbol may be a function or a variable would be to look at the kind of relocation it is given: $ eu-readelf -r libt1.so <snip> Relocation section [ 8] '.rela.plt' for section [20] '.got.plt' at offset 0x4b8 contains 2 entries: Offset Type Value Addend Name 0x0000000000004018 X86_64_JUMP_SLOT 000000000000000000 +0 __cxa_finalize 0x0000000000004020 X86_64_JUMP_SLOT 000000000000000000 +0 _Z8backcallPc However relocations are architecture specific and therefore all the various kinds of relocations would need to be handled for all the various architectures. Another approach could be to use the DWARF to decide if a NOTYPE symbol is a variable or a function or something else. The problem is currently all the ELF processing is done before any DWARF is ever looked at. During that iteration through the ELF symbols NOTYPE symbols are discarded and so they are never considered when the DWARF is eventually read. It was tentatively agreed upon that libabigail would need to hold onto the NOTYPE symbols during ELF processing. Then later on when the DWARF is parsed, the NOTYPE symbols could be sorted into functions and variables and others. At least with current compilers there are DIEs for underlinked symbols: ]$ eu-readelf --debug-dump=info libt1.so DWARF section [25] '.debug_info' at offset 0x31a8: [Offset] Compilation unit at offset 0: Version: 5, Abbreviation section offset: 0, Address size: 8, Offset size: 4 Unit type: compile (1) [ c] compile_unit abbrev: 1 producer (strp) "GNU C++17 11.1.1 20210428 (Red Hat 11.1.1-1) -mtune=generic -march=x86-64 -g" language (data1) C_plus_plus_14 (33) name (line_strp) "lib1.C" comp_dir (line_strp) "/home/ben/Shared/test/abi-2way" low_pc (addr) +0x0000000000001109 <_Z7libcallPc> high_pc (data8) 26 (+0x0000000000001123) stmt_list (sec_offset) 0 [ 2e] subprogram abbrev: 2 external (flag_present) yes name (strp) "backcall" decl_file (data1) lib1.C (1) decl_line (data1) 1 decl_column (data1) 7 linkage_name (strp) "_Z8backcallPc" type (ref4) [ 48] declaration (flag_present) yes sibling (ref4) [ 48] [ 42] formal_parameter abbrev: 3 type (ref4) [ 48] [ 48] pointer_type abbrev: 4 byte_size (data1) 8 type (ref4) [ 4e] [ 4e] base_type abbrev: 5 byte_size (data1) 1 encoding (data1) signed_char (6) name (strp) "char" <snip> The functions like get_sorted_undefined_fun_symbols() and get_sorted_undefined_var_symbols() would need to merge in the NOTYPE ELF symbols in their iterators. This becomes an annoying problem but it needs to be solved.
Splitting off 29008 the problem with NOTYPE ELF symbols from this issue. The original reproducer that I constructed demonstrated a different problem. I would like to limit this to references from the library to symbols back in the main executable.
Created attachment 14043 [details] reproducer Replace the reproducer code which actually pointed out the NOTYPE problem with one that points out the actual problem with ABI references from the library to the main executable.
Darn, the reproducer still isn't exactly right. The trick that I used to work around the NOTYPE problem was to use a weak symbol in the library. This wouldn't show up as undefined and so it wouldn't be considered when doing ABI checking. So instead of finding a way around the NOTYPE linking problem to show the problem with backcalls, I inadvertently uncovered yet another logical bug in abicompat. It doesn't take into account weak symbol replacement. I'll fork that off to another bug.
Split off the weak symbol replacement problem to https://sourceware.org/bugzilla/show_bug.cgi?id=29013