26646 – unexpected declaration-only types

Bug 26646 - unexpected declaration-only types

Summary: unexpected declaration-only types

Status:	ASSIGNED

Alias:	None

Product:	libabigail
Classification:	Unclassified
Component:	default (show other bugs)
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Dodji Seketeli

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-09-22 11:35 UTC by Giuliano Procida
Modified:	2022-06-08 15:43 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:	2022-02-10 00:00:00

Attachments
XML extracted from attached kernel file, compressed (675.02 KB, application/gzip) 2022-01-18 11:20 UTC, gprocida	Details
Second candidate patch (2.24 KB, patch) 2022-02-10 17:33 UTC, Dodji Seketeli	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Giuliano Procida 2020-09-22 11:35:01 UTC

We've noticed for some time that we get a mix of declarations and definitions appearing in kernel ABI XML for the same types.

In some cases, only a declaration appears despite there being a full definition in the DWARF. Often, this declaration has a "size" despite the size information being associated with the full DWARF definition, not the declaration.

I've sent in a patch which illustrates this bug (by changing the order in which definitions and declarations are canonicalised). If need be, I can send over a vmlinux.

Patch: https://sourceware.org/pipermail/libabigail/2020q3/002679.html

The patch serves as workaround for around 160 kernel types which appear incomplete in the ABI.

Comment 1 Dodji Seketeli 2020-09-22 12:15:13 UTC

Hello

"gprocida+abigail at google dot com" <sourceware-bugzilla@sourceware.org> writes:

> We've noticed for some time that we get a mix of declarations and definitions
> appearing in kernel ABI XML for the same types.
>
> In some cases, only a declaration appears despite there being a full definition
> in the DWARF. Often, this declaration has a "size" despite the size information
> being associated with the full DWARF definition, not the declaration.
>
> I've sent in a patch which illustrates this bug (by changing the order in which
> definitions and declarations are canonicalised). If need be, I can send over a
> vmlinux.

I would definitely want to get my hands on a vmlinux which exhibits the
issue, yes, please.

> Patch: https://sourceware.org/pipermail/libabigail/2020q3/002679.html

Thanks.

> The patch serves as workaround for around 160 kernel types which appear
> incomplete in the ABI.

Right, I'd prefer fixing the root cause.  So as I said above, the
vmlinux would be greatly appreciated.

Thanks!

Comment 2 Giuliano Procida 2020-10-06 21:02:38 UTC

I've dropped an AOSP vmlinux here: https://github.com/myxoid/libabigail/commit/51fd7ce8d9b896bf4736bb638266bda5c761ee8a. Uncompressed it's about 0.25G.

The reordering change (the previous commit) eliminate 168 declaration-only types from the ABI XML.

Sorry this took so long.

Comment 3 Giuliano Procida 2020-10-07 09:33:22 UTC

I'd recommend doing the following.

build/tools/abidw --no-comp-dir-path --no-show-locs --type-id-style hash vmlinux mainline.xml.master

cherry-pick the patch https://sourceware.org/pipermail/libabigail/2020q3/002679.html
and rebuild

build/tools/abidw --no-comp-dir-path --no-show-locs --type-id-style hash vmlinux mainline.xml.ordered

Despite the abidw options given, it's a bit much to try to diff these by hand as there are thousands of differences. Analysis shows 168 fewer declaration-only (only) types in the latter XML.

*Assuming* both XML files are in some sense valid, abidiff can illustrate other potential problems. It would make sense to open separate Bugzilla issues, if confirmed.

Do 4 abidiffs (with / without --harmless, with / without --leaf-changes-only) between mainline.xml.master and mainline.xml.ordered.

abidiff - shows lots of stuff being filtered out (hmmm)

abidiff --harmless - shows diffs like the following (bug?)

            type changed from:
              union {snd_info_entry_text text; const snd_info_entry_ops* ops;}
            to:
              union {snd_info_entry_text text; const snd_info_entry_ops* ops;}

abidiff --leaf-changes-only - shows some type changes filtered out

abidiff --leaf-changes-only --harmless - shows there's at least one bug in reporting (perhaps due to the same anonymous union issue?)

Lastly, some of these abidiff runs take a long time (up to 7 minutes on a fast machine), but that may be expected given the large XML differences.

Comment 4 Giuliano Procida 2020-11-23 17:06:02 UTC

This should be revisited once the latest canonicalisation changes are in and things have settled down.

Comment 5 gprocida 2022-01-17 17:46:28 UTC

A lot of changes and fixes have happened (including to the Clang compiler) since this bug was originally opened.

I have a new test case which is very stable in terms of how abidw is built. I get identical XML for each of two Android vmlinux objects, regardless of whether built from upstream master (modulo disabling one assertion in the symtab reader) or AOSP or the Google monorepo builds.

Between the two kernels, the ABIs differ only in that two types that were declared in the first ABI appear fully defined in the second. For example:

<class-decl name='ip_mc_list' is-struct='yes' visibility='default' is-declaration-only='yes' id='c2a59aaa'/>

vs

    <class-decl name='ip_mc_list' size-in-bits='1152' is-struct='yes' visibility='default' id='c2a59aaa'>
      <data-member access='public' layout-offset-in-bits='0'>
...


Looking at the 2 DWARF dumps, locally at least, nothing has changed in relation to the one type I checked. Both files have 1 declaration and 14 definitions of the type.

Tracing through abidw execution, I see differing numbers of add_or_update_class_type calls for that type.

In particular, there is only one such call for first vmlinux object and I'm guessing it's for the only declaration-only DIE in the DWARF (as both appear first in the DWARF dump and in the debug output).

I also noticed that the wrapper build_ir_node_from_die function hard codes as true the is_declaration_only parameter passed to the other build_ir_node_from_die function. 

If I interpose a check for the specific type name and flip this flag to false, then I *do* see a fully-defined type in the output XML.

I think the test files and this information should be enough for a proper investigation. Unfortunately, the files are rather large. If you only need to work with the first kernel, size requirements are halved, but still too large for Bugzilla.

$ ls -lh declarations.tar.bz2 vmlinux[34]{,.abidw_g3}
-rw-r----- 1 gprocida primarygroup 264M Jan 17 17:19 declarations.tar.bz2
-rw-r----- 1 gprocida primarygroup 471M Jan 17 14:14 vmlinux3
-rw-r----- 1 gprocida primarygroup 7.5M Jan 17 14:32 vmlinux3.abidw_g3
-rw-r----- 1 gprocida primarygroup 471M Jan 17 14:16 vmlinux4
-rw-r----- 1 gprocida primarygroup 7.5M Jan 17 14:32 vmlinux4.abidw_g3

I was testing with ip_mc_list. These are two differences in the ABIs:

$ stgdiff --abi --format small /tmp/vmlinux{3,4}.abidw_g3 -o /dev/stdout
type 'struct udp_table' changed
  was only declared, is now fully defined

type 'struct ip_mc_list' changed
  was only declared, is now fully defined

Let me know how I can get these files to you. Alternatively, if you think you have a quick fix, I'd be happy to try it out.

Comment 6 gprocida 2022-01-18 09:37:59 UTC

A little more debugging...

It is the called_from_public_decl check that prevents the fully-defined type being built. The code change between the two kernel versions seems to have nothing at all to do with the the two types affected, so perhaps something is going wrong in this area.

One way to override this and avoid losing definitions is --load-all-types.

This option expensive in processing time and in the size of XML file generated. The size is less of a problem (as we have a tool to remove all types that cannot be reached from a symbol or another reachable type) but the processing time may be. We will look into using this option pending a resolution of the problem.

Comment 7 gprocida 2022-01-18 11:20:36 UTC

Created attachment 13912 [details]
XML extracted from attached kernel file, compressed

abidw --type-id-style hash --no-show-locs --no-comp-dir-path vmlinux3 > vmlinux3.abidw_g3

Comment 8 gprocida 2022-01-18 11:20:51 UTC

Kernel file:

https://ci.android.com/builds/submitted/8082661/kernel_abi_aarch64/latest - vmlinux (470M) - this is what I've been calling vmlinux3

I don't think you actually need the other vmlinux and, as it was a presubmit test artefact, I don't think it's public.

Comment 9 gprocida 2022-01-18 11:32:08 UTC

FTR, with --load-all-types:

* vmlinux3 ends up with 102 fewer declaration-only types
* vmlinux4 ends up with 100 fewer declaration-only types
* overall they both end up with same defined and declaration-only types

Comment 10 Dodji Seketeli 2022-01-24 15:07:07 UTC

Hello,

Here is my analysis on what is going on.

Let's look at the DWARF output and the abixml of the vmlinux3 binary.

From the abixml, we see that the struct ip_mc_list is used from the struct in_device type.  For instance, the ip_mc_list::mc_list data member is of type "pointer to decl-only struct ip_mc_list".

Moreover, the only use of struct ip_mc_list is through a "pointer to decl-only struct ip_mc_list" use.

Let's look at the DWARF to see if we see the same.

[5ebb608]    structure_type       abbrev: 25
             name                 (strp) "in_device"
             byte_size            (data2) 352
             decl_file            (data1) inetdevice.h (122)
             decl_line            (data1) 25

This is the definition DIE of the struct in_device type.

[...]
     [5ebb641]      member               abbrev: 4
                   name                 (strp) "mc_list"
                   type                 (ref4) [5ebb7f7]
                   decl_file            (data1) inetdevice.h (122)
                   decl_line            (data1) 31
                   data_member_location (data1) 24

This is the definition of the "in_device::mc_list" data member.

Its type is the DIE [5ebb7f7] ...

[...]
 
[5ebb7f7]    pointer_type         abbrev: 5
             type                 (ref4) [5ebb7fc]

Here we see that the DIE [5ebb7f7] is a pointer ...

[5ebb7fc]    structure_type       abbrev: 12
            name                 (strp) "ip_mc_list"
            declaration          (flag_present) yes

... to a struct ip_mc_list that is declaration-only.

Looking through the DWARF, there no public exported function that refers directly or indirectly to the full definition of ip_mc_list, as defined by this DIE, for instance:

 [9b97e73]    structure_type       abbrev: 3
             name                 (strp) "ip_mc_list"
             byte_size            (data1) 144
             decl_file            (data1) igmp.h (228)
             decl_line            (data1) 67
[...]

So, this is why libabigail considers that the /definition/ of the struct ip_mc_list is not used by any function that is publicly defined.  So, that struct definition is not part of the ABI.

What am I missing?

Comment 11 gprocida 2022-01-24 17:29:08 UTC

Hi Dodji.

Thanks for digging into this. I don't really have any tools for working with DWARF other than dwarfdump, grep etc.

For vmlinux3, I see 1 declaration

< 1><0x0000e14b>    DW_TAG_structure_type
                      DW_AT_name                  ip_mc_list
                      DW_AT_declaration           yes(1)

and 13 definitions starting with this one

< 1><0x00018293>    DW_TAG_structure_type
                      DW_AT_name                  ip_mc_list
                      DW_AT_byte_size             0x00000090

Tracing back all 13 definitions to potential functions and variables was beyond what I could achieve.

If all 13 are indeed not reachable, that does explain what's going on.

Note that I see identical 1 + 13 DIEs for vmlinux4 but there we *do* get a definition in the XML.

Regards,
Giuliano.

Comment 12 gprocida 2022-02-07 10:40:53 UTC

I wrote a hacky readelf output to DIE graph script. For both kernels, I can find chains of DIEs:

function dev_get_by_name returns type struct net_device *
struct net_device has a member ip_ptr of type struct in_device *
struct in_device has a member mc_list of type struct ip_mc_list

Comment 13 Dodji Seketeli 2022-02-07 11:19:26 UTC

Indeed, looking at the combination of those two kernels I can see the problem now, thanks!

I have submitted two patches to at https://sourceware.org/pipermail/libabigail/2022q1/004145.html. Could you please review/test them to see if they address the issue as you see it?

Thanks!

Comment 14 gprocida 2022-02-07 17:47:26 UTC

The other type that is different between vmlinux3 and vmlinux4 is struct udp_table.

I haven't yet traced a shortest path from a symbol to the full definition yet. In theory this should be straightforward based on the XML for the second kernel, but in practice I need to write some more code to automate the process.

I've found a ridiculously long path but it would be much too painful to verify by hand.

Note that I suspect there could be up to around 100 types which could be assigned full definitions.

Comment 15 gprocida 2022-02-07 21:15:10 UTC

Hi Dodji.

I collected shortest paths from symbols to struct udp_table in the ABI for vmlinux4 and I've traced one by hand through the DWARF DIEs (readelf --debug-dump) for vmlinux3:

function symbol dev_get_by_name (in ksymtab) has a parameter of type struct net *                                                      
struct net has member rtln of type struct sock *                                                                          
struct sock has member sk_prot_creator of type struct proto *                                                             
struct proto has member h of an anonymous union type                                                                      
the anonymous union type has member udp_table of type struct udp_table *                                                  
struct udp_table is fully defined

If you can see why this path is not being considered by the DWARF reader, I think the fix will address more than just this one instance.

Regards,
Giuliano.

Comment 16 gprocida 2022-02-10 16:46:24 UTC

I've done a bit more debugging.

Tracing through DIE processing for both vmlinux{3,4} I see that in the former case, the first and only DIE for struct udp_table that is reached with "called from public decl" true is a declaration-only DIE.

So something is preventing the DWARF reader from seeing the (later) definition reachable from dev_get_by_name.

Tracing through what happens with dev_get_by_name, I see that recursion stops at struct net * which has already been loaded.

In turn that was loaded by function netif_set_real_num_tx_queues -> struct net_device * -> struct net_device -> struct in_device * -> struct in_device -> struct net *. This continues struct net -> struct sock * -> struct sock -> struct proto * -> struct proto. Recursion seems to stop at struct proto, but I'm not sure why, yet.

Comment 17 gprocida 2022-02-10 17:21:09 UTC

Actually, there are ~15 occurrences of struct proto DIEs being reached and I haven't traced through all the code to work out why it never gets to a full definition of struct udp_table.

Comment 18 Dodji Seketeli 2022-02-10 17:33:34 UTC

Created attachment 13970 [details]
Second candidate patch

This patch is the second attempt at fixing the issue and should address specifically the comment https://sourceware.org/bugzilla/show_bug.cgi?id=26646#c15

Comment 19 Dodji Seketeli 2022-02-10 17:36:42 UTC

(In reply to gprocida from comment #15)
> Hi Dodji.
> 
> I collected shortest paths from symbols to struct udp_table in the ABI for
> vmlinux4 and I've traced one by hand through the DWARF DIEs (readelf
> --debug-dump) for vmlinux3:
> 
> function symbol dev_get_by_name (in ksymtab) has a parameter of type struct
> net *                                                      
> struct net has member rtln of type struct sock *                            
> 
> struct sock has member sk_prot_creator of type struct proto *               
> 
> struct proto has member h of an anonymous union type                        
> 
> the anonymous union type has member udp_table of type struct udp_table *    
> 
> struct udp_table is fully defined
> 
> If you can see why this path is not being considered by the DWARF reader, I
> think the fix will address more than just this one instance.
> 
> Regards,
> Giuliano.

Thanks a lot for this analysis, it's super useful.

I came up with this patch https://sourceware.org/bugzilla/attachment.cgi?id=13970.  Would it move the needle a bit more?

Comment 20 gprocida 2022-02-10 18:53:18 UTC

Hi.

I've tested the latest patch.

It increased the number of fully-defined types for vmlinux3! Unfortunately, while it also added the same type definition for vmlinux4, our old friend ip_mc_list became declaration-only. :-(

I still think it is a step in the right direction. Certainly the recursion limit of 5 aggregates was arbitrary and small unit tests would be unlikely to exercise the code.

In terms of the code...

I think it would be better to store pairs of offsets though it would probably be extremely hard to find an example where this would make any difference.

It's safe to recurse a bit too much, so long as there is some termination guarantee, but potentially unsafe to stop recursion too early which is theoretically possible if you compare DIEs A and B and deeper in the recursion want to compare C and B.

The comparison algorithm I put together tracks pairs of nodes compared as this corresponds exactly to the design aim of avoiding repeated comparisons.

Giuliano.

Comment 21 gprocida 2022-02-11 12:51:55 UTC

For vmlinux4, I've found one (and only one) direct path to a full definition of struct ip_mc_list:

9f34bf9 function udp4_hwcsum (in ksymtab)
9f34c0f parameter skb
9f1b29c pointer
9f1b2a1 struct sk_buff
9f1b2aa anonymous member
9f1b2b3 anonymous union
9f1b2c1 anonymous struct
9f1b2e9 anonymous union
9f1b2ee member dev
9f1a5ad pointer
9f1a5b2 struct net_device
9f1a9c2 member ip_ptr
9f2e287 pointer
9f2e28c struct in_device
9f2e2c5 member mc_list
9f2e47b pointer
9f2e480 struct ip_mc_list (fully defined)

The DIE offsets are as reported by readelf --debug-dump.

I hope this is useful.

Comment 22 gprocida 2022-02-11 13:00:26 UTC

I missed out a couple of anonymous members.

9f34bf9 function udp4_hwcsum (in ksymtab)
9f34c0f parameter skb
9f1b29c pointer
9f1b2a1 struct sk_buff
9f1b2aa anonymous member
9f1b2b3 anonymous union
9f1b2b8 anonymous member
9f1b2c1 anonymous struct
9f1b2e0 anonymous member
9f1b2e9 anonymous union
9f1b2ee member dev
9f1a5ad pointer
9f1a5b2 struct net_device
9f1a9c2 member ip_ptr
9f2e287 pointer
9f2e28c struct in_device
9f2e2c5 member mc_list
9f2e47b pointer
9f2e480 struct ip_mc_list (fully defined)

Comment 23 Dodji Seketeli 2022-02-24 11:09:09 UTC

Thanks for the analysis, very useful as usual.

Would the content of this branch (PR26646) fix the issues you are seeing?

https://sourceware.org/git/?p=libabigail.git;a=shortlog;h=refs/heads/PR26646

I still need to address your comment https://sourceware.org/bugzilla/show_bug.cgi?id=26646#c20, but I wanted to fix the ip_mc_list decl-only-ness first.

Comment 24 gprocida 2022-02-24 12:16:26 UTC

Hi.

I ran some quick tests - the libabigail test suite and on the two kernels.

Good

* vmlinux3 and vmlinux4 get ABIs which are the same, including declaration-only / fully-defined status
* both have more fully-defined types than the previous iteration (around 9)
* a Linux test case looks better (some declaration-only duplicates vanish)

Not so good

* struct can_dev_rcv_lists is now declaration-only (w.r.t. to the original baseline ABIs)
* struct prefix_info too (w.r.t. an intermediate code version)
* there is a test case regression (nmap)

I'll do some more digging, but probably nothing directly with these two types right now.

Someone here pointed out to me that Clang and GCC have "type homing" logic to reduce the definitions that get emitted into DWARF to what's needed (leaving some types declaration-only). The relevant flags (to disable this) are -fstandalone-debug and -femit-class-debug-always, respectively.

You might decide that if either of these flags makes a difference to libabigail output then there's a bug somewhere in the compiler, linker or libabigail. Or it might be the case that disabling the logic exposes the types behind opaque pointers in a way which is unwanted.

In any case, I'm going to see what sort of impact -fstandalone-debug has on ABIs and kernel build time and size.

Regards,
Giuliano.

Comment 25 Dodji Seketeli 2022-02-24 15:54:27 UTC

(In reply to gprocida from comment #24)
> Hi.
> 
> I ran some quick tests - the libabigail test suite and on the two kernels.

Thanks!

> Good
> 
> * vmlinux3 and vmlinux4 get ABIs which are the same, including
> declaration-only / fully-defined status
> * both have more fully-defined types than the previous iteration (around 9)
> * a Linux test case looks better (some declaration-only duplicates vanish)

Good to know.

> 
> Not so good
> 
> * struct can_dev_rcv_lists is now declaration-only (w.r.t. to the original
> baseline ABIs)
> * struct prefix_info too (w.r.t. an intermediate code version)

OK, I haven't analyzed this, is it sure that it was supposed to be fully declared?

> * there is a test case regression (nmap)

OK, I think I have the reason for this. It seems to have been an unfortunate patch that snuck into the tree.  I have removed it and re-push the branch.  You can fetch it again and test.  This one should disappear.

If the regression disappears, I guess we can say that the current state is an improvement.  If so, I'll clean-up and post the patches on the list, if you agree.  Further investigation will continue after that.

Thanks.

Comment 26 gprocida 2022-02-24 16:05:18 UTC

Hi.

(In reply to dodji from comment #25)
> (In reply to gprocida from comment #24)
> > * struct can_dev_rcv_lists is now declaration-only (w.r.t. to the original
> > baseline ABIs)
> > * struct prefix_info too (w.r.t. an intermediate code version)
> 
> OK, I haven't analyzed this, is it sure that it was supposed to be fully
> declared?

Good question. I have to assume there is some set of paths where we can get from the symbol to the full definition, either by going down a path or by jumping from a declaration in one path to a definition in another path. However, I have not tried to code up this idea!

> > * there is a test case regression (nmap)
> 
> OK, I think I have the reason for this. It seems to have been an unfortunate
> patch that snuck into the tree.  I have removed it and re-push the branch. 
> You can fetch it again and test.  This one should disappear.

It has.

> If the regression disappears, I guess we can say that the current state is
> an improvement.  If so, I'll clean-up and post the patches on the list, if
> you agree.  Further investigation will continue after that.

Agreed. Thanks!

Comment 27 Dodji Seketeli 2022-02-28 09:59:44 UTC

(In reply to gprocida from comment #20)

[...]

> In terms of the code...
> 
> I think it would be better to store pairs of offsets though it would
> probably be extremely hard to find an example where this would make any
> difference.

I have update the patch to do just that.

I posted it for review at https://sourceware.org/pipermail/libabigail/2022q1/004178.html.

If it works for you, I'll be glad to apply it.

[...]

Thanks.

Comment 28 Dodji Seketeli 2022-02-28 10:01:00 UTC

(In reply to gprocida from comment #26)

> > If the regression disappears, I guess we can say that the current state is
> > an improvement.  If so, I'll clean-up and post the patches on the list, if
> > you agree.  Further investigation will continue after that.
> 
> Agreed. Thanks!

I have posted the cleaned-up patch for review at https://sourceware.org/pipermail/libabigail/2022q1/004179.html.

Thanks.

Comment 29 gprocida 2022-03-01 14:34:45 UTC

Hi.

On Mon, 28 Feb 2022 at 09:59, dodji at redhat dot com
<sourceware-bugzilla@sourceware.org> wrote:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=26646
>
> --- Comment #27 from dodji at redhat dot com ---
> (In reply to gprocida from comment #20)
>
> [...]
>
> > In terms of the code...
> >
> > I think it would be better to store pairs of offsets though it would
> > probably be extremely hard to find an example where this would make any
> > difference.
>
> I have update the patch to do just that.
>
> I posted it for review at
> https://sourceware.org/pipermail/libabigail/2022q1/004178.html.
>

Two (overlapping) things:

1. I wouldn't bother with inserting / testing / erasing the reverse
pair o(p2, p1)

It might cost more than it ever saves. But it would take
instrumentation to see if it ever makes a difference.

If you make this change then s/unordered pair/ordered pair/ in the doc comments.

2. There is a typo.

One of the reverse pairs is written as o(p2, 1).

Otherwise, it looks good to me.

Giuliano.

> If it works for you, I'll be glad to apply it.
>
> [...]
>
> Thanks.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.

Comment 30 gprocida 2022-03-01 14:40:28 UTC

Hi Dodji.

On Mon, 28 Feb 2022 at 10:01, dodji at redhat dot com
<sourceware-bugzilla@sourceware.org> wrote:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=26646
>
> --- Comment #28 from dodji at redhat dot com ---
> (In reply to gprocida from comment #26)
>
> > > If the regression disappears, I guess we can say that the current state is
> > > an improvement.  If so, I'll clean-up and post the patches on the list, if
> > > you agree.  Further investigation will continue after that.
> >
> > Agreed. Thanks!
>
> I have posted the cleaned-up patch for review at
> https://sourceware.org/pipermail/libabigail/2022q1/004179.html.
>

I've thought a little bit more about this one.

The current check is "local", recursion prevention at *this* DIE
prevents it from being canonicalised, but it could still depend on
child DIEs that are actually also parent DIEs and whose processing has
not yet been completed.

Would it be safer (more precise) to inhibit on-the-fly
canonicalisation whenever the set (stack) of aggregates being compared
is non-empty?

Giuliano.

> Thanks.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.

Comment 31 Dodji Seketeli 2022-03-02 13:34:14 UTC

Dodji Seketeli via Libabigail <libabigail@sourceware.org> a écrit:

[...]


> This addresses https://sourceware.org/bugzilla/show_bug.cgi?id=26646#c21.
>
> 	* src/abg-dwarf-reader.cc (compare_dies): Do not propagate
> 	canonical type when aggregate redundancy is detected.
>

[...]

gprocida at google dot com via Libabigail <libabigail@sourceware.org> a
écrit:

[...]

> I've thought a little bit more about this one.
>
> The current check is "local", recursion prevention at *this* DIE
> prevents it from being canonicalised, but it could still depend on
> child DIEs that are actually also parent DIEs and whose processing has
> not yet been completed.

You are right.  The current state is an improvement, but it's not
"complete".  Some DIEs might still wrongly considered as being
equivalent just because they depend on a "recursive DIE" that was
detected as such.  The canonical DIE propagation might have happened
during the window of time where the recursive DIE was comparing equal to
its counterpart.

This is addressed in the IR type canonicalization algorithm in
abg-ir.cc.
To learn more about this, look for
/// @defgroup OnTheFlyCanonicalization On-the-fly Canonicalization

and that comment.

The implementation is scattered in the functions
return_comparison_result, maybe_propagate_canonical_type and in the
macro RETURN_TRUE_IF_COMPARISON_CYCLE_DETECTED.

More on this below ...

> Would it be safer (more precise) to inhibit on-the-fly
> canonicalisation whenever the set (stack) of aggregates being compared
> is non-empty?

Right, that is the obvious thing to do, as I thought.  But then the
problem I encountered, when doing this for IR types, was that things
were too slow.  Really really slow.  In the time window where the
canonical type DIE is inhibited, the amount of (quadratic) structural
DIE comparisons goes through the roof.  That is why I came up with the
canonical type propagation in the first place.

So, the other (harder) approach I've taken is to not disable the
on-the-fly canonicalization when the stack of aggregates being compared
is non-empty, but rather, keep track of the types which are resolved as
being equivalent, but that depend on recursive aggregate that were
detected as such.

I call these types "dependent types".  Let's consider one such dependent
type D.  D's canonical type is the result of canonical propagation that
happened as the recursive type (that D depends on) was on the stack of
aggregates being compared.  D is labelled as having a "non-confirmed"
canonical type.  If the recursive type it depends on is later considered
not being equivalent to its counterpart, then the non-confirmed
canonical of D is going to be "cancelled" and then D is going to be
considered as being non canonicalized.

This is done for D and all the types that depends on it.

By doing this, the number of non canonicalized types is reduced to its
absolute minimum and so comparisons are reasonably fast, even for an
applications like the Kernel or Gimp.  Libraries usually have smaller
type graphs so we don't usually see the speed issue there.  Unless it's
llvm.so or libwebkit.so ;-)

So, I wasn't going to get into doing this for DIEs right away because it
takes a lot of time doing careful coding/debugging/profiling cycles.

But I definitely think we'll need to do this to have perfectly precise
canonicalizer.  My point was to get this in as it's an improvement
already and then work on the subsequent patch with a cooler head.

Does that make any sense?

Thanks.

Comment 32 gprocida 2022-03-02 22:36:23 UTC

(Not sure what's going on with the addresses, I've added back libabigail here.)

Hi.

On Wed, 2 Mar 2022 at 13:34, dodji at seketeli dot org
<sourceware-bugzilla@sourceware.org> wrote:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=26646
>
> --- Comment #31 from dodji at seketeli dot org ---
> Dodji Seketeli via Libabigail <libabigail@sourceware.org> a écrit:
>
> [...]
>
>
> > This addresses https://sourceware.org/bugzilla/show_bug.cgi?id=26646#c21.
> >
> >       * src/abg-dwarf-reader.cc (compare_dies): Do not propagate
> >       canonical type when aggregate redundancy is detected.
> >
>
> [...]
>
> gprocida at google dot com via Libabigail <libabigail@sourceware.org> a
> écrit:
>
> [...]
>
> > I've thought a little bit more about this one.
> >
> > The current check is "local", recursion prevention at *this* DIE
> > prevents it from being canonicalised, but it could still depend on
> > child DIEs that are actually also parent DIEs and whose processing has
> > not yet been completed.
>
> You are right.  The current state is an improvement, but it's not
> "complete".  Some DIEs might still wrongly considered as being
> equivalent just because they depend on a "recursive DIE" that was
> detected as such.  The canonical DIE propagation might have happened
> during the window of time where the recursive DIE was comparing equal to
> its counterpart.
>
> This is addressed in the IR type canonicalization algorithm in
> abg-ir.cc.
> To learn more about this, look for
> /// @defgroup OnTheFlyCanonicalization On-the-fly Canonicalization
>
> and that comment.
>
> The implementation is scattered in the functions
> return_comparison_result, maybe_propagate_canonical_type and in the
> macro RETURN_TRUE_IF_COMPARISON_CYCLE_DETECTED.
>
> More on this below ...
>
> > Would it be safer (more precise) to inhibit on-the-fly
> > canonicalisation whenever the set (stack) of aggregates being compared
> > is non-empty?
>
> Right, that is the obvious thing to do, as I thought.  But then the
> problem I encountered, when doing this for IR types, was that things
> were too slow.  Really really slow.  In the time window where the
> canonical type DIE is inhibited, the amount of (quadratic) structural
> DIE comparisons goes through the roof.  That is why I came up with the
> canonical type propagation in the first place.
>
> So, the other (harder) approach I've taken is to not disable the
> on-the-fly canonicalization when the stack of aggregates being compared
> is non-empty, but rather, keep track of the types which are resolved as
> being equivalent, but that depend on recursive aggregate that were
> detected as such.
>
> I call these types "dependent types".  Let's consider one such dependent
> type D.  D's canonical type is the result of canonical propagation that
> happened as the recursive type (that D depends on) was on the stack of
> aggregates being compared.  D is labelled as having a "non-confirmed"
> canonical type.  If the recursive type it depends on is later considered
> not being equivalent to its counterpart, then the non-confirmed
> canonical of D is going to be "cancelled" and then D is going to be
> considered as being non canonicalized.
>
> This is done for D and all the types that depends on it.
>
> By doing this, the number of non canonicalized types is reduced to its
> absolute minimum and so comparisons are reasonably fast, even for an
> applications like the Kernel or Gimp.  Libraries usually have smaller
> type graphs so we don't usually see the speed issue there.  Unless it's
> llvm.so or libwebkit.so ;-)
>
> So, I wasn't going to get into doing this for DIEs right away because it
> takes a lot of time doing careful coding/debugging/profiling cycles.
>
> But I definitely think we'll need to do this to have perfectly precise
> canonicalizer.  My point was to get this in as it's an improvement
> already and then work on the subsequent patch with a cooler head.
>
> Does that make any sense?
>

I think it makes some sense, but it would take me some time to read
through, digest and reason about this properly.

Instead... I'm going to advertise my comparison algorithm again (for
which I've already done all the hard thinking and testing). :-) I'm
not sure how directly applicable it is to canonicalisation, but there
is the *potential* to eliminate all redundant comparisons.

Problem statement: How to decide if two DIEs (let's call them nodes)
are equivalent, when nodes can depend on other nodes recursively and
with cycles, without comparing each pair of nodes more than once.

Nodes can have attributes (like kind (struct, union), size etc.) and
labelled edges (like member x of type y, pointer to type z etc.).
Equivalence means that node attributes are equivalent, nodes have the
same edge labels and following matching labelled edges finds
equivalent nodes.

The C and C++ type systems have various wrinkles, but the overall
structure of equivalence testing involves this kind of matching of
attributes and following edges to other nodes.

Equivalence of two nodes means we cannot find any difference, no
matter at what depth we decide to terminate recursion.

The algorithm has the following components.

1. a "known results" map from comparisons (node pairs) to equivalence
results (bool) - I think the equivalent thing in libabigail may be
on-the-fly canonicalisation itself, but this only stores true
equivalences for a subset of types and it may be better to cache all
outcomes; this needs to be part of reader_context
2. the path-based strongly-connected component algorithm factored into
2 helper functions and associated shared state (which should also be
part of the reader context) - these replace the
aggregates_being_compared data structure and its access functions
3. a comparison function that uses the data structures and helper
functions at the beginning and end with per-kind comparison logic
sandwiched in the middle (this is compare_dies)

If no comparison cycles are possible then a simple DFS works, however,
it's good to cache computed comparisons to avoid repeated work (in
case the DFS traverses a DAG rather than a tree). Such a comparison
function looks like this:

1. Check for a known result, and if present return it (this is a bit
like a check for an existing canonical DIE but additionally handles
the case where the DIEs are already known to be non-equivalent).
2. [doesn't exist yet]
3. Now do the actual comparison work updating result to false if a
difference is found (exactly as at present).
4. [doesn't exist yet]
5. Insert (comparison, result) into "known results".
6. Return result.

If cycles are possible, this DFS will go into an infinite loop (and
crash with a stack overflow). The comparison function needs to be
updated:

1. Check for a known result, and if present return it (this is a bit
like a check for an existing canonical DIE but additionally handles
the case where the DIEs are already known to be non-equivalent).
2. Call the first SCC helper to see if the comparison is already in
progress (this is a bit like the aggregates-being-compared check).
2a. If the SCC helper says the comparison is progress, return a
tentative "true" (equals) result (this is very similar to what happens
for aggregates, but it can be done for all kinds of DIE).
2b. Otherwise the SCC helper has returned an index which *must* be
used before returning from the comparison function.
3. Now do the actual comparison work updating result to false if a
difference is found (exactly as at present - though I did notice one
stray early return false which needs to be changed to result = false).
4. Pass the index to the second SCC helper function which returns a
set of comparisons.
4a. The empty set implies that comparisons are still in progress.
4b. A non-empty set means a strongly-connected component has been
found and all of its comparisons have completed.
5. Insert (comparison, result) into "known results" for each
comparison in the set.
6. Return result.

I'm not exactly sure what this means in terms of
on-the-fly-canonicalisation. I tried adapting the code at the end of
compare_dies that performs canonicalisation but something broke
somewhere. I also tried skipping on-the-fly canonicalisation
altogether but that didn't work either. So, just dropping in the SCC
helper and moving a small amount of code around doesn't quite work.
It's possible I need to fix how I got back from DIE offsets to DIEs.

I've noted similarities between the SCC approach and what you have in
place already. The SCC approach guarantees no repeated comparisons, so
it would be really great if it could be applied to the DWARF reader.

My SCC helper code is here:
https://cs.android.com/android/kernel/superproject/+/build-tools:external/stg/scc.h

A concrete comparison algorithm using the SCC helper is here (it
actually builds ABI diffs in this case, but that extra logic can be
ignored):
https://cs.android.com/android/kernel/superproject/+/build-tools:external/stg/stg.cc;drc=3b6b65e1a9190dab2eabb98701a716a895b8a7b1;l=366

I'll post a cleaned-up version of my attempt to integrate this code
tomorrow with notes about where the test suite blows up.

Cheers,
Giuliano.

> Thanks.
>
> --
> You are receiving this mail because:
> You reported the bug.

Comment 33 Dodji Seketeli 2022-03-03 11:43:59 UTC

Giuliano Procida via Libabigail <libabigail@sourceware.org> a écrit:


>> Does that make any sense?
>>
>
> I think it makes some sense, but it would take me some time to read
> through, digest and reason about this properly.
>
> Instead... I'm going to advertise my comparison algorithm again (for
> which I've already done all the hard thinking and testing). :-) I'm
> not sure how directly applicable it is to canonicalisation, but there
> is the *potential* to eliminate all redundant comparisons.

OK, I think this would be whole project in itself, as far as I am
concerned.  Right now, I am interested in fixing this issue in a 'best
effort' mode, keeping most of the existing infrastructure for the sake
of consistency and limiting the unintended impacts, release a 2.1
tarball and then we can talk about this further if you like.

I am not ditching what you are saying.  Just that this place (patch
review and this bug report) doesn't seem like the best place for this
right now.

Rather, a new comparison algorithm altogether being a project in itself
(IMHO), it'd be better to keep this discussion under its own "feature
request bug report", I'd say.

I understand if you don't have time to review what I am proposing.  I'll
thus go ahead and look at your proposal a bit later.

Thanks.

Comment 34 gprocida 2022-03-03 13:32:46 UTC

Hi.

compare_dies is almost exactly the right shape to be able to drop in the more sophisticated cycle-handling logic (a.k.a. SCC finder) and I've posted an RFC patch to show the small changes needed there and also to note that things don't quite work, as mentioned already.

If had a better idea as to what compute_ and get_canonical_die_offset are doing, I might be able to fix things myself, but I don't.

However, if you feel this is a separate project for another time, that's fair enough.

In particular, I have no expectation that dropping in the SCC finder and getting compare_dies working with it will do anything other than make the aggregates-being-compared logic more precise and complete. I don't think it will address this (declaration-only types) bug report in general.

Regards,
Giuliano.

Comment 35 gprocida 2022-06-08 15:43:12 UTC

I have bisected what I believe to be a (rare) exponential slowdown to this commit: https://sourceware.org/git/?p=libabigail.git;a=commit;h=bfd557040376d4e82e4e684f3745f1e55bafdd9b

The same pairs of DIEs get compared 100 000s of times (at least) and abidw does not terminate in any reasonable amount of time.

This behaviour is consistent across build environments and compilers and I have verified it on vanilla upstream libabigail at commit ebf8f98f3dc7b6e4fe2af9c1ccbbfee55ace6c54.

The test case is 200MB compressed, unfortunately, much of which is vmlinux. The bad behaviour kicks in when processing reaches a certain kernel module, as far as I could tell using strace.

I'm looking into what I can do make the test case available to you. There may be some complications here.