Found by valgrind: ==173857== 64 bytes in 2 blocks are definitely lost in loss record 3,155 of 8,232 ==173857== at 0x480B7BB: malloc (vg_replace_malloc.c:380) ==173857== by 0x90143DC: pc_record (in /path/to/python_extension_module.cpython-38-x86_64-linux-gnu.so) ==173857== by 0x9019ABC: walk_children (in /path/to/python_extension_module.cpython-38-x86_64-linux-gnu.so) ==173857== by 0x901974A: __libdw_visit_scopes (in /path/to/python_extension_module.cpython-38-x86_64-linux-gnu.so) ==173857== by 0x9019A69: walk_children (in /path/to/python_extension_module.cpython-38-x86_64-linux-gnu.so) ==173857== by 0x901974A: __libdw_visit_scopes (in /path/to/python_extension_module.cpython-38-x86_64-linux-gnu.so) ==173857== by 0x9014691: dwarf_getscopes (in /path/to/python_extension_module.cpython-38-x86_64-linux-gnu.so) `dwarf_getscopes` ends with: ``` if (result > 0) *scopes = a.scopes; return result; ``` but this is incorrect, since `a.scopes` may be non-NULL even if `result` is <= 0 and is leaked in this case since no reference is retained to it. Seems like this needs to be: ``` if (result > 0) *scopes = a.scopes; else free(a.scopes); return result; ```
Thanks. That seems correct. If in the first visit a.scopes was allocated, but an error occurs later or in the second visit a.scopes isn't freed. Except in the case of a realloc failure. It would make sense to also not dealloc in that failure case so we can always free on failure before we return the error as you suggest. BTW. Do you know what why result was negative (I assume it was negative, it could in theory also be zero, but that would be surprising if that would cause the leak)?
> it could in theory also be zero In this case, the result was 0 (I am working with Matt)
(In reply to Pablo Galindo Salgado from comment #2) > > it could in theory also be zero > > In this case, the result was 0 (I am working with Matt) Interesting, I assumed it was on a failure path. My proposed patch would have been: diff --git a/libdw/dwarf_getscopes.c b/libdw/dwarf_getscopes.c index 5662eecf..676d62f3 100644 --- a/libdw/dwarf_getscopes.c +++ b/libdw/dwarf_getscopes.c @@ -100,7 +100,7 @@ origin_match (unsigned int depth, struct Dwarf_Die_Chain *die, void *arg) Dwarf_Die *scopes = realloc (a->scopes, nscopes * sizeof scopes[0]); if (scopes == NULL) { - free (a->scopes); + /* a->scopes will be freed by dwarf_getscopes on error. */ __libdw_seterrno (DWARF_E_NOMEM); return -1; } @@ -198,6 +198,8 @@ dwarf_getscopes (Dwarf_Die *cudie, Dwarf_Addr pc, Dwarf_Die **scopes) if (result > 0) *scopes = a.scopes; + else if (result < 0) + free (a.scopes); return result; } But if the result is zero I don't believe I fully understand yet how the leak happens.
The first result is already 0 in int result = __libdw_visit_scopes (0, &cu, NULL, &pc_match, &pc_record, &a); I think this is because walk_children finds no "real children" so it returns with the last return (ret < 0 ? -1 : 0;) every single time.
(In reply to Mark Wielaard from comment #3) > But if the result is zero I don't believe I fully understand yet how the > leak happens. So I believe this may happen if we found an inlined subroutine with pc_record, but then don't find the abstract origin DIE with origin_match. That does feels like it should be an error, but (currently) isn't. Maybe this happens if the reference to the abstract origin is in a partial unit that isn't imported into the current CU. It would be really nice to have an example of the DIE tree when this happens.
What would be the best way to provide that?
The dwarf trie comes from /usr/lib64/libc-2.17.so in a RHEL7 system
If you give me a patch adding print statements to libdw/dwarf_getscopes.c and libdw/libdw_visit_scopes.c I can give you the output.
Created attachment 14302 [details] dwarf_getscopes debug trace Maybe the following would give us a clue about what is really happening. It should print the DIE offsets that are being visited.
(In reply to Pablo Galindo Salgado from comment #8) > If you give me a patch adding print statements to libdw/dwarf_getscopes.c > and libdw/libdw_visit_scopes.c I can give you the output. See attachment 14302 [details] on comment #9.
Unfortunately I cannot easily patch dwarf_getscopes and run the experiment :(
Hopefully I can report back with the output in a few days. Apologies for the delay!
Hummmmm, it seems that I cannot attach the output because is 87 MB of output.
I can tell you that the file finishes with: pc_record => [5d9f] pc_record <= [5d9f] 3 __libdw_visit_scopes result 3, scopes 0xa689d0, inlined 0, nscopes 3 :)
Here is the parts that reference the parts: __libdw_visit_scopes for pc 0xf5b23, cudie [29ce89] __libdw_visit_scopes result 0, scopes (nil), inlined 0, nscopes 0 __libdw_visit_scopes for pc 0x252208, cudie [b] __libdw_visit_scopes result 0, scopes 0x1a761b0, inlined 2, nscopes 3 __libdw_visit_scopes for origin_match 0x252208, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1a761b0, inlined 2, nscopes 3 __libdw_visit_scopes for pc 0x13dd69, cudie [b] __libdw_visit_scopes result 2, scopes 0x1b83810, inlined 0, nscopes 2 __libdw_visit_scopes for pc 0x131d71, cudie [b] __libdw_visit_scopes result 0, scopes 0x1a94580, inlined 7, nscopes 2 __libdw_visit_scopes for origin_match 0x131d71, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1a94580, inlined 7, nscopes 2 __libdw_visit_scopes for pc 0x13e992, cudie [b] __libdw_visit_scopes result 0, scopes 0x1b49b40, inlined 2, nscopes 1 __libdw_visit_scopes for origin_match 0x13e992, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1b49b40, inlined 2, nscopes 1 __libdw_visit_scopes for pc 0x12d6cb, cudie [b] __libdw_visit_scopes result 0, scopes 0x1ad8ba0, inlined 7, nscopes 2 __libdw_visit_scopes for origin_match 0x12d6cb, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1ad8ba0, inlined 7, nscopes 2 __libdw_visit_scopes for pc 0x12c246, cudie [b] __libdw_visit_scopes result 0, scopes 0x1a21b70, inlined 2, nscopes 1 __libdw_visit_scopes for origin_match 0x12c246, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1a21b70, inlined 2, nscopes 1 __libdw_visit_scopes for pc 0x12bef0, cudie [b] __libdw_visit_scopes result 2, scopes 0x1e44600, inlined 0, nscopes 2 __libdw_visit_scopes for pc 0x1e2e22, cudie [b] __libdw_visit_scopes result 0, scopes 0x1c0aa70, inlined 2, nscopes 1 __libdw_visit_scopes for origin_match 0x1e2e22, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1c0aa70, inlined 2, nscopes 1 __libdw_visit_scopes for pc 0x20ca3c, cudie [b] __libdw_visit_scopes result 2, scopes 0x1de8920, inlined 0, nscopes 2 __libdw_visit_scopes for pc 0x1ff6aa, cudie [b] __libdw_visit_scopes result 2, scopes 0x1ccc3f0, inlined 0, nscopes 2 __libdw_visit_scopes for pc 0x8d1a9, cudie [b] __libdw_visit_scopes result 2, scopes 0x1eb3c20, inlined 0, nscopes 2 __libdw_visit_scopes for pc 0x1faad0, cudie [b] __libdw_visit_scopes result 0, scopes 0x1eb18b0, inlined 2, nscopes 2 __libdw_visit_scopes for origin_match 0x1faad0, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1eb18b0, inlined 2, nscopes 2 __libdw_visit_scopes for pc 0x1f4538, cudie [b] __libdw_visit_scopes result 0, scopes 0x1eaad70, inlined 4, nscopes 2 __libdw_visit_scopes for origin_match 0x1f4538, cudie [b] __libdw_visit_scopes origin_match result 0, scopes 0x1eaad70, inlined 4, nscopes 2 __libdw_visit_scopes for pc 0x1d61a8, cudie [b] __libdw_visit_scopes result 2, scopes 0x1ef26b0, inlined 0, nscopes 2 __libdw_visit_scopes for pc 0x22554, cudie [22ec] __libdw_visit_scopes result 3, scopes 0x1917ca0, inlined 0, nscopes 3
Sorry this is a bit difficult to debug. Would you be able to show the start of the log? It should show how you are calling dwarf_getscopes. I think I have found the RHEL7 glibc debug file. So I can match up some of the DIE offsets, but I have some trouble understanding what is going on. It would also be helpful to see the other log statements in dwarf_getscopes to understand what the first and second calls to __libdw_visit_scopes have as arguments and return value.
So the code here changed a little with this patch: commit b7c7d8776ed46e2237d18fb15c6b72e83cfa259b Author: Mark Wielaard <mark@klomp.org> Date: Sun Jan 22 00:31:57 2023 +0100 libdw: Search for abstract origin in the correct CU With gcc -flto the abstract origin of an inlined subroutine could be in a different CU. dwarf_getscopes might return an empty scope if it cannot find the abstract origin scope. So make sure to search in the We also tried to add the origin match in pc_record directly in the current inlined scope. This always failed, causing to do a needless traversal, followed by the full CU scan in dwarf_getscopes. Just always stop the pc_record search and then do the CU origin_match in dwarf_getscopes. Signed-off-by: Mark Wielaard <mark@klomp.org> Which makes the condition of the first check slightly different: - if (result == 0 && a.scopes != NULL) - result = __libdw_visit_scopes (0, &cu, NULL, &origin_match, NULL, &a); + if (result >= 0 && a.scopes != NULL && a.inlined > 0) + { + /* We like the find the inline function's abstract definition + scope, but that might be in a different CU. */ + cu.die = CUDIE (a.inlined_origin.cu); + result = __libdw_visit_scopes (0, &cu, NULL, &origin_match, NULL, &a); + } So with that I think my proposed patch in comment #3 might work. But I have not been able to replicate the issue. So cannot easily check.
I could replicate a leak when there was an error looking up the inlined scopes, so I submitted the patch to fix that: https://inbox.sourceware.org/elfutils-devel/20230222223901.1089881-1-mark@klomp.org/T/#u Hopefully that also fixes the original issue (which I have been unable to replicate).
I hope this (plus the other changes to dwarf_getscopes) will fix the original issue too. It should be part of 0.189 (which should be released this week). If it didn't solve the issue please feel free to reopen this bug or file a new one. commit e24d8a4a3ea106608bb3e8d33c4639cf71d0f08d Author: Mark Wielaard <mark@klomp.org> Date: Wed Feb 22 23:34:00 2023 +0100 libdw: Fix dwarf_getscopes memory leak on error When there is an error in dwarf_getscopes after the initial scopes have been allocated, e.g. when looking for the inlined scopes, then the scopes would leak. Fix this by explicitly free the scopes on error. https://sourceware.org/bugzilla/show_bug.cgi?id=29434 Signed-off-by: Mark Wielaard <mark@klomp.org>