How to debug broken unwinding?

Mark Wielaard mark@klomp.org
Wed Jun 14 18:27:00 GMT 2017


Hi Milian,

On Fri, 2017-06-02 at 17:03 +0200, Milian Wolff wrote:
> Some more debugging and going after my gut feeling brings me to the following 
> conclusion: The real issue seems to be the on-demand reporting of the elf 
> file. We used to do:
> 
>        Dwarf_Addr pc;
>        bool isactivation;
>  
>        if (!dwfl_frame_pc(state, &pc, &isactivation)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
>        // report the module before we query for isactivation
>        report_module(pc, ui);
> 
> This looks safe and fine and actually works most of the time. But passing a 
> non-null isactivation flag to dwfl_frame_pc potentially leads to a second 
> unwind step, before we got the change to report the module! I can workaround 
> this by instead doing
> 
>        Dwarf_Addr pc;
>        bool isactivation;
>  
>        if (!dwfl_frame_pc(state, &pc, NULL)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
>        // report the module before we query for isactivation
>        report_module(pc, ui);
> 
>        if (!dwfl_frame_pc(state, &pc, &isactivation)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
> This fixes all the issues in my original email. So sorry for the noise - it 
> doesn't see as if the unwinding code in elfutils is broken - quite the 
> contrary! It's just our misuse of the API that is to blame.

That is some very nice detective work.
The isactivation code is tricky. The logic is:

isactivation = (frame is initial_frame
                || frame is signal_frame
		|| unwind_frame is signal_frame)

The idea being that normally a frame isn't an activation.
So the given pc is really the return address, not the call address.
But if it is the initial frame or a frame caused by a signal, then the
pc really is where the code was when this frame was active.

The tricky case is that last case. If the previous frame is a signal
frame then this frame is also an activation (so you don't need to
subtract 1 to get the actual address that was executing). I am not sure
I fully understand anymore why that is. This must be a frame that is
itself not an initial or signal frame, but that is called (activated)
from a signal frame.

But assuming that is the correct semantics then your original
observation seems to explain what is going on. The code doesn't account
for the fact that during an unwind other modules can be reported that
might make it able to unwind the current frame. And it caches the
failure state of the unwind of the current frame, so it doesn't retry
when asked.

I think we could reset the state of the current frame if any more
modules are reported. But is it possible for your parser or perf to
report modules as soon as it knows about them, not lazily during an
unwind? Is reporting lazily at the last possible moment an optimization
that really helps? I would assume that you end up reporting all modules
anyway. So it actually seems easier to report everything upfront.

> May I suggest though to move the isactivation code into a separate function to 
> prevent this issue from arising in the future? I.e. it would be nice if the 
> code above could read:
> 
> 
>        Dwarf_Addr pc;
>        bool isactivation;
>  
>        if (!dwfl_frame_pc(state, &pc)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
>        // report the module before we query for isactivation
>        report_module(pc, ui);
> 
>        if (!dwfl_frame_is_activation(state)) {
>            --pc; 
>        }

That is a good suggestion in any case. Maybe also introduce
dwfl_frame_is_signal_frame () and dwfl_frame_is_initial_frame () in case
the user wants to know the exact details.

Thanks,

Mark



More information about the Elfutils-devel mailing list