Infinite Stack Unwinding ARM

Johannes Stoelp
Tue Apr 4 12:54:00 GMT 2017

Hi folks,

While debugging Linux kernel code on an ARMv8 target, we ended up in an "infinite" loop when printing
the backtrace. We tracked this down to breaking in any function that originates in
el[01]_irq (at least, those are the ones we ran into).
Note: These functions (el[01]_irq) are handwritten assembly, and therefore gdb uses the
prologue analyzer (gdb/aarch64-tdep.c:aarch64_analyze_prologue(...)) to unwind to the previous frame.
We analyzed the stack unwinding further and found the root cause of this infinite loop.

Let's look at el1_irq for example to see what is happening in detail. While unwinding the stack,
gdb comes to the 'el1_irq' function and then starts analyzing the function prologue to further
unwind the stack.

gdb calculates the start and end of the function prologue as follows:
   start_prologue = 0xffffffc000083c80
   end_prologue   = 0xffffffc000083cd4
When the analyzer hits the first MRS instruction at 0xffffffc000083cc4, it stops, because the
analyzer does not support MRS.

/* el1_irq disassembly start */
Dump of assembler code for function el1_irq (arch/arm64/kernel/entry.S):
start_prologue --> 0xffffffc000083c80 <+0>:     sub     sp, sp, #0x30
                    0xffffffc000083c84 <+4>:     stp     x28, x29, [sp,#-16]!
                    0xffffffc000083c88 <+8>:     stp     x26, x27, [sp,#-16]!
                    0xffffffc000083c8c <+12>:    stp     x24, x25, [sp,#-16]!
                    0xffffffc000083c90 <+16>:    stp     x22, x23, [sp,#-16]!
                    0xffffffc000083c94 <+20>:    stp     x20, x21, [sp,#-16]!
                    0xffffffc000083c98 <+24>:    stp     x18, x19, [sp,#-16]!
                    0xffffffc000083c9c <+28>:    stp     x16, x17, [sp,#-16]!
                    0xffffffc000083ca0 <+32>:    stp     x14, x15, [sp,#-16]!
                    0xffffffc000083ca4 <+36>:    stp     x12, x13, [sp,#-16]!
                    0xffffffc000083ca8 <+40>:    stp     x10, x11, [sp,#-16]!
                    0xffffffc000083cac <+44>:    stp     x8, x9, [sp,#-16]!
                    0xffffffc000083cb0 <+48>:    stp     x6, x7, [sp,#-16]!
                    0xffffffc000083cb4 <+52>:    stp     x4, x5, [sp,#-16]!
                    0xffffffc000083cb8 <+56>:    stp     x2, x3, [sp,#-16]!
                    0xffffffc000083cbc <+60>:    stp     x0, x1, [sp,#-16]!
                    0xffffffc000083cc0 <+64>:    add     x21, sp, #0x120
   analyze_stop --> 0xffffffc000083cc4 <+68>:    mrs     x22, elr_el1
                    0xffffffc000083cc8 <+72>:    mrs     x23, spsr_el1
                    0xffffffc000083ccc <+76>:    stp     x30, x21, [sp,#240]
                    0xffffffc000083cd0 <+80>:    stp     x22, x23, [sp,#256]
   end_prologue --> 0xffffffc000083cd4 <+84>:    msr     daifclr, #0x8
                    0xffffffc000083cd8 <+88>:    ldr     x1, 0xffffffc000084210 <handle_arch_irq>
                    0xffffffc000083cdc <+92>:    mov     x0, sp
                    0xffffffc000083ce0 <+96>:    blr     x1
                    0xffffffc000083ce4 <+100>:   mov     x28, sp
/* el1_irq disassembly end */

The analysis never sees the store of the link register x30 (at 0xffffffc000083ccc) and therefore
assumes that the link register still holds a valid return address for el1_irq.
The link register actually has the value
   x30 = 0xffffffc000083ce4 (pointing into el1_irq, the return address set by the blr at <+96>)
and the unwinder uses this value to determine the caller of el1_irq, which, unsurprisingly, is
again el1_irq.
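
To make the failure mode concrete, here is a minimal sketch of a prologue scan that gives up at the first instruction it does not understand. It is not gdb's code: the toy_* types and functions are hypothetical and model only the instruction kinds that appear in the el1_irq prologue above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified instruction model (not gdb's decoder).  */
enum toy_kind { TOY_SUB_SP, TOY_STP_PREINDEX, TOY_ADD, TOY_STP_OFFSET, TOY_MRS };

struct toy_insn {
  enum toy_kind kind;
  int rt1, rt2;   /* registers stored by an STP, -1 otherwise */
  int64_t imm;    /* immediate or offset */
};

struct toy_result {
  int64_t sp_offset;   /* symbolic SP relative to its value on entry */
  bool lr_saved;       /* did the scan see a store of x30? */
  size_t stopped_at;   /* index of the first instruction not analyzed */
};

/* Walk the prologue and stop at the first unsupported instruction.  */
struct toy_result toy_scan (const struct toy_insn *insns, size_t n)
{
  struct toy_result r = { 0, false, n };
  for (size_t i = 0; i < n; i++) {
    const struct toy_insn *in = &insns[i];
    switch (in->kind) {
    case TOY_SUB_SP:
      r.sp_offset -= in->imm;
      break;
    case TOY_STP_PREINDEX:
      r.sp_offset += in->imm;   /* imm is negative: pre-decrement */
      if (in->rt1 == 30 || in->rt2 == 30)
        r.lr_saved = true;
      break;
    case TOY_STP_OFFSET:
      if (in->rt1 == 30 || in->rt2 == 30)
        r.lr_saved = true;
      break;
    case TOY_ADD:
      break;                    /* does not touch SP or x30 here */
    default:                    /* TOY_MRS and anything else: give up */
      r.stopped_at = i;
      return r;
    }
  }
  return r;
}

/* Fill OUT with a model of the 21 prologue instructions listed above.  */
size_t toy_el1_irq_prologue (struct toy_insn *out, size_t cap)
{
  size_t n = 0;
  assert (cap >= 21);
  out[n++] = (struct toy_insn) { TOY_SUB_SP, -1, -1, 0x30 };
  for (int r = 28; r >= 0; r -= 2)   /* stp x28,x29 ... stp x0,x1 */
    out[n++] = (struct toy_insn) { TOY_STP_PREINDEX, r, r + 1, -16 };
  out[n++] = (struct toy_insn) { TOY_ADD, -1, -1, 0x120 };
  out[n++] = (struct toy_insn) { TOY_MRS, -1, -1, 0 };
  out[n++] = (struct toy_insn) { TOY_MRS, -1, -1, 0 };
  out[n++] = (struct toy_insn) { TOY_STP_OFFSET, 30, 21, 240 };
  out[n++] = (struct toy_insn) { TOY_STP_OFFSET, 22, 23, 256 };
  return n;
}
```

Scanning the modelled prologue stops at index 17 (the first mrs, at <+68>) with lr_saved still false, which is the state that lets the stale x30 be taken as the return address.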

The other issue is that gdb doesn't recognize this situation as "Backtrace stopped: previous frame
identical to this frame (corrupt stack?)". This in turn happens because the symbolic value
of the stack pointer register looks as follows after the prologue analyzer stops:
   (gdb) p/x regs[AARCH64_SP_REGNUM]
   $3 = {
      kind = 0x2,               // pv_register
      reg = 0x1f,               // 31
      k = 0xfffffffffffffee0    // #-0x120
   }
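
The value of k can be checked by hand against the disassembly: the analyzed instructions are one "sub sp, sp, #0x30" and fifteen pre-decrementing stp instructions of 16 bytes each. A small sketch of that arithmetic (the function names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* SP offset accumulated by the instructions the analyzer got through:
   one "sub sp, sp, #0x30" followed by fifteen "stp ..., [sp,#-16]!".  */
int64_t analyzed_sp_offset (void)
{
  int64_t k = -0x30;
  for (int i = 0; i < 15; i++)
    k -= 16;
  assert (k < 0);   /* the stack only grew downwards */
  return k;
}

/* The same value reinterpreted as unsigned bits, as p/x prints it.  */
uint64_t offset_as_hex (int64_t k)
{
  return (uint64_t) k;
}
```

0x30 + 15 * 0x10 = 0x120, which matches both the "add x21, sp, #0x120" at <+64> and the two's complement value 0xfffffffffffffee0 in the dump.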

gdb then tries to use this stack pointer as frame pointer.

/* gdb/aarch64-tdep.c:aarch64_analyze_prologue(...) start */
   if (pv_is_register (regs[AARCH64_FP_REGNUM], AARCH64_SP_REGNUM)) {
      /* Frame pointer is fp.  Frame size is constant.  */
      cache->framereg = AARCH64_FP_REGNUM;
      cache->framesize = -regs[AARCH64_FP_REGNUM].k;
   } else if (pv_is_register (regs[AARCH64_SP_REGNUM], AARCH64_SP_REGNUM)) {
      /* Try the stack pointer.  */
      cache->framesize = -regs[AARCH64_SP_REGNUM].k;
      cache->framereg = AARCH64_SP_REGNUM;
   } else {
      /* We're just out of luck.  We don't know where the frame is.  */
      cache->framereg = -1;
      cache->framesize = 0;
   }
/* gdb/aarch64-tdep.c:aarch64_analyze_prologue(...) end */

This, in turn, is used to calculate the previous stack pointer.

/* gdb/aarch64-tdep.c:aarch64_make_prologue_cache_1 (...) start */
   aarch64_scan_prologue (this_frame, cache);

   if (cache->framereg == -1)
      return;

   unwound_fp = get_frame_register_unsigned (this_frame, cache->framereg);
   if (unwound_fp == 0)
      return;

   cache->prev_sp = unwound_fp + cache->framesize;
/* gdb/aarch64-tdep.c:aarch64_make_prologue_cache_1 (...) end */

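The result is that gdb interprets the situation as a recursion: every unwound frame has a pc inside
el1_irq, but its prev_sp is 0x120 higher than that of the frame before, so no two consecutive frames
look identical.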
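
The effect can be reproduced with a small model of the unwind loop. This is only a sketch under assumptions: toy_same_frame stands in for gdb's frame comparison by comparing pc and sp directly, and the starting sp used in the note below is an arbitrary example value, not one taken from the target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an unwound frame.  */
struct toy_frame { uint64_t pc; uint64_t sp; };

/* One unwind step as described above: the caller's pc is the stale
   link register, the caller's sp is this frame's sp plus framesize.  */
struct toy_frame toy_unwind (struct toy_frame f, uint64_t stale_lr,
                             uint64_t framesize)
{
  struct toy_frame prev = { stale_lr, f.sp + framesize };
  return prev;
}

/* Assumption: two frames count as identical if pc and sp both match.  */
bool toy_same_frame (struct toy_frame a, struct toy_frame b)
{
  return a.pc == b.pc && a.sp == b.sp;
}

/* Unwind until an identical previous frame is found or LIMIT frames
   have been produced.  Returns the number of frames produced.  */
unsigned toy_backtrace (struct toy_frame start, uint64_t stale_lr,
                        uint64_t framesize, unsigned limit)
{
  unsigned n = 0;
  struct toy_frame cur = start;
  assert (limit > 0);
  while (n < limit) {
    struct toy_frame prev = toy_unwind (cur, stale_lr, framesize);
    n++;
    if (toy_same_frame (prev, cur))
      break;            /* "previous frame identical to this frame" */
    cur = prev;
  }
  return n;
}
```

Starting at pc 0xffffffc000083ce4 with a framesize of 0x120, the loop only ends when it reaches the limit, because sp changes on every step. With a framesize of 0 the very first previous frame is identical and the loop stops after one frame.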

* Has anyone run into similar situations with the ARM prologue analyzer?
* Has anyone worked on extending the prologue analyzer to support system registers, and therefore
  instructions like MRS?
* A defensive workaround I'm currently using is to stop the stack unwinding when the analyzer hits
  an unsupported instruction.
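
For illustration only, one possible shape of such a workaround, written against a toy cache structure rather than gdb's real one. The names toy_cache, toy_pick_frame and the hit_unsupported flag are hypothetical; the actual change is not shown in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_SP_REGNUM = 31 };

/* Hypothetical miniature of the fields used in the excerpts above.  */
struct toy_cache {
  int framereg;        /* -1 means "we don't know where the frame is" */
  int64_t framesize;
};

/* Pick the frame register from the symbolic SP offset SP_K, but refuse
   to guess when the scan was cut short by an unsupported instruction.  */
void toy_pick_frame (struct toy_cache *cache, int64_t sp_k,
                     bool hit_unsupported)
{
  assert (cache != NULL);
  if (hit_unsupported) {
    cache->framereg = -1;
    cache->framesize = 0;
    return;
  }
  cache->framereg = TOY_SP_REGNUM;
  cache->framesize = -sp_k;
}

/* Mirrors the "framereg == -1" early exit: no frame, no unwinding.  */
bool toy_can_unwind (const struct toy_cache *cache)
{
  return cache->framereg != -1;
}
```

The trade-off is that the backtrace then ends at el1_irq instead of continuing into the interrupted context, which is defensive but loses the frames behind the exception entry.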

