Bug 13714

Summary: panic when sampling backtrace() in timer.profile
Product: systemtap Reporter: brendan.gregg
Component: runtimeAssignee: Mark Wielaard <mark>
Status: RESOLVED FIXED    
Severity: normal CC: mark
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Attachments: Make sure REG_STATE.cfa_is_expr is always set correctly add more sanity checks
x86 entry_64.S cfi fixup

Description brendan.gregg 2012-02-20 06:29:39 UTC
The kernel can panic when profiling stack traces on Fedora 16.

window1 (sample workload):

# while :; do tar cf /dev/null /usr; done

window2 (profiling):

# stap -D MAXTRACE=100 -D MAXSTRINGLEN=4096 -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; } probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }'

This was tried about a dozen times, and paniced on four of them.

The system had kdump configured and collected a crash dump.  Some details:

# crash /var/crash/2012-02-19-23\:08/vmcore /usr/lib/debug/lib/modules/3.2.6-3.fc16.x86_64/vmlinux
[...]
      KERNEL: /usr/lib/debug/lib/modules/3.2.6-3.fc16.x86_64/vmlinux
    DUMPFILE: /var/crash/2012-02-19-23:08/vmcore
        CPUS: 1
        DATE: Sun Feb 19 15:04:50 2012
      UPTIME: 00:05:41
LOAD AVERAGE: 1.39, 0.78, 0.32
       TASKS: 72
    NODENAME: 9d219ce8-cf52-409f-a14a-b210850f3231
     RELEASE: 3.2.6-3.fc16.x86_64
     VERSION: #1 SMP Mon Feb 13 20:35:42 UTC 2012
     MACHINE: x86_64  (2393 Mhz)
      MEMORY: 1 GB
       PANIC: "[  341.694043] Oops: 0000 [#1] SMP " (check log for details)
         PID: 3922
     COMMAND: "tar"
        TASK: ffff880037878000  [THREAD_INFO: ffff88001783e000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)
[...]
crash> bt -a
PID: 3922   TASK: ffff880037878000  CPU: 0   COMMAND: "tar"
 #0 [ffff88003fc037b0] machine_kexec at ffffffff81038bda
 #1 [ffff88003fc03820] crash_kexec at ffffffff810b8482
 #2 [ffff88003fc038f0] oops_end at ffffffff815e2b08
 #3 [ffff88003fc03920] no_context at ffffffff815d6a05
 #4 [ffff88003fc03980] __bad_area_nosemaphore at ffffffff815d6bdd
 #5 [ffff88003fc039e0] bad_area_nosemaphore at ffffffff815d6c0f
 #6 [ffff88003fc039f0] do_page_fault at ffffffff815e55e6
 #7 [ffff88003fc03b00] page_fault at ffffffff815e20b5
    [exception RIP: get_uleb128+83]
    RIP: ffffffffa00773f3  RSP: ffff88003fc03bb8  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000007  RSI: ffffffffffffffff  RDI: ffff88003fc03bf0
    RBP: ffff88003fc03bb8   R8: 0000000000000030   R9: 0000000000000007
    R10: 0000000000000001  R11: 0000000000000040  R12: ffff88003aae1088
    R13: ffff88001783ffd8  R14: 0000000000000000  R15: ffff88003aae1088
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #8 [ffff88003fc03bc0] compute_expr at ffffffffa007c63f [stap_ddfda8cbbf4994e727c76bf533732ecf_3874]
 #9 [ffff88003fc03c70] unwind_frame at ffffffffa007df6c [stap_ddfda8cbbf4994e727c76bf533732ecf_3874]
#10 [ffff88003fc03dc0] probe_1993 at ffffffffa007fcc6 [stap_ddfda8cbbf4994e727c76bf533732ecf_3874]
#11 [ffff88003fc03e40] enter_profile_probes at ffffffffa007b2c5 [stap_ddfda8cbbf4994e727c76bf533732ecf_3874]
#12 [ffff88003fc03e70] profile_tick at ffffffff81098ea9
#13 [ffff88003fc03e90] tick_sched_timer at ffffffff810a0e4e
#14 [ffff88003fc03ec0] __run_hrtimer at ffffffff81093e90
#15 [ffff88003fc03f20] hrtimer_interrupt at ffffffff8109480b
#16 [ffff88003fc03f90] smp_apic_timer_interrupt at ffffffff815ec589
#17 [ffff88003fc03fb0] apic_timer_interrupt at ffffffff815ea45e
--- <IRQ stack> ---
#18 [ffff88001783fed0] apic_timer_interrupt at ffffffff815ea45e
    RIP: ffffffffffffff10  RSP: 0000000000000203  RFLAGS: 00000010
    RAX: ffff88003fc13780  RBX: 00000000008c2ed0  RCX: 0000000000000000
    RDX: ffff88001783e000  RSI: 0000000000000008  RDI: ffff88001783ff78
    RBP: ffffffff8119691f   R8: 0000000000000001   R9: 0000000000000001
    R10: 0000000000000001  R11: ffffffff811839f2  R12: ffff88001783fef8
    R13: 00007fff549ffff0  R14: ffff88001783ff40  R15: ffffffff81196824
    ORIG_RAX: 000000000000fe2e  CS: ffffffff815e1e8d  SS: ffff88001783ff88
bt: WARNING: possibly bogus exception frame
crash> dis get_uleb128+83
0xffffffffa00773f3 <get_uleb128+83>:    movzbl (%rdx),%r10d

Version and system details:

# stap -V
Systemtap translator/driver (version 1.6/0.152 non-git sources)
Copyright (C) 2005-2011 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR
TR1_UNORDERED_MAP NLS
# uname -a
Linux 9d219ce8-cf52-409f-a14a-b210850f3231 3.2.6-3.fc16.x86_64 #1 SMP Mon Feb
13 20:35:42 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
Fedora release 16 (Verne)

This is Fedora as a guest running under KVM.  SystemTap was installed via yum.
Comment 1 Mark Wielaard 2012-02-20 14:26:42 UTC
After some poking (also needed to increase MAXMAPENTRIES) I could finally replicate it on 3.3.0-0.rc1.git6.1.fc17.x86_64 with systemtap-devel-1.7-1.fc17.x86_64

[ 1305.976762] stap_e93c7971f27dbaecad41f45add08f2ea_2319: systemtap: 1.7/0.152, base: ffffffffa0662000, memory: 4291data/40text/86ctx/2058net/244893alloc kb, probes: 2
[ 1309.808090] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007
[ 1309.809029] IP: [<ffffffffa06620c4>] get_uleb128+0x54/0x80 [stap_e93c7971f27dbaecad41f45add08f2ea_2319]
Comment 2 Mark Wielaard 2012-02-20 22:08:21 UTC
Found the root cause of this issue. It happens when a DW_CFA operation that defines the CFA as dwarf expression is followed by a DW_CFA operation that (re)defined the CFA as register+offset. In that case we forgot the reset the REG_STATE.cfa_is_expr flag which made compute_expr() interpret the reg/offset as expr pointer (because they share their values in a union).

While adding more sanity checks to make sure we catch such issues I found what looks like bad CFI in the x86_64 kernel in common_interrupt (arch/x86/kernel/entry_64.S) which defines CFI "by hand" and has a CFI_DEF_CFA_REGISTER following a def_cfa_expression, which is invalid.
Comment 3 Mark Wielaard 2012-02-21 10:28:56 UTC
Created attachment 6233 [details]
Make sure REG_STATE.cfa_is_expr is always set correctly add more sanity checks

Testing the following patch on the systemtap runtime side.
Comment 4 Mark Wielaard 2012-02-21 12:50:07 UTC
Created attachment 6234 [details]
x86 entry_64.S cfi fixup

This is the kernel patch that fixes the cfi for common_interrupt on x86_64.
Comment 5 Mark Wielaard 2012-02-21 14:45:38 UTC
commit 64b0cff3bee6b00cb4193ed887439c66055f85b4
Author: Mark Wielaard <mjw@redhat.com>
Date:   Tue Feb 21 15:08:58 2012 +0100

    PR13714 - Make sure REG_STATE.cfa_is_expr is always set correctly.
    
    runtime/unwind.c (processCFI): Always set REG_STATE.cfa_is_expr and
    add new sanity checks to make sure the cfa definition rules are sane.
    
    Since the cfa expr pointer and cfa register/offset rule shared a union
    not setting REG_STATE.cfa_is_expr could result in compute_expr ()
    wrongly being called and using the register/offset as expr pointer.

Kernel patch send upstream:
https://lkml.org/lkml/2012/2/21/154

Note that with the systemtap runtime patch any wrong DW_CFA_def sequence is detected. So you don't need an updated kernel. You do need the systemtap runtime patch however also with a kernel that has the kernel-cfi-fix patch applied. There could be other (correct) CFI sequences that did trigger the bug.
Comment 6 brendan.gregg 2012-02-21 23:15:00 UTC
Great - thanks for the quick fix!
Comment 7 Jackie Rosen 2014-02-16 19:41:58 UTC Comment hidden (spam)