Summary: | SystemTap userspace marker in shared libraries cause probed program crash | ||
---|---|---|---|
Product: | systemtap | Reporter: | William Cohen <wcohen> |
Component: | runtime | Assignee: | Unassigned <systemtap> |
Status: | RESOLVED FIXED | ||
Severity: | critical | CC: | scox |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Bug Depends on: | |||
Bug Blocks: | 10907 | ||
Attachments: |
A short systemtap script to trigger the problem
Script to find the address of the sigbus faults
Output of script with -DDEBUG_TASK_FINDER_VMA
scox's proposed patch to address this issue |
Description
William Cohen
2010-01-05 21:44:40 UTC
Created attachment 4495 [details]
A short systemtap script to trigger the problem
This script is just supposed to print out information and should not interfere
with the operation of a python program running in userspace.
When attempting to run /usr/bin/python under gdb, I get the following:
$ gdb /usr/bin/python
GNU gdb (GDB) Fedora (7.0.1-19.fc12)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/python...Reading symbols from /usr/lib/debug/usr/bin/python.debug...done.
done.
(gdb) run
Starting program: /usr/bin/python
During startup program terminated with signal SIGBUS, Bus error.
(gdb)
Created attachment 4496 [details]
Script to find the address of the sigbus faults
To allow comparison between runs, address randomization was turned off with:
sysctl -w kernel.randomize_va_space=0
The attached sigbus.stp script was run:
$ stap /tmp/sigbus.stp |grep sig
2735979:32542:0x00007ffff7fc965c:r:sigbus:2
2736032:32542:0x00007ffff7fc965e:r:sigbus:2
Got a pmap of a python process while the problem script was not running:
00007ffff7c22000 1464K r-x-- /usr/lib64/libpython2.6.so.1.0
00007ffff7d90000 2044K ----- /usr/lib64/libpython2.6.so.1.0
00007ffff7f8f000 236K rw--- /usr/lib64/libpython2.6.so.1.0
00007ffff7fca000 56K rw--- [ anon ]
Those addresses appear to be close to the end of the rw region of
/usr/lib64/libpython2.6.so.1.0
$ nm /usr/lib/debug//usr/lib64/libpython2.6.so.1.0.debug |grep sema
00000000003a765c B function__entry_semaphore
00000000003a765e B function__return_semaphore
The problem appears to be that the code writing to
function__entry_semaphore and function__return_semaphore is causing
the sigbus error. The number of faults is equal to the number of
probes on a point.
From the sigbus.stp script output it looks like a get_user() macro is causing
the sigbus.
Created attachment 4500 [details]
Output of script with -DDEBUG_TASK_FINDER_VMA
Ran the script with the following command line to get an idea of what is going on in taskfinder:
/home/wcohen/research/profiling/pytrace.stp -DDEBUG_TASK_FINDER_VMA -c python >& /tmp/x
Created attachment 4502 [details]
scox's proposed patch to address this issue
This is the proposed patch that scox developed to address the problem of the
get_user() occurring at the wrong time. The userspace markers in python work
with this patch.
The old implementation of userspace probing used __access_process_vm() in the runtime access_process_vm.h. This function used copy_to_user_page(). On x86 machines, cache consistency is hardware enforced, so the resulting copy_to_user_page() ends up being just a memcpy(). However, on ia64 (and powerpc) the code needs to take some additional steps to make sure the cache and memory are consistent. This results in the ia64 (and powerpc) copy_to_user_page() including a flush_icache_user_range():
#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
do { memcpy(dst, src, len); \
flush_icache_user_range(vma, page, vaddr, len); \
} while (0)
There is some discussion about copy_to_user_page() and cache coherence at:
http://rhkernel.org/RHEL5+2.6.18-8.el5/Documentation/cachetlb.txt#L352
(In reply to comment #6)
> However, On the ia64 (and powerpc) the code needs to take some additional steps
> to make sure the cache and memory are consistent.
But note that this should not apply to our data-space semaphore accesses. We could fork another version of __access_process_vm that uses memcpy instead of copy_to_user_page().
Use __access_process_vm_noflush for static user semaphore decrement.
commit: 3c5b8e2b99 |