Bug 15219 - syscall.exp failures on RHEL5, RHEL6, and rawhide
Summary: syscall.exp failures on RHEL5, RHEL6, and rawhide
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: tapsets (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-01 17:53 UTC by David Smith
Modified: 2014-01-07 22:02 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
beta patch (1.04 KB, patch)
2013-06-05 17:57 UTC, David Smith
Details | Diff
beta 2 patch (1.53 KB, patch)
2013-06-06 18:02 UTC, David Smith
Details | Diff
anti-nested syscall patch (3.90 KB, patch)
2013-06-11 14:37 UTC, David Smith
Details | Diff
anti-nested syscall patch 2 (1.22 KB, patch)
2013-08-26 20:44 UTC, David Smith
Details | Diff
different anti-nested syscall patch (4.12 KB, patch)
2013-10-08 18:03 UTC, David Smith
Details | Diff
simpler patch (997 bytes, patch)
2013-10-31 17:24 UTC, David Smith
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description David Smith 2013-03-01 17:53:41 UTC
While investigating bug #15211, I realized that we were seeing some common failures in syscall.exp across kernel versions:

On RHEL5 (2.6.18-308.20.1.el5 x86_64) and RHEL6 (2.6.32-279.14.1.el6.x86_64), I see the following failures:

====
FAIL: 32-bit clock syscall
FAIL: 32-bit rt_signal syscall
FAIL: 32-bit timer syscall
====

On rawhide (3.8.0-0.rc6.git1.1.fc19.x86_64), I see the following failures:

====
FAIL: 32-bit clock syscall
FAIL: 32-bit timer syscall
====

From looking at systemtap.log, the failures all happen when the 32-bit compat version of syscalls call the "real" syscall. The systemtap tapset has trouble decoding the arguments.  For example:

timer: timer_settime (0, 1, UNKNOWN, 0xffff880014eeff30) = -22 (EINVAL)

That "UNKNOWN" value is coming from _struct_itimerspec_u(). Inside _struct_itimerspec_u(), we return "UNKNOWN" when _stp_copy_from_user() fails. In this case, _stp_copy_from_user() is failing because the memory it is reading really isn't user memory. The 32-bit compat version of timer_settime() copies the 'struct compat_itimerspec' pointer into a real 'struct itimerspec', then calls the real sys_timer_settime().

We may need to loosen up _stp_copy_from_user().
Comment 1 David Smith 2013-03-04 17:53:58 UTC
Here's compat_sys_timer_settime() from kernel/compat.c. Notice it converts the 'struct compat_itimerspec' to a 'struct itimerspec' (which is in kernel memory), calls 'set_fs(KERENL_DS), then calls the real syscall function, 'sys_timer_settime()'.

====
long compat_sys_timer_settime(timer_t timer_id, int flags,
			  struct compat_itimerspec __user *new,
			  struct compat_itimerspec __user *old)
{
	long err;
	mm_segment_t oldfs;
	struct itimerspec newts, oldts;

	if (!new)
		return -EINVAL;
	if (get_compat_itimerspec(&newts, new))
		return -EFAULT;
	oldfs = get_fs();
	set_fs(KERNEL_DS);
	err = sys_timer_settime(timer_id, flags,
				(struct itimerspec __user *) &newts,
				(struct itimerspec __user *) &oldts);
	set_fs(oldfs);
	if (!err && old && put_compat_itimerspec(old, &oldts))
		return -EFAULT;
	return err;
}
====

Here's our '_stp_copy_from_user()'. Notice we explicitly call 'set_fs(USER_DS)', which just overrode compat_sys_timer_settime() setting.

====
static unsigned long _stp_copy_from_user(char *dst, const char __user *src, unsigned long count)
{
	if (count) {
                mm_segment_t _oldfs = get_fs();
                set_fs(USER_DS);
                pagefault_disable();
		if (access_ok(VERIFY_READ, src, count))
			count = __copy_from_user_inatomic(dst, src, count);
		else
			memset(dst, 0, count);
                pagefault_enable();
                set_fs(_oldfs);
	}
	return count;
}
====

I'm thinking we should no longer change the kernel's idea of what memory space to use in _stp_copy_from_user().
Comment 2 Frank Ch. Eigler 2013-03-04 18:40:06 UTC
I believe the USER_DS protection still has value, for contexts where the set_fs() is not manually tweaked by the kernel.

We could track the syscall nesting, and the script code can use kernel_*() instead of user_*() for those parameters when nesting is detected.

See also bug #6762.
Comment 3 David Smith 2013-06-03 13:38:24 UTC
I wonder if we couldn't have a '_stp_copy_from_user_or_kernel()' function that would do the correct copy based on the current set_fs() value. If we wanted to get even more strict we could only call that one if 'current' is a 32-bits-on-64-bits process.
Comment 4 Frank Ch. Eigler 2013-06-04 15:14:41 UTC
ISTM that the places where we do _stp_copy_from_user in the
tapset, we could conditionalize those functions based on get_fs(),
dispatching to a new _stp_copy_from_kernel() or somesuch.  These
would account for the kernel's internal set_fs games that are
known to occur on those particular parameters.  I would disfavour
opening up the automation in the, as a belt&suspenders kind of
safety measure against inappropriate information disclosure.

BTW, tapset/linux/aux_syscalls.stp, _struct-itemerspec_u(), with
its NULL / UNKNOWN coding style, is implicated in bug #15044.
Comment 5 David Smith 2013-06-05 17:57:56 UTC
Created attachment 7057 [details]
beta patch

Here's a beta patch, that adds a function to copy from the user or kernel memory (it just uses the current memory segment) and makes a couple of the functions in aux_syscalls.stp call the new function.

This patch doesn't fix all the issues in the tapset, I'm just looking to see if the approach looks reasonable.

One thing to think about is which functions in aux_syscalls.stp do we modify? Do we go ahead and modify them all (that use _stp_copy_from_user()) or do we try to only do it where we know there are compat functions that switch to using kernel memory?
Comment 6 David Smith 2013-06-06 18:02:55 UTC
Created attachment 7063 [details]
beta 2 patch

Here's a second version of the patch, that tries to be more paranoid about when to allow reading kernel memory, especially for Unprivileged users.
Comment 7 David Smith 2013-06-11 14:37:38 UTC
Created attachment 7067 [details]
anti-nested syscall patch

Here's a new patch that attempts (and fails) to fix this problem a different way.

Instead of changing the way we access memory, instead this patch tries to avoid the nested system calls. To do this, in the 64-bit syscall .call and .return probe, I added the following line:

	if (%{ _stp_is_compat_task() %}) next

So, if we're in a 64-bit syscall with a 32-bit process, skip the probe (and let the 32-bit syscall handle it). (That line could be prettied up by using a macro.)

As I mentioned above, the advantage this approach has is that it doesn't change the way we access memory. The disadvantages are:

- Knowing when and where to add that line. The problem is that some 32-bit syscalls are wrappers around the 64-bit syscall, and some 32-bit syscalls call the same underlying function that the 64-bit syscall does (after massaging the arguments a bit). So, in the 1st case (the wrapper case), we need the probe skip line, but in the 2nd case we don't.  I used the syscall.exp test case to know where to put that line when developing the patch, but the syscall.exp test case doesn't test every system call.

- Whether we need that line or not can change with kernels. For example, the syscall.rt_sigprocmask probe should be skipped on RHEL[56] kernels, but not on rawhide.

So, some more work will be required here.
Comment 8 David Smith 2013-06-17 16:10:43 UTC
(In reply to David Smith from comment #7)
> - Whether we need that line or not can change with kernels. For example, the
> syscall.rt_sigprocmask probe should be skipped on RHEL[56] kernels, but not
> on rawhide.

I couldn't figure out why this change didn't work for rawhide, so I dug in a bit more. Here's what I found. Over in arch/x86/include/generated/asm/syscalls_32.h (which gets generated during the kernel build), you'll see this:

====
__SYSCALL_I386(173, sys_rt_sigreturn, stub32_rt_sigreturn)
__SYSCALL_I386(174, sys_rt_sigaction, compat_sys_rt_sigaction)
__SYSCALL_I386(175, sys_rt_sigprocmask, sys_rt_sigprocmask)
__SYSCALL_I386(176, sys_rt_sigpending, compat_sys_rt_sigpending)
__SYSCALL_I386(177, sys_rt_sigtimedwait, compat_sys_rt_sigtimedwait)
__SYSCALL_I386(178, sys_rt_sigqueueinfo, compat_sys_rt_sigqueueinfo)
__SYSCALL_I386(179, sys_rt_sigsuspend, sys_rt_sigsuspend)
====

Notice that for syscall 175, rt_sigprocmask doesn't call 'compat_sys_rt_sigprocmask', it just calls the real 64-bit 'sys_rt_sigprocmask' (see the 3rd argument and compare to rt_sigaction or rt_sigpending).

On RHEL[56], the compat function for rt_sigprocmask does get called.

(Also note that 'rt_sigreturn' calls 'stub32_rt_sigreturn', which the syscall tapset isn't expecting. So, we probably don't handle a 32-bit rt_sigreturn properly.)
Comment 9 David Smith 2013-08-26 20:44:16 UTC
Created attachment 7166 [details]
anti-nested syscall patch 2

Here's yet another approach to fixing this one. This one attempts to solve the nested-syscall problem by defining 2 macros: '@__syscall_enter' (called at each syscall entry probe) and '@__syscall_exit' (called in each syscall return probe). It just uses a global array to figure out if we're already in a system call in this thread. If so, the probe is skipped.

Here are some notes:

- I just modified the syscall tapset enough to get the 32-bit clock syscall passing, I didn't modify every place that needs changes.

-There are a couple of problems with using a macro library file, listed there. The best way of fixing these problems might be to combine syscalls.stp and syscalls2.stp so that the macros could be defined in that file.

- Similar changes would need to be made for the nd_syscall tapset. It would probably need its own global array.
Comment 10 David Smith 2013-10-08 18:03:15 UTC
Created attachment 7226 [details]
different anti-nested syscall patch

Here's a patch that attempts to fix this problem by avoiding nested syscalls. It does this by looking at the syscall number. If we're in syscall.clock_settime, but the syscall number isn't __NR_clock_settime, return. This is done with macros stored in tapset/linux/syscalls.stpm, called '__syscall_enter' and '__syscall_enter2'.

With this patch, the syscall test case passes completely on rawhide x86_64 (3.12.0-0.rc2.git0.1.fc21.x86_64).

Note that the patch isn't complete. The nd_syscall tapset files would need to modified in a similar manner. Also, all nested syscalls weren't avoided, just the ones needed for syscall.exp to pass.

Also note that the macros could use better names, especially since they also get called in .return probes.

Comments welcome.
Comment 11 David Smith 2013-10-08 18:15:46 UTC
One more "todo" about this patch. It is very x86 centric, with the use of some '__NR_ia32_SYSCALL' defines. We'll probably have to define our own "generic" compat syscall numbers.
Comment 12 David Smith 2013-10-31 17:24:39 UTC
Created attachment 7262 [details]
simpler patch

Here's a simpler version of the last patch that just fixes one syscall.exp failure for the 32-bit 'clock' subtest when run on rawhide x86_64. This simpler patch should make this approach easier to review and understand.

Note that we'd have to make similar changes to nd_syscalls.stp.

Also, it might make sense to combine the syscall.clock_nanosleep and syscall.compat_clock_settime probe aliases with their 'compat' probe aliases.
Comment 13 David Smith 2014-01-07 22:02:32 UTC
(In reply to David Smith from comment #12)
> Created attachment 7262 [details]
> simpler patch
> 
> Here's a simpler version of the last patch that just fixes one syscall.exp
> failure for the 32-bit 'clock' subtest when run on rawhide x86_64. This
> simpler patch should make this approach easier to review and understand.
> 
> Note that we'd have to make similar changes to nd_syscalls.stp.
> 
> Also, it might make sense to combine the syscall.clock_nanosleep and
> syscall.compat_clock_settime probe aliases with their 'compat' probe aliases.

I've made changes to the syscall tapsets to implement this idea. At this point, the syscall.exp and nd_syscall.exp testcases pass on every platform I've tested on.

(Note that these changes don't really fix *all* nested syscall problems, just the ones tested by the testsuite. To catch more failures, we'd need to expand the testsuite.)