This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Re: breakpoint assistance: single-step out of line

From: Roland McGrath <roland at redhat dot com>
To: Jim Keniston <jkenisto at us dot ibm dot com>
Cc: systemtap at sources dot redhat dot com
Date: Wed, 28 Mar 2007 21:40:34 -0700 (PDT)
Subject: Re: breakpoint assistance: single-step out of line

> > Instruction decoding needs to be robust, not presume the canonical subset
> > of encodings normally produced by the compiler, as used in the kernel.  On
> > machines other than x86, this tends to be quite simple.  On x86, it means
> > parsing all the instruction prefixes correctly and so forth.  I think the
> > parsing should be done at breakpoint insertion time, caching just a few
> > bits saying what fixup strategy we need to use after the step.
> 
> I guess that depends on how complicated the switch(opcode) { ... } code
> in uprobe_resume_execution() gets.  Parsing the instruction at
> probe-insertion time is essential for x86_64, at least partly because of
> rip-relative addressing, as you discuss below.

The parsing will always be more costly than checking a few bits.  You're
always going to be doing it at insertion time anyway; there's no reason not
to cache the results of that across the board.
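
As a rough illustration of caching "a few bits" at insertion time, here is
a minimal userspace sketch in C.  The FIX_* names, classify_insn() and
resume_after_step() are hypothetical, and the opcode coverage is a small
subset chosen for the example, not a complete x86 decoder.  It skips
legacy and REX prefixes before looking at the opcode, so an encoding such
as "rep ret" (f3 c3) is still recognized as a ret.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup bits cached per probe; names are illustrative only. */
#define FIX_IP		0x01	/* ip is relative to the slot: shift it back */
#define FIX_CALL	0x02	/* pushed return address points into the slot */
#define FIX_INVALID	0x80	/* could not decode: refuse to insert */

static int is_legacy_prefix(uint8_t b)
{
	switch (b) {
	case 0x26: case 0x2e: case 0x36: case 0x3e:	/* es, cs, ss, ds */
	case 0x64: case 0x65:				/* fs, gs */
	case 0x66: case 0x67:				/* operand/address size */
	case 0xf0: case 0xf2: case 0xf3:		/* lock, repne, rep */
		return 1;
	}
	return 0;
}

/* Run once, at insertion time, over the bytes of the probed instruction. */
static uint8_t classify_insn(const uint8_t *insn, size_t len, int mode64)
{
	size_t i = 0;

	while (i < len && is_legacy_prefix(insn[i]))
		i++;
	if (mode64 && i < len && (insn[i] & 0xf0) == 0x40)	/* REX */
		i++;
	if (i >= len)
		return FIX_INVALID;

	switch (insn[i]) {
	case 0xc2: case 0xc3: case 0xca: case 0xcb:	/* ret: ip is absolute */
		return 0;
	case 0xe8:					/* call rel32 */
		return FIX_IP | FIX_CALL;
	case 0xff:					/* group 5: look at ModRM.reg */
		if (i + 1 >= len)
			return FIX_INVALID;
		switch ((insn[i + 1] >> 3) & 7) {
		case 2: case 3:				/* indirect call */
			return FIX_CALL;
		case 4: case 5:				/* indirect jmp */
			return 0;
		}
		break;
	}
	return FIX_IP;		/* everything else: plain ip adjustment */
}

/* Run after every single-step; only tests the cached bits. */
static void resume_after_step(uint8_t fix, unsigned long *ip,
			      unsigned long *ret_addr,
			      unsigned long orig, unsigned long slot)
{
	assert(!(fix & FIX_INVALID));
	if (fix & FIX_CALL)
		*ret_addr += orig - slot;
	if (fix & FIX_IP)
		*ip += orig - slot;
}
```

The point of the split is that the per-hit path never touches the
instruction bytes again; it only tests the flags computed at insertion.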

> The approach we had in mind was to change the rip-relative instruction
> to an indirect instruction where the target address is in a scratch
> register (one not accessed by the original instruction).  Save the value
> of the scratch register, load in the target address, single-step, and
> restore the scratch register's real value.  This isn't coded yet.

This requires decoding the instruction in more detail than we've done
before, to be sure of what register is free.  Maybe that isn't really all
that hard, but I'm not sure--I think there are a lot of cases to cover
before you can be sure what the target register is.  By contrast, the
segment prefix is very easy to parse.  The normal register fiddling is
probably more efficient than the fs/gs fiddling, but unless the difference
is drastic I think keeping the instruction decoder simpler is the overall
win.
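
To give a feel for how little decoding the prefix side needs, here is a
small userspace sketch in C.  The helper names segment_override() and
modrm_is_rip_relative() are made up for this example.  The first scans the
legacy prefix bytes for a segment override; the second tests a ModRM byte
for the 64-bit rip-relative form (mod == 00, r/m == 101).  Locating the
ModRM byte for an arbitrary opcode is the harder part and is not attempted
here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the last segment-override prefix byte, or 0 if there is none. */
static uint8_t segment_override(const uint8_t *insn, size_t len)
{
	uint8_t seg = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: case 0x2e: case 0x36: case 0x3e:	/* es, cs, ss, ds */
		case 0x64: case 0x65:				/* fs, gs */
			seg = insn[i];
			continue;
		case 0x66: case 0x67:				/* size overrides */
		case 0xf0: case 0xf2: case 0xf3:		/* lock, repne, rep */
			continue;
		}
		break;		/* first non-prefix byte ends the scan */
	}
	return seg;
}

/* In 64-bit mode, mod == 00 with r/m == 101 means [rip + disp32]. */
static int modrm_is_rip_relative(uint8_t modrm)
{
	return (modrm & 0xc7) == 0x05;
}
```

For instance, "48 8b 05 <disp32>" (mov with a rip-relative source) has
ModRM byte 0x05 and no segment prefix, while "64 48 8b 04 25 <disp32>"
carries an fs override.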

> We were thinking in terms of a per-process page that's automatically set
> up at exec time.  There's no dso involved in our approach, but the
> "vdso" reference has been hard to kill.
[...]
> Our current approach uses a fixed-size area (1 page for now) that's
> allocated at exec time.  
[...]
> Sounds good, although I personally don't know the incantation for
> putting the vma there.  Any help would be appreciated.
[...]
> I'm not so worried about the visibility of the area in /proc/*/maps and
> such; protecting it from munmap & friends seems more of a concern.

You can't do anything about that without either not using a proper vma, or
adding some new VM_* flag and making the kernel enforce that normal user
calls can't change it (which might as well be the flag that says to hide it
in /proc too).

Also note that ptrace or suchlike can always come and modify the page too,
even if the user has not made it writable.  So whatever bits you store on
that page, you must always use with care in the kernel.  Scrambling that
page must not be able to produce any bad effects in the kernel, nothing
worse than a scrambled user context in a thread in that address space.

The patch you posted is a non-starter.  I think I get now why you keep
thinking "vdso"--you mean an unaccounted mapping of an unaccounted page
that will never be paged out.  This is several kinds of bad; I won't go
into the details of why.  I'm sorry I wasn't more explicit about how to
keep it simple.  Since we're not even trying to do any special hiding
magic, it didn't occur to me that you'd do anything but this:

	#define SLOT_SIZE		...
	#define	SLOT_AREA_SIZE		PAGE_ALIGN(NR_CPUS * SLOT_SIZE)

		struct mm_struct *mm = current->mm;
		unsigned long addr;

		down_write(&mm->mmap_sem);
		/*
		 * Find the end of the top mapping and skip a page.
		 * If there is no space for SLOT_AREA_SIZE above
		 * that, mmap will ignore our address hint.
		 */
		addr = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct,
				vm_rb)->vm_end + PAGE_SIZE;

		addr = do_mmap_pgoff(NULL, addr, SLOT_AREA_SIZE, PROT_EXEC,
				     MAP_PRIVATE|MAP_ANONYMOUS, 0);
		if (addr &~ PAGE_MASK)
			... -errno = addr ...;
		up_write(&mm->mmap_sem);

If we think up useful tweaks to make the vma more special, add (before
up_write):

		vma = find_vma(mm, addr);
		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; // or whatever

Also, doing preemptive allocation at exec time does not wash with me.  Your
version has an extra unpageable page per process as well, but with a normal
allocation it's still a gratuitous vma per process.  Most processes will
never be probed.  I don't think this universal overhead is warranted.
Allocating on demand at first probe insertion makes sense to me.  Using the
top address area means it's unlikely you'll ever interfere with normal
mappings anyway, and if somehow none is available at insertion time, then
tough, you don't insert.
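
The on-demand policy can be sketched in userspace C as below.  This is
only an analogue under stated assumptions: struct probed_mm,
insert_probe(), the SLOT_SIZE value and NR_SLOTS (standing in for NR_CPUS)
are all hypothetical, and mmap(2) stands in for the in-kernel mapping
call.  The area is created the first time a probe is inserted and reused
afterwards; if the mapping fails, the insertion simply fails.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLOT_SIZE	16	/* assumption: room for one x86 instruction */
#define NR_SLOTS	64	/* assumption: stand-in for NR_CPUS */

struct probed_mm {
	void *slot_area;	/* NULL until the first probe is inserted */
	size_t slot_area_size;
	unsigned int nprobes;
};

/* Round n up to a multiple of the page size. */
static size_t page_align(size_t n)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	return (n + page - 1) & ~(page - 1);
}

/* Returns 0 on success, -1 if the slot area cannot be set up. */
static int insert_probe(struct probed_mm *pm)
{
	if (!pm->slot_area) {
		size_t size = page_align((size_t)NR_SLOTS * SLOT_SIZE);
		void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return -1;	/* no space: don't insert */
		pm->slot_area = p;
		pm->slot_area_size = size;
	}
	pm->nprobes++;
	return 0;
}
```

A process that is never probed keeps slot_area == NULL and pays nothing;
the second and later insertions reuse the area created by the first.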

Sorry, I really thought the vma was the trivial part of this and not the
interesting one.  I'd like to see the robust instruction decoding work.


Thanks,
Roland

Follow-Ups:
- Re: breakpoint assistance: single-step out of line
  - From: Jim Keniston

References:
- Re: breakpoint assistance: single-step out of line
  - From: Jim Keniston
