Bug 13078 - investigate qemu virtio-serial channel for talking to stap-sh
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
Reported: 2011-08-11 18:21 UTC by Frank Ch. Eigler
Modified: 2013-10-18 14:26 UTC

Attachments
Add a --remote unix:/path target (2.20 KB, patch)
2011-08-17 00:40 UTC, Josh Stone
brief interactive stapvirt how-to console session (2.63 KB, text/plain)
2013-09-26 20:54 UTC, Jonathan Lebon

Description Frank Ch. Eigler 2011-08-11 18:21:19 UTC
qemu's virtio/serial widget allows a secure bidirectional channel for
the host to talk to a guest.  We could investigate setting up some guest-side
infrastructure to have a copy of stap-sh listen on one or more such channels,
to launch stap modules without having to set up ssh.

Possible user interface:
 
(HOST)
% qemu-kvm -chardev socket,id=systemtap.0,server,nowait,path=/PATH/TO/SOCKET ...
% stap --remote=unix:/PATH/TO/SOCKET ...

(GUEST)
Have a systemd or sysvinit script to start up / respawn
  stap-sh </dev/virtio-ports/systemtap.0 >/dev/virtio-ports/systemtap.0 2>&1


See also:

http://www.linux-kvm.org/page/Virtio-serial_API
https://rwmj.wordpress.com/2011/07/07/how-does-libguestfs-live-work/
Comment 1 Josh Stone 2011-08-17 00:40:56 UTC
Created attachment 5902 [details]
Add a --remote unix:/path target

The new code is the class unix_stapsh, which is pretty straightforward.  The rest is fixing broken assumptions: get_reply() now skips over dbug messages in case stderr is joined, and handle_poll() has to deal with fdin == fdout.

I haven't tried anything virtualized yet, but this appears to do the trick for simple local sockets.  In one terminal, perhaps in a loop:
  socat UNIX-LISTEN:/tmp/foo EXEC:'stapsh -vvv'

The -v flags are at your discretion.  As is, this will keep all stderr local, but you can use "EXEC:'stapsh -vvv',stderr" to send that over the socket too.

In another terminal:
  stap --remote unix:/tmp/foo ...

et voila!
Comment 2 Josh Stone 2011-08-17 22:54:26 UTC
(In reply to comment #1)
> Created attachment 5902 [details]
> Add a --remote unix:/path target

Committed as e7e8737f, with a small addition that attempts to dup the socket fd.  IMO this makes it a bit cleaner, given that there are multiple fdopens.
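
For illustration only, the dup-then-fdopen idea might look like the sketch below. This is not the code from e7e8737f: connect_unix, SocketStreams and open_socket_streams are hypothetical names, and the sketch assumes a plain AF_UNIX stream socket. The point is that each FILE* gets its own descriptor, so closing one stream does not pull the fd out from under the other.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper: connect to a UNIX stream socket.
// Returns the connected fd, or -1 on any failure.
int connect_unix(const std::string& path) {
  sockaddr_un addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  if (path.size() >= sizeof(addr.sun_path)) return -1;  // path too long
  std::memcpy(addr.sun_path, path.c_str(), path.size() + 1);

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// One bidirectional socket wrapped in two independent stdio streams.
struct SocketStreams {
  FILE* in = nullptr;   // read side, owns the original fd
  FILE* out = nullptr;  // write side, owns the dup'ed fd
};

// On success the streams own fd and its duplicate. On failure nothing
// new is left open and fd still belongs to the caller.
bool open_socket_streams(int fd, SocketStreams& s) {
  assert(fd >= 0);
  int fd2 = dup(fd);
  if (fd2 < 0) return false;
  FILE* out = fdopen(fd2, "w");
  if (!out) { close(fd2); return false; }
  FILE* in = fdopen(fd, "r");
  if (!in) { fclose(out); return false; }  // closes fd2 only
  s.in = in;
  s.out = out;
  return true;
}
```

A caller would do something like open_socket_streams(connect_unix("/tmp/foo"), s) after checking the fd, then fclose() both streams when the session ends.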
Comment 3 Jonathan Lebon 2013-08-09 18:17:32 UTC
<brain_dump>

I'm currently working on making this work with virtio-serial. However, there are two peculiarities of virtio-serial to be aware of:

(1) There is no way for the host to know if/when the guest is connected. In our case specifically, when stapsh exits, stap doesn't get EOF at its fgets() or fread(). It just hangs there. This is by design since the virtio-serial ports are meant to be hot-pluggable.

(2) There can be old data left on the virtio-serial buffer. If stapsh ended abruptly for some reason and didn't read all the data from /dev/virtio-ports/systemtap.0, then the next stapsh session will pick up that data. Similarly, stap has to be ready to read garbage from /tmp/foo before reaching messages from the current stapsh session.

To work around (1), we can make stapsh send a QUITTING message to stap before exiting. However, this would be indistinguishable from a possible output from staprun. One way around this is to prepend all the output from staprun with "staprun:" and replies/messages from stapsh (such as the OK acks) with "stapsh:". This would also mean always reading from /tmp/foo line-by-line rather than in blocks, regardless of --remote-prefix.

To work around (2), we can use a synchronization scheme in which the handshake also includes a unique/random token, which is then used throughout the session.
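
A rough sketch of that synchronization idea, under stated assumptions: the "sync TOKEN" marker and the names make_token and sync_to_token are invented here for illustration and are not part of the stapsh protocol. The reader throws away everything until it sees the token for the current session, including a stale partial line that happens to precede it.

```cpp
#include <cassert>
#include <istream>
#include <random>
#include <sstream>
#include <string>

// Hypothetical: a 16-hex-digit random token chosen per session.
std::string make_token() {
  static const char hex[] = "0123456789abcdef";
  std::random_device rd;
  std::string tok;
  for (int i = 0; i < 16; ++i) tok += hex[rd() % 16];
  return tok;
}

// Discards input until a line carries "sync TOKEN" (assumed wire format).
// Stale bytes may precede the marker on the same line, so search the
// whole line rather than only its start.
// Returns true with `reply` set to the remainder of that line,
// or false if the stream ends first.
bool sync_to_token(std::istream& in, const std::string& token,
                   std::string& reply) {
  assert(!token.empty());
  const std::string marker = "sync " + token;
  std::string line;
  while (std::getline(in, line)) {
    std::string::size_type pos = line.find(marker);
    if (pos != std::string::npos) {
      reply = line.substr(pos + marker.size());
      return true;
    }
  }
  return false;
}
```

Everything read after sync_to_token() returns true can be treated as belonging to the current session.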

I'm currently working on these workarounds.

Finally, I'm also thinking about adding a -d option to stapsh so that it acts as a daemon, i.e. constantly waiting for input on the virtio-serial port (this would be a bit harder than just calling read() since, as mentioned in (1), read() will just return EOF if stap hasn't opened the socket yet -- I may have a resolution for this without having to resort to using the KVM virtio-serial's API).

And to bring it all together, have a script/small app which can modify libvirt-managed VMs using virsh/libvirt API/libxml to add/verify proper virtio-serial port setup, as well as somehow set up a systemd service for stapsh, possibly using libguestfs to modify the hard drive directly. Down the road, this small app could even be used to do something like

stap --remote vm:VM_NAME

and it would take care of looking into the VM's configuration to find the UNIX socket's path. This is actually what libguestfs does currently.

For non-libvirt-managed VMs, we can just document the command-line that users need to add to the qemu invocation.

</brain_dump>
Comment 4 Josh Stone 2013-08-20 21:40:12 UTC
One concern I want to document is compatibility.  I have discussed this with jlebon on IRC, and he's already started addressing it in his prototype code.  The gist is that we can't assume that stap on the host is the same version as stapsh/staprun/etc on the remote target.  So both sides of the connection have to be prepared to talk to old versions too -- a new stap still has to understand an old stapsh, and likewise a new stapsh has to be able to talk to an older stap.

Thankfully, this concern was foreseen enough that the initial command handshake has always included version info by both parties.  So even though this protocol hasn't yet been changed since its introduction, it's prepared for a version negotiation.  Still, any changes should be carefully considered and strongly justified, so we can keep the code dealing with compatibility as simple as possible.

> To work around (1), we can make stapsh send a QUITTING message to stap before
> exiting. However, this would be indistinguishable from a possible output from
> staprun. One way around this is to prepend all the output from staprun with
> "staprun:" and replies/messages from stapsh (such as the OK acks) with
> "stapsh:". This would also mean always reading from /tmp/foo line-by-line
> rather than in blocks, regardless of --remote-prefix.

Beware that script output can include binary data, so treating everything line-oriented seems problematic.  The current code on jlebon/cross-vm even removed a note on this, without addressing the issue:

  // NB: The buf could contain binary data,
  // including \0, so write as a block instead of
  // the usual <<string.

When the user specifies --remote-prefix, we go line-oriented by their choice, but I don't think we can get away with always doing this.  If the user wants binary data over stapsh, it should be possible.

To handle this, I think the stapsh protocol should always treat its data shuffling as binary.  That means your current prefix should be more like a block header, or like the existing "file" command but in the other direction.  Perhaps something like "data stdout SIZE\n" followed by the raw bytes.  On stap's side it can decide to write out the raw data or use lines depending on --remote-prefix.  There can be "data stderr SIZE" too, for both staprun's stderr and stapsh's verbose logging, for instance.

The compatibility story is then that stapsh doesn't send any "data" header or things like QUITTING to old stap, and stap with an old stapsh should read lines or binary just as it did before, without expecting that "data ..." line.

I'd also suggest that stapsh just say "quit" at its end, rather than QUITTING, mirroring the stap command.  We're basically establishing a command stream in the stapsh->stap direction, with "data" and "quit" as the first possibilities.
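
To make the proposal concrete, here is a minimal sketch of such a stapsh->stap command stream, assuming exactly the "data STREAM SIZE\n" header and bare "quit" described above. The names encode_data, Command and read_command, and the 1 MiB cap, are assumptions for illustration and not existing systemtap code.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Arbitrary sanity limit for this sketch, so a corrupt header cannot
// trigger a huge allocation.
const std::size_t kMaxBlock = 1 << 20;

// Sender side: frame a raw (possibly binary) block.
std::string encode_data(const std::string& stream, const std::string& payload) {
  assert(stream.find_first_of(" \n") == std::string::npos);
  return "data " + stream + " " + std::to_string(payload.size()) + "\n" + payload;
}

struct Command {
  std::string name;     // "data" or "quit"
  std::string stream;   // "stdout" or "stderr" (for "data" only)
  std::string payload;  // raw bytes, may contain \0 and \n
};

// Receiver side: read one command. The header is line-oriented, the
// payload is read as a block of exactly SIZE bytes.
// Returns false on EOF, an unknown command, or a truncated block.
bool read_command(std::istream& in, Command& cmd) {
  std::string header;
  if (!std::getline(in, header)) return false;
  std::istringstream hs(header);
  cmd = Command();
  if (!(hs >> cmd.name)) return false;
  if (cmd.name == "quit") return true;
  if (cmd.name != "data") return false;
  std::size_t size = 0;
  if (!(hs >> cmd.stream >> size) || size > kMaxBlock) return false;
  cmd.payload.resize(size);
  in.read(&cmd.payload[0], static_cast<std::streamsize>(size));
  return static_cast<std::size_t>(in.gcount()) == size;
}
```

With this shape, stap can decide after read_command() whether to write the payload out raw or split it into lines for --remote-prefix.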

I hope this idea keeps the communication a little simpler and a bit more uniform.  Let me know what you think.
Comment 5 Frank Ch. Eigler 2013-08-21 13:29:24 UTC
Further to the compatibility angle, consider too that even with (future)
stap 2.4 + stapsh 2.4, we don't want to have to use the stdout/stderr
multiplexing mode for transports such as ssh, which handle stdout/stderr
for us separately and well already.

So we could have that stapsh -d option (renamed --stdio or --persistent
or --multiplex?) modify the handshake so that stapsh identifies itself as
"stapsh-d VERSION ...." to stap, at which point they both go into the
explicitly-multiplexed mode.
Comment 6 Jonathan Lebon 2013-08-21 14:00:30 UTC
Thank you both for the feedback. I really like the idea of a "data" and "quit" command. I'll take everything discussed here into account as I rewrite the stapsh modifications.
Comment 7 Josh Stone 2013-08-21 17:31:59 UTC
(In reply to Frank Ch. Eigler from comment #5)
> So we could have that stapsh -d option (renamed --stdio or --persistent
> or --multiplex?) modify the handshake so that stapsh identifies itself as
> "stapsh-d VERSION ...." to stap, at which point they both go into the
> explicitly-multiplexed mode.

Unfortunately, in this case stapsh::set_child_fds() is too picky about what it receives.  We could relax that, but old staps will still be an issue.

    // stapsh VERSION MACHINE RELEASE
    vector<string> uname;
    tokenize(reply, uname, " \t\r\n");
    if (uname.size() != 4 || uname[0] != "stapsh")
      throw runtime_error(_("failed to get uname from stapsh"));

We could have stap explicitly ask for this multiplexing, assuming it knows this is a connection that would need it.  (e.g. used for a "vm:..." scheme, or maybe even all "unix:" needs this.)  This can be a new option command, like "option data" for stdout/err encapsulation, enabling "quit" at that time too since it depends on encapsulation.  I can also imagine something like "option compress" being useful in the future, and maybe "option dyninst" to make the run command use stapdyn for PR14711.
Comment 8 Jonathan Lebon 2013-08-26 18:07:33 UTC
I've just deleted and recreated the jlebon/cross-vm branch containing multiple modifications as discussed here. Here is the main diff:
- Add an "option" command to stapsh. E.g. "option verbose" increments verbosity and "option quit" causes stapsh to send "quit" to stap upon exiting.
- The "option data" command turns on 'prefixing' which prefaces all outputs from staprun by "data stream size".
- For the unix scheme, turn on the "data" and "quit" options.
- To deal with handling commands and data printing, a new class ShStream was added in remote.cxx and is used by handle_poll().
- Add a -l [PORT] option to stapsh to listen to any old serial port.

There are three things that remain to be done:
- Also use SIGIO so that we have immediate knowledge of the host connecting rather than busy polling.
- Do the TODO in ShStream::print so that scripts such as 'timer.s(1) { printf("hello") }' --remote-prefix does not cause the output 0: hello0: hello0: hello
- (Once stap-vm is complete) Create a qemu_stapsh class (and possibly a new scheme qemu://VMNAME to go along with it) and move the "quit" option from unix_stapsh to qemu_stapsh, since unix does not strictly need the quit option.
Comment 9 Jonathan Lebon 2013-08-27 16:19:59 UTC
After talking with jistone on IRC, we decided to try out using poll() instead of threads in order to simplify things. The latest commit on jlebon/cross-vm does this. Additionally, staprun is now only piped when necessary, as opposed to unconditionally.
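
As a generic illustration of the poll()-based approach (not the code on jlebon/cross-vm), a single-threaded loop can watch several descriptors and forward whatever is ready. Route and pump_once are hypothetical names, and the sketch assumes blocking descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <poll.h>
#include <unistd.h>
#include <vector>

// Forward data appearing on `from` to `to`.
struct Route {
  int from;
  int to;
};

// Waits up to timeout_ms for any `from` fd to become readable, then
// copies one buffer's worth from each ready fd to its `to` fd.
// Returns the number of bytes forwarded (0 on timeout or EOF), -1 on error.
long pump_once(const std::vector<Route>& routes, int timeout_ms) {
  std::vector<pollfd> pfds;
  for (std::size_t i = 0; i < routes.size(); ++i) {
    pollfd p;
    p.fd = routes[i].from;
    p.events = POLLIN;
    p.revents = 0;
    pfds.push_back(p);
  }
  if (poll(pfds.data(), pfds.size(), timeout_ms) < 0) return -1;

  long total = 0;
  char buf[4096];
  for (std::size_t i = 0; i < pfds.size(); ++i) {
    if (!(pfds[i].revents & POLLIN)) continue;
    ssize_t got = read(routes[i].from, buf, sizeof(buf));
    if (got < 0) return -1;
    ssize_t off = 0;
    while (off < got) {  // handle short writes
      ssize_t put = write(routes[i].to, buf + off, got - off);
      if (put < 0) return -1;
      off += put;
    }
    total += got;
  }
  return total;
}
```

Calling pump_once() in a loop replaces one reader thread per descriptor.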
Comment 10 Jonathan Lebon 2013-08-28 16:21:51 UTC
I just pushed a few more commits on jlebon/cross-vm to address two of the todos mentioned in comment 8. One of them adds better handling of --remote-prefix when printing on the same line, and another adds SIGIO handling if available.
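
For reference, the same-line problem from comment 8 comes down to emitting the prefix only at the start of a line and remembering that state between chunks. The sketch below shows the idea with a hypothetical PrefixWriter class. It is not the ShStream::print code from the branch, and it assumes the desired output for three printf("hello") calls is a single "0: hellohellohello".

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Writes chunks to `out`, inserting `prefix` only at the beginning of
// each line, even when a line arrives split over several chunks.
class PrefixWriter {
 public:
  PrefixWriter(std::ostream& out, std::string prefix)
      : out_(out), prefix_(std::move(prefix)), at_line_start_(true) {}

  void write(const std::string& chunk) {
    for (std::string::size_type i = 0; i < chunk.size(); ++i) {
      if (at_line_start_) {
        out_ << prefix_;
        at_line_start_ = false;
      }
      out_.put(chunk[i]);
      if (chunk[i] == '\n') at_line_start_ = true;
    }
  }

 private:
  std::ostream& out_;
  std::string prefix_;
  bool at_line_start_;  // true until the first byte of a new line is written
};
```

The state has to live across calls, which is why a per-stream object is used rather than a free function.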
Comment 11 Jonathan Lebon 2013-08-30 20:37:49 UTC
I created a udev rules file as well as a systemd template service file which, placed together in the VM, spawn a stapsh instance for each detected port of the type org.systemtap.[0-9]* (naming not formalized yet). Pretty sweet!

So all we would need to do is to make these files part of systemtap-runtime for users to install in their VM (no added dependencies!) and on the host side have stap-vm add the ports to the VM's definition to get the whole thing working.

For systems that do not have systemd, we can have a simple bash script which looks for any of the org.systemtap.* ports and starts up one stapsh for each of them (or exits if no ports are found). It would be helpful to make this script long-lived, respawning any of the stapsh instances as they exit.

A minor issue is SELinux. It blocks qemu from creating sockets anywhere other than in a directory with the qemu_var_run_t context. One good location satisfying this is /var/lib/libvirt/qemu (recommended by dberrange in BZ598533#c6).

The last major hurdle left is permissions. We have no issues on the guest side since systemd/init spawns stapsh. However, on the host side, the UNIX socket is created with qemu:qemu 755. There are no options to change this. These permissions come from a umask(022) call in libvirtd, which qemu then inherits. I'll have to contact the libvirt guys to see if they have any ideas.
Comment 12 Jonathan Lebon 2013-09-04 16:18:48 UTC
WARNING: I'm sorry in advance for breaking heads with this post

There are two things that have come to light since the last comment:

(1) Support for hot-plugging virtio-serial ports was added in libvirt 1.1.1 [1][2]. If this works as expected (will test soon), it means that we could do one of two things:
   - Do away with stap-vm completely and have stap hot-plug a port for the duration of the session only. This would mean adding a dependency to libvirt.
   - Keep stap-vm (maybe part of a different package) and offload the hot-plugging work to it (which stap would then use). It could also be used by users to permanently add ports to their VMs (as originally envisioned), which would still be needed for machines with a libvirt version < 1.1.1.

(2) There is a slew of functions in the libvirt API which abstract away the virtio-serial port [3]. I was informed this is the recommended way to interact with virtio-serial ports on the host side (and avoid the permission issue discussed in comment 11), but would also add a dependency to libvirt.

Note that AFAIK, libguestfs does not use this API. They do not encounter permission issues because they create their own socket and make qemu connect to it (in fact they have the reverse issue, to make sure that qemu has permission to connect to their socket). This wouldn't work in our case unless we have the VM start after stap (or we could have a utility that runs on boot to create the socket and listen on it).

---

Conclusion/recap of what we could do (non-exhaustive):

A. Keep stap-vm and use it for hot-plugging and installing permanent ports for libvirt < 1.1.1, as well as for brokering the stream using the API (perhaps expose a unix socket to stap, or even just stdin/out). Root would be needed for this (e.g. setuid, or user input). I'm not sure how much work is involved if we were to use the stream APIs, and whether stap/stapsh would need to be modified.

B. Keep stap-vm and use it for hot-plugging and installing permanent ports for libvirt < 1.1.1, but don't use the stream API. Make it change the socket's permission on-the-fly so that stap can then connect to it the usual way. Root would be needed for this (e.g. setuid, or user input).

C. Don't use stap-vm, make stap hot-plug a port for the duration of the session. Don't use the stream API, connect directly to the UNIX socket. Adds a dependency to libvirt in stap. We would need root access for this as well.

I personally prefer option B because it involves less new code to maintain and it's easier to understand.

Let me know what you think!

[1] https://www.redhat.com/archives/libvir-list/2013-July/msg00125.html
[2] http://libvirt.org/news.html
[3] See the virStream* functions and virDomainOpenChannel() at http://libvirt.org/html/libvirt-libvirt.html
Comment 13 Jonathan Lebon 2013-09-11 18:49:11 UTC
I just recreated the jlebon/cross-vm branch with all the 'squashme' commits squashed and an initial version of stap-vm. It ended up being a bit larger than expected (almost exactly 1000 lines) but I made sure to keep the code clear and easy to follow. :)

The next step is to now modify remote.cxx to add a vm://DOMAIN scheme which shouldn't be too hard. The logic would be something like this:
- Call stap-vm query DOMAIN to make sure it's a valid domain
- Call stap-vm port-list DOMAIN to get a list of available ports
- If no port is available or none of the ports could be opened, and hotplugging is supported, call stap-vm port-hotplug-add DOMAIN to add a port on the fly.
- Once a port has been obtained, use that to talk to stapsh (maybe create a unix_stapsh instance and hand it off)
- Once we're done with the port, if it was obtained using port-hotplug-add, then call stap-vm port-hotplug-remove DOMAIN to remove the port.
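
The steps above can be sketched roughly as follows. Everything here is hypothetical: VmTool stands in for invoking the stap-vm subcommands, with one callback per subcommand, and passing the port path to the remove step is an assumption. Only the control flow is meant to match the list.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the stap-vm subcommands.
struct VmTool {
  std::function<bool(const std::string&)> query;                          // stap-vm query DOMAIN
  std::function<std::vector<std::string>(const std::string&)> port_list;  // stap-vm port-list DOMAIN
  std::function<bool(const std::string&, std::string&)> hotplug_add;      // stap-vm port-hotplug-add DOMAIN
  std::function<void(const std::string&, const std::string&)> hotplug_remove;  // stap-vm port-hotplug-remove DOMAIN
  std::function<bool(const std::string&)> try_open;  // try connecting to a port's socket
  bool hotplug_supported;
};

struct Port {
  std::string path;
  bool hotplugged;  // true if we added it and must remove it later
};

// Returns true and fills `port` once a usable port has been found.
bool acquire_port(const VmTool& vm, const std::string& domain, Port& port) {
  assert(vm.query && vm.port_list && vm.try_open);
  if (!vm.query(domain)) return false;  // not a valid domain

  // Prefer an existing port.
  std::vector<std::string> ports = vm.port_list(domain);
  for (std::size_t i = 0; i < ports.size(); ++i) {
    if (vm.try_open(ports[i])) {
      port.path = ports[i];
      port.hotplugged = false;
      return true;
    }
  }

  // Otherwise fall back to hotplugging, if supported.
  if (!vm.hotplug_supported) return false;
  std::string added;
  if (!vm.hotplug_add(domain, added)) return false;
  if (!vm.try_open(added)) {
    vm.hotplug_remove(domain, added);  // do not leak the new port
    return false;
  }
  port.path = added;
  port.hotplugged = true;
  return true;
}

// Removes the port again, but only if acquire_port() hotplugged it.
void release_port(const VmTool& vm, const std::string& domain, const Port& port) {
  if (port.hotplugged) vm.hotplug_remove(domain, port.path);
}
```

The port returned by acquire_port() would then be handed to something like a unix_stapsh instance for the actual session.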
Comment 14 Jonathan Lebon 2013-09-24 21:23:15 UTC
Much has been done since the last update. Almost everything has been implemented. To try to keep this PR self-contained, here is roughly what has been done:
- Added stapvirt.c (referred to previously as stap-vm)
- Added the libvirt:// scheme in remote.cxx
- Added stapvirt to the build system
- Added stapvirt.1 man page and cleared up the --remote section in stap.1
- Added sysvinit scripts for non-systemd based systems

Items that remain include:
1. Modifying systemtap.spec
   - To be discussed. E.g. we could have the 'guest' related files (stapsh@.service, 99-stapsh.rules, sysvinit scripts) in a guest package, e.g. 'systemtap-runtime-virtguest', and the 'host' related files (which is just stapvirt) in e.g. 'systemtap-runtime-virthost'
2. Testing the sysvinit script on a RHEL5 machine

If you'd like to take a look, I created the jlebon/cross-vm-clean branch which contains larger diffs for your viewing pleasure. Any testing and feedback would be welcome!
Comment 15 Jonathan Lebon 2013-09-26 20:54:51 UTC
Created attachment 7217 [details]
brief interactive stapvirt how-to console session

Small 'tutorial' on using stapvirt to run probes inside virtual machines.
Comment 16 Jonathan Lebon 2013-10-18 14:26:39 UTC
Merged into main branch (commit 2459a42). Two new packages have been added: systemtap-runtime-virthost and systemtap-runtime-virtguest.