| Summary: | binary tracing | | |
|---|---|---|---|
| Product: | systemtap | Reporter: | Frank Ch. Eigler <fche> |
| Component: | translator | Assignee: | Martin Hunt <hunt> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | guanglei, guij, hiramatu, hunt, jkenisto, jrs, prasadav, zanussi |
| Priority: | P2 | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | 2006-02-20 03:00:04 |

Attachments:
- Binary Tracing runtime interface and example script; stpd patch for merge
- updated binary tracing prototype (runtime interface and example script; patch for merge)
- binary tracing prototype code modified to remove hard-coded packet header
- add bin_*printf() interface for comparison
- prototype generic binary tracer using generic binary tracing interface
- somehow bin_write.c didn't get into the prototype patch - here's a new patch
Description
Frank Ch. Eigler
2006-02-09 03:55:22 UTC
The following are my thoughts about the binary trace interface. In summary, I prefer Martin Hunt's ideas. The idea Tom proposed would bring programming convenience and a performance improvement, but it has some problems:

1. It requires introducing structs into the systemtap script language.

2. It is inefficient when several fields of a struct target are referenced in different assignments, e.g.:

   trace_scsi.hostno = $cmd->device->host->host_no
   trace_scsi.channel = $cmd->device->channel
   trace_scsi.lun = $cmd->device->lun

   Each reference to $target generates a series of function calls (function__tvar_get_cmd, fetch_register, deref, etc.). This is why I don't use printf directly in LKET in most cases, but use _stp_printf() in embedded C code.

3. Another issue is what happens if some field of a struct is a string, e.g.:

   trace_process_fork.processname = str_processname

   How should systemtap generate the corresponding struct? What is the size of the string field of that struct?

So after thinking about this topic, I prefer what Martin Hunt proposed (http://sources.redhat.com/ml/systemtap/2006-q1/msg00284.html). It looks like a more generic interface, and it allows a mixture of data types (binary & string). We could implement it in the runtime library, naming it _stp_bin_printf(const char *fmt, ...). We could introduce a new function, bin_vsnprintf, based on the existing vsnprintf to support _stp_bin_printf, which would directly store the integer value instead of converting the integer into an ASCII string. Of course we should modify stpd to let it support binary data transfer, but I think BTI has already done this, so maybe only a slight change is required.

Jose also implemented a feature which uses /proc to control whether the probe handler should start/stop printing data. So we could also introduce _lket_printf(const char *fmt, ...) using bin_vsnprintf, which checks /proc to see whether it should print data now and also does some common work like log_tracedata_common().
But I think start/stop tracing via /proc is useful and should be incorporated into systemtap. With _stp_bin_printf/bin_vsnprintf/_lket_printf in place, we can introduce wrapper functions such as lket_trace(). One idea is to introduce lket_print into the translator, just like print, so that we could use lket_print("%d%d%s%d", int1, int2, str1, int3) in a script directly. I am not sure if I missed some points.

(In reply to comment #1)
> The following are my thoughts about the binary trace interface. In summary, I
> prefer Martin Hunt's ideas.

Basically, I agree with his ideas too. I suggest introducing a compressed format when consecutive arguments have the same type. For example:

   printb("%d%d%d%d", long1, long2, long3, long4);

This expression would be compressed as below:

   printb("%d4", long1, long2, long3, long4);

This compressed format is simple and reduces the amount of data transported. By the way, I also hope the name of the interface is as neutral as printf ;-).

> printb("%d4", long1, long2, long3, long4);
>
> This compressed format is simple and reduces the amount of data transported.

In fact, this is almost the same as what I did in LKET (log_tracedata_common_compact & log_tracedata_common_compact_more in logtrace.stp). It should be useful for someone, although maybe not very useful for LKET.

> By the way, I also hope the name of the interface is as neutral as printf ;-).

How about _stp_bin_printf/bin_vsnprintf, corresponding to the current _stp_printf/vsnprintf? Any good suggestions for their names? _lket_printf is intended for LKET only.

Before we go deciding on a name, we first need to decide on the general concept of the interface that's going to be implemented. I also like Martin's method for implementing binary tracing, since I find it very natural. Since we have a couple of people voting for this interface, is Martin's method the way to go?

It seems to me that an approach along these lines will not satisfy optimized tracing along the lines of IBM's original request (re. struct support in the language), nor permit much or any translator assistance to provide metadata to a binary decoder. I consider it a technically weak solution.

On the other hand, just extending printf with binary formatting directives should be easy, and should not cause adverse interference with other future designs. So, if people want to experiment with this part, go ahead. Both proposals require some changes to the transport interface to improve efficiency.

It will also improve the performance of printf to have our own optimized copy of vsnprintf() instead of using the kernel version. Adding binary support to that will be trivial. So I can add binary support to printf, and if we change our minds later and want to try the other way, we haven't really wasted any time. I'll get started on it.

Martin, we are trying to improve LKET based on the new binary trace interface, so implementing such an interface currently has first priority for us. I also did some investigation of it, e.g. the binary version of vsnprintf and the enhancement of stpd to support binary & ascii packets (mentioned by gBTI). So if you like, I can help with some of the work on this new binary trace interface.

Created attachment 899 [details]
Binary Tracing runtime interface and example script; stpd patch for merge
I've made a prototype of binary tracing (binary_trace.tar.bz2) based on the discussion on the mailing list, especially originating from Hitachi's gBTI and Martin's idea. It stores integers in binary format, handles both integer and string arguments, and uses the relayfs buffer directly. As a starting point, it is currently only tested on 2.6.14 (x86) and Redhat 2.6.9-30EL (ppc64). Your comments and suggestions are very welcome.

How to use:
--------------------------------------------------------------------
1) patch the systemtap-20060225 src with the stpd patch:
   patch -p1 <../patch-binary_trace
2) re-install systemtap
3) run the script: stap -g -b bin_trace.stp

Here is the detailed design:
--------------------------------------------------------------------
1. runtime function

A runtime function bin_strace() is provided to write a binary packet into the relayfs buffer directly:

   void bin_strace(String Str, const char *fmt, ...);

Its usage is similar to _stp_sprintf(), e.g.

   bin_strace(_stp_stdout, "Traceformat: %1b|%2b|%d|%s|", 0xAA, 0x1122, 0x55667788, "hello");

Its main features:
1) it supports "%b", "%s", "%d" and can print integers and strings
2) it stores integers in binary format, not in ascii
3) it writes trace data into the relayfs buffer directly, while in the current ATI, the trace data are formatted into the stp print buffer first and then transported to the relayfs buffer.

The binary packet design is an extension of the gBTI packet. It looks like:

   [32bit seq][8bit flag][8bit size][args ...]

   seq: shared with ATI packets by using _stp_seq_inc(); it is used to merge per-cpu outputs
   flag: always 0 for a binary packet; it is used to distinguish between binary and ascii packets
   size: total length (bytes) of the actual trace data content; it is calculated and filled in by bin_strace()
   args: integers (1, 2 or 4 bytes) or strings (the string length is in the first byte for post-processing; the ending '\0' char is removed)

2. stpd enhancement

stpd is enhanced to recognize the binary packet header and merge the binary packets as well as ATI packets. The header info (seq, flag, size) is not kept in "probe.out".

3. wrapper functions

Different wrapper functions can be defined on top of this runtime interface for different uses. Take LKET as an example: we can define special wrapper functions for each event hook, e.g. lket_syscall_trace(). They all use lket_trace() to output common data.

Known bug:
--------------------------------------------------------------------
When bin_strace() and _stp_sprintf() are used in the same handler, the order of the generated trace data may differ from the coding order, e.g.:

   probe handler { log(something1); lket_trace(something2); }

Here, something2 is stored before something1. The reason is that the content of log() is not transported to relayfs until _stp_print_flush() is called at the end of the handler, while bin_strace() stores data into the relayfs buffer directly.

Thanks in advance for your comments.

Adding interested parties to the CC list.

Created attachment 907 [details]
updated binary tracing prototype (runtime interface and example script; patch for merge)

fix a few bugs in the last one and test it on ppc64/2.6.15.4 and i686/2.6.15.4 with systemtap-20060304;
use bin_write instead of bin_trace as the interface name;
integrate the temporary solution for Bug 2387 (systemtap crash on ppc64/2.6.15.4)

(In reply to comment #11)
> Created an attachment (id=907)
> updated binary tracing prototype (runtime interface and example script; patch
> for merge)
>
> fix a few bugs in the last one and test it on ppc64/2.6.15.4 and i686/2.6.15.4
> with systemtap-20060304;
> use bin_write instead of bin_trace as the interface name;
> integrate the temporary solution for Bug 2387 (systemtap crash on ppc64/2.6.15.4)

Just a couple comments...
It would be nice if this could be made to fit into a slightly larger framework which would make it in essence _a_ systemtap trace format and API rather than _the_ systemtap trace format and API. One approach...

On the runtime side, we could add _stp_reserve(), a transport-independent function to reserve event space in the destination buffer. Using this would make your tracing code slightly less efficient because it would have to do the vararg printing into a temporary buffer first, and then memcpy it into the reserved space, but it probably shouldn't matter too much in comparison with the overhead of the bin_*write functions, and it would make the interface cleaner, more generic, and available to other tracing formats/APIs. You'd also then have access to all transports, not just relayfs. Of course, if you knew ahead of time how large your data would be, you'd avoid the memcpy and could write directly into the reserved space.

The bin_*write functions seem generic enough for any application that wants to do this kind of tracing, but maybe it would be clearer to have them called bin_*printf(). The actual trace struct and wrapper functions, e.g. stp_packet and lket_*, it seems to me should be part of an lket tapset.

The other half of this would be the user side, specifically stpd/librelay. One possibility would be to further modularize librelay and provide a hook for format-specific 'handlers' to supply functions or 'plugins' that would handle the trace data in a format-specific way. This would also be useful for handling different format versions of the same application, since formats are likely to change over time. The current format-specific code in librelay would just become another handler, specifically for the current ascii scheme, I assume.

In summary, I think there needs to be enough flexibility to allow for any number of trace formats and APIs, without locking in or giving preference to a particular one.
If the above vague API ideas make sense, I can hash them out further, but I don't think it needs to be much more than a thin layer: basically stp_reserve() in the runtime, and some generic app-specific hook for data handling in librelay.

(In reply to comment #12)
> On the runtime side, we could add _stp_reserve(), a transport-independent
> function to reserve event space in the destination buffer. Using this would
> make your tracing code slightly less efficient because it would have to do the
> vararg printing into a temporary buffer first, and then memcpy it to the
> reserved space

Thanks for your suggestion. I'll do this and compare their overhead. Before this prototype, I tried another design that modified vsnprintf() to support "%b" and kept everything else untouched, but I finally gave it up for performance reasons.

> The bin_*write functions seem generic enough for any application that wants to
> do this kind of tracing, but maybe it would be more clear to have them called
> bin_*printf().

I use bin_*write because it is not only a binary version of _stp_printf but also writes to the relayfs buffer directly. But you are right, bin_*printf() may be clearer.

> The actual trace struct and wrapper functions e.g. stp_packet and lket_* it
> seems to me should be part of an lket tapset.

lket_* is part of LKET, while the struct stp_packet_h is not. stp_packet_h is similar to gBTI and should be necessary for each binary packet, whatever the trace format. It is mainly used by stpd to merge per-cpu files into "probe.out" at the end of tracing. Among its three fields, seq is shared with the current ascii packet, flag is used to differentiate between binary packets and ascii ones, and size is used to locate the end of the current binary packet.

(In reply to comment #13)
> (In reply to comment #12)
> > On the runtime side, we could add _stp_reserve(), a transport-independent
> > function to reserve event space in the destination buffer.
> > Using this would
> > make your tracing code slightly less efficient because it would have to do the
> > vararg printing into a temporary buffer first, and then memcpy it to the
> > reserved space
>
> Thanks for your suggestion. I'll do this and compare their overhead. Before this
> prototype, I tried another design that modified vsnprintf() to support "%b" and
> kept everything else untouched, but I finally gave it up for performance reasons.
>
> > The bin_*write functions seem generic enough for any application that wants to
> > do this kind of tracing, but maybe it would be more clear to have them called
> > bin_*printf().
>
> I use bin_*write because it is not only a binary version of _stp_printf but also
> writes to the relayfs buffer directly. But you are right, bin_*printf() may be
> clearer.
>
> > The actual trace struct and wrapper functions e.g. stp_packet and lket_* it
> > seems to me should be part of an lket tapset.
>
> lket_* is part of LKET, while the struct stp_packet_h is not. stp_packet_h is
> similar to gBTI and should be necessary for each binary packet, whatever
> the trace format. It is mainly used by stpd to merge per-cpu files into
> "probe.out" at the end of tracing. Among its three fields, seq is shared with
> the current ascii packet, flag is used to differentiate between binary packets and
> ascii ones, and size is used to locate the end of the current binary packet.

I understand, but what if some other application, or even lket 6 months down the road, wants to use a different sequencing method, or doesn't need to stream alongside ascii, or wants to be able to log events larger than 256 bytes? Why force them to use this event definition when it's not necessary?

I was also suggesting that the merging code in stpd be removed from librelay and moved into a separate handler, with a suitable API to allow any application to supply merging methods based on its own sequencing method, or something like that.

It seems to me that the actual trace data can be completely opaque except to the specific trace application, and probably all that should be necessary is the information needed to distinguish between applications and, within an application, between trace format versions. How about, say, a u32 with the high 16 bits identifying the application and the low 16 for use by the application, e.g. to identify particular trace format versions?

(In reply to comment #14)
> I understand, but what if some other application, or even lket 6 months down the
> road, wants to use a different sequencing method, or doesn't need to stream
> alongside ascii, or wants to be able to log events larger than 256 bytes? Why
> force them to use this event definition when it's not necessary?

Yes, the binary packet header would be unnecessary if the application can provide a separate handler to do the merge itself; in that case all bin_*write() needs to do is write the incoming arguments to the buffer as they are. But for relayfs, I still think it should be maintained by bin_*write() and not by the application, otherwise each application has to explicitly access it and pass it as an argument.

> I was also suggesting that the merging code in stpd be removed from librelay and
> moved into a separate handler, with a suitable API to allow any application to
> supply merging methods based on its own sequencing method, or something like that.

I agree. But I think the current merge method (by seq-id) could be kept as the default choice for those users who don't want to take the trouble to rewrite the merge part.

> It seems to me that the actual trace data can be completely opaque except to the
> specific trace application, and probably all that should be necessary is the
> information needed to distinguish between applications and, within an application,
> between trace format versions. How about, say, a u32 with the high 16 bits
> identifying the application and the low 16 for use by the application, e.g. to
> identify particular trace format versions?
It looks good. If stpd has different built-in merge methods, this will help to choose the proper one. But if the merge handler is provided by the application, this should be maintained by the application itself.

(In reply to comment #11)
> Created an attachment (id=907)
> updated binary tracing prototype (runtime interface and example script; patch
> for merge)

It seems to trip over the same issue as gBTI. I think the binary transport interface should support a way to output the contents of binary packets to the console, for usability. Why don't you pass stpd the format string to tell it how to display the data?

And also, the BTI should be supported by stap as an embedded function (like printf()). Those embedded functions are handled as reserved interfaces by stap's parser (see parser::parse_symbol() in parser.cxx). Thus, the BTI will be able to handle variable arguments naturally.

(In reply to comment #16)
> It seems to trip over the same issue as gBTI.

Thanks for the ideas from gBTI.

> I think the binary transport interface should support a way to output the
> contents of binary packets to the console, for usability.
> Why don't you pass stpd the format string to tell it how to display the data?

I think the conversion of binary to console output is important, but it is better implemented by tapset providers than in stpd. As Tom suggested above, the application can provide handlers for merging; it's also possible for the application to provide handlers for conversion. The job of the binary transport interface and stpd is to provide a generic and efficient infrastructure. Different events require different trace formats. If we let stpd handle the details of a trace line, it may cause unnecessary overhead for some applications or add uncomfortable limits for others.

> And also, the BTI should be supported by stap as an embedded function (like
> printf()). Those embedded functions are handled as reserved interfaces by stap's
> parser (see parser::parse_symbol() in parser.cxx). Thus, the BTI will be able to
> handle variable arguments naturally.

I agree.

This talk about other "applications" and APIs needs to be made more specific. Do you imagine something linked into stpd? Or stpd just dumping the raw data to disk for some other program to parse after the fact? I don't expect to have more than a handful of tracing interfaces on the scripting side, and I don't expect 256 or more different sequencing or wrapping schemes.

If the transport code in the runtime will provide a fixed-size buffer reservation function, this will enable the higher-performance, compiled style of tracing discussed earlier. As long as that scheme can coexist with the plain ascii trace stream, and with this intermediate one, I will be satisfied.

(In reply to comment #18)
> This talk about other "applications" and APIs needs to be made more specific.
> Do you imagine something linked into stpd? Or stpd just dumping the raw data to
> disk for some other program to parse after the fact? I don't expect to have
> more than a handful of tracing interfaces on the scripting side, and I don't
> expect 256 or more different sequencing or wrapping schemes.

The way I see this working is stpd just dumping the raw data and nothing else. Then we can use other applications to analyse and visualize the data.

(In reply to comment #18)
> This talk about other "applications" and APIs needs to be made more specific.
> Do you imagine something linked into stpd? Or stpd just dumping the raw data to
> disk for some other program to parse after the fact? I don't expect to have
> more than a handful of tracing interfaces on the scripting side, and I don't
> expect 256 or more different sequencing or wrapping schemes.
>
> If the transport code in the runtime will provide a fixed-size buffer
> reservation function, this will enable the higher-performance, compiled style of
> tracing discussed earlier. As long as that scheme can coexist with the plain
> ascii trace stream, and with this intermediate one, I will be satisfied.

I've always thought of stpd as just one instance of many possible systemtap applications; in fact stpd.c is really just a front-end that provides command-line parsing, with the real work being done in librelay.c. So the user side of systemtap is already structured on the assumption that there will be other applications; there should be nothing to stop someone from writing his own command by cloning stpd.c and creating a new application, or even a new system command, built on top of systemtap.

So it should be possible for me to write a special-purpose standalone trace utility that, for example, doesn't require any sequence ids at all (also, what about the case of tracing on a UP system? No need for sequence ids there either), doesn't use any ascii data, so no ascii/binary flag is needed to distinguish between them, and requires event sizes larger than 256 bytes. Or maybe my application only has 2 events and each of those is never more than 128 bytes, so I can stick both my event id and event size into a single byte. So if I'm forced into using the suggested trace API, I'm stuck with writing a packet header I'll never use for each event. All I'm asking for is the ability to define my own trace format for my own 'application'. I think the bin_printf() addition to the language should work fine in all cases, but just don't force a mandatory header to be part of it. The 'application' can decide exactly what goes into the trace event from byte 0 to the end of the event.
One thing that probably needs to be done to preserve the current functionality is to make incrementing and retrieving the sequence counter a tapset function like get_cycles(), available for those who want to use it when applicable.

As far as implementation goes, tracing applications could be implemented as standalone commands, e.g. an 'lket' command, or stpd could be enhanced to handle different trace formats depending on the 'application'. stpd could handle a default case of, say, the get_sequence_id() sequence id in position 0 as it does now, basically defining its own format which it understands how to decipher just as any other application would, and merging the per-cpu files into a single file (some applications might not want to merge per-cpu output files at all, but rather provide a database-like API to efficiently retrieve data from the per-cpu files; imagine huge trace files). But if we're thinking about handling other trace formats in stpd, why not just have them as separate applications in the first place? In most cases they'd be dedicated to handling only the data output from their bin_printf() calls and won't need to deal with a commingled ascii/binary data stream as stpd does, i.e. they're basically single-minded tracing applications that don't need to deal with the generic ascii output that stpd does. So IMHO, the work that needs to be done is properly modularizing librelay.c, along with some small changes to the runtime, to make it easy to create new output-specific applications; or, if stpd is meant to be the only systemtap application, modifying it to accommodate multiple trace applications.

I don't mean to modify stpd greatly. I just want stpd to provide some capability to allow new packet formats to be easily added alongside the current plain ascii packets.

When stpd merges per-cpu files, it has to know which packet comes first and where the end of the current packet is. The former is decided by the sequence id. For an ascii packet, the latter is easily decided by the ending null terminator. But things are different for a binary packet, for there are '0's everywhere. Thus the size should be passed to stpd by some method. That's why 'size' is introduced in the packet header. If such a header is rejected, stpd can also use some API like get_packet_length() to get this. get_sequence_id() proposed above is also a good API candidate, but I prefer implementing the plain sequencing scheme as mandatory in bin_*write().

These APIs are implemented in the "applications". I consider stpd an "application", but mainly the part implemented by the tapset provider or the user. I don't want to replace stpd or modify it greatly. If necessary, they can define packet-format-specific APIs and thus change the default behavior of stpd. They can also let stpd skip the merge and later use a standalone program to process the raw data. By default, stpd will do the merge with the built-in plain sequencing scheme. If we can do this, then stpd can stay untouched when new binary trace formats are adopted. Moreover, different tapset providers are free to support their own packet formats without patching the stpd source every time.

Roughly speaking, for LKET, if bin_*write() rejects the stp_packet_h structure, lket_trace() will be responsible for handling {flag, size} and defining get_packet_length() to tell stpd the length of the current packet. But here comes another problem: is it worth the overhead of calculating the packet length in lket_trace()?

(In reply to comment #21)
> I don't mean to modify stpd greatly. I just want stpd to provide some
> capability to allow new packet formats to be easily added alongside the current
> plain ascii packets.
>
> When stpd merges per-cpu files, it has to know which packet comes first and
> where the end of the current packet is. The former is decided by the sequence
> id. For an ascii packet, the latter is easily decided by the ending null
> terminator. But things are different for a binary packet, for there are '0's
> everywhere. Thus the size should be passed to stpd by some method. That's why
> 'size' is introduced in the packet header.

To me the obvious solution for this problem is for stpd not to do merges of binary packets at all, and to let ascii output be logged to the console while binary data gets logged to its own per-cpu file. An application can later assemble the packets from the files in whatever way it sees fit.

> Roughly speaking, for LKET, if bin_*write() rejects the stp_packet_h structure,
> lket_trace() will be responsible for handling {flag, size} and defining
> get_packet_length() to tell stpd the length of the current packet. But here
> comes another problem: is it worth the overhead of calculating the packet
> length in lket_trace()?

This should be a job for lket; I agree with Tom that a generic interface should not force a format on the user.

I prefer binary_write to use the stp buffer instead of the relayfs buffer.

disadvantage:
1. the additional overhead of using the stp buffer is the memcpy, which should be negligible and will be offset if there are multiple write calls in a single handler.

advantage:
1. it could use different transport layers, not being confined to relayfs.
2. it could use the stp buffer to reduce the number of operations on the relayfs layer. Suppose 5 binary_write calls are made inside a single probe handler: the overhead of using the relayfs buffer directly will be larger than using the stp buffer.
3. it won't have the problems that Guijian mentioned (Known bug:) in comment #9

> Roughly speaking, for LKET, If bin_*write() reject stp_packet_h structure,
> lket_trace() will be responsible to handle {flag, size} and define
> get_packet_length() to tell stpd the length of current packet. But here comes
> another problem: is it worthy of the overhead of calculating the packet length
> in lket_trace()?
for some events, e.g. the backtrace, the actual length of the data to be printed
can't be known until we finish scanning from beginning to end.
But this can be solved by writing into a pre-reserved len field at the
beginning of the stp buffer (say, 2 bytes) when we finish the scanning (vsnprintfb).
Or we can use some delimiters to mark the boundary of each event.
Anyway, all of the above solutions can be done by each individual trace
application.
So I agree with Tom now: the stp_packet_h is not necessary and it should be
application specific. Maybe we need to put some hooks into the runtime and let
each app register its format-specific callback.
Created attachment 915 [details]
binary tracing prototype code modified to remove hard-coded packet header
This is a modified version of the prototype which removes the dedicated packet
header. I also modified lket_trace() to add the removed header data to the
bin_write() call instead (see below). To enable this, I added a new %n format
specifier (i.e. %n (same as %4n), %1n, %2n, %4n) which tells bin_write() to
write the 'written' length at the specified position after it's all done. Since
it doesn't actually write any data from the param list, I also added a function
called this_event_len(), which by design does nothing but provides the user with
something reasonable-sounding to put in that position in the param list.
Here's what the lket_trace() function looks like now with the header moved
into the rest of the event (it's basically a drop-in replacement for the old
code, minimally tested but it works for me):
/*
 * Example scripts to use the runtime interface above
 */
%{
#define lket_trace(hookID, fmt, args...) do { \
	struct timeval tv; \
	do_gettimeofday (&tv); \
	bin_write("%b%1b%1n|LKET:%b|%b|%b|%b|%b|%1b|%1b|"fmt"\n", \
		(int64_t)next_seq(), (int64_t)STP_BIN_PACKET, \
		(int64_t)this_event_len(), \
		(int64_t)tv.tv_sec, (int64_t)tv.tv_usec, \
		(int64_t)current->tgid, \
		(int64_t)current->parent->pid, (int64_t)current->pid, \
		(int64_t)current->thread_info->cpu, \
		(int64_t)hookID, args); \
} while(0)
%}
I think it's convenient to be able to do this, but I don't think it's actually
necessary to have an event length - with a proper event description, it should
always be possible to make sense of all the data from the beginning of an
event to the end. A length makes it easier and quicker to get to the next
event, which is useful at times, so I think having %n or something like it
makes sense.
Anyway, at this point I think it would be useful to think through all the
details (e.g. event descriptions, transport/librelay changes if any, etc.) by
doing an end-to-end walk-through for a couple of concrete examples. I'll
take a stab at this over the next couple of days...
> I also modified lket_trace() to add the removed header data to the
> bin_write() call instead (see below). To enable this, I added a new %n format
> specifier (i.e. %n (same as 4n), %1n, %2n, %4n) which tells bin_write() to
> write
> the 'written' length at the specified position after it's all done. Since it
> doesn't actually write any data from the param list, I also added a function
> called this_event_len(), which by design does nothing but provides the user
> something reasonable-sounding to put in that position in the param list.
I looked at your bin_write.stp, and found that bin_vsnwrite should change the
way it handles the %n specifier:
root:/home/lgl> diff -uprN bin_write.stp bin_write_mod.stp
--- bin_write.stp 2006-03-12 16:04:56.000000000 +0800
+++ bin_write_mod.stp 2006-03-13 14:21:20.000000000 +0800
@@ -141,6 +141,21 @@ int bin_vsnwrite(char *buf, size_t size,
temp = va_arg(args, int64_t);
write_len = str;
write_len_width = field_width;
+ switch(write_len_width) {
+ case 1:
+ str+=1;
+ written+=1;
+ break;
+ case 2:
+ str+=2;
+ written+=2;
+ break;
+ case 4:
+ str+=4;
+ written+=4;
+ default: // "%4b" by default
+ break;
+ }
break;
default:
if (str <= end)
Oops, change the handling of %4n & %n to:
> + case 4:
> + default: // "%4b" by default
> + str+=4;
> + written+=4;
> + break;
> + }
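The corrected width handling boils down to the following standalone sketch of the semantics (not the runtime code itself): the writer must skip past the reserved length slot so later fields don't overwrite it.

```c
#include <assert.h>

/* Sketch of the corrected "%n" width handling: return how many bytes
 * the writer must reserve (and advance past) for the length slot.
 * A bare "%n" is treated as "%4n". */
static int reserved_len_bytes(int write_len_width)
{
	switch (write_len_width) {
	case 1:
		return 1;
	case 2:
		return 2;
	case 4:
	default:		/* "%n" defaults to a 4-byte slot */
		return 4;
	}
}
```

Letting case 4 fall through to the default (rather than the other way around) is what makes %4n and %n behave identically.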
(In reply to comment #25)
> This is a modified version of the prototype which removes the dedicated packet
> header. I also modified lket_trace() to add the removed header data to the
> bin_write() call instead (see below). To enable this, I added a new %n format
> specifier (i.e. %n (same as 4n), %1n, %2n, %4n) which tells bin_write() to
> write the 'written' length at the specified position after it's all done.

I think "%n" is a really good idea for solving the packet length problem.

In bin_vsnwrite(), "written" is the total number of bytes that would be
generated for the given input, without regard to the "size" limit. Thus it
would differ from the actual packet length when the packet crosses the subbuf
boundary. If the packet length is greater than the subbuf size, the packet has
to be truncated. This rarely happens, but it should be considered.

Created attachment 917 [details]
add bin_*printf() interface for comparison
This is modified from Tom's code. Based on the discussions above, I added a new
binary trace interface, bin_*printf(), for comparison.
Main changes compared with bin_*write():
1) It is similar to _stp_*printf() and not limited to relayfs mode.
2) It gives up writing to the relayfs buffer directly and uses stp_print_buf
instead.
3) It shares stp_print_buf with _stp_*print() and thus can be used together
with it to generate a single packet.
4) It avoids the known bug mentioned in #9.
5) It is consistent with the current plain-ASCII scheme.
As for stpd, I added two new functions: get_sequence_id() and get_packet_len().
Thanks in advance for your comments.
One other request: the original ascii-vs-binary printing benchmark that
prompted binary tracing should be rerun with this new code, and also sent to
wcohen for inclusion in the forthcoming benchmark suite.

(In reply to comment #29)
> Created an attachment (id=917)
> add bin_*printf() interface for comparison
>
> This is modified based on Tom's code. I add a new binary trace interface
> bin_*printf() for comparison based on discussions above.
>
> Main changes compared with bin_*write():
> 1) It is similar to _stp_*printf() and not only limited to relayfs mode.
> 2) It gives up writing to relayfs buffer directly and uses stp_print_buf
> instead.
> 3) It shares stp_print_buf with _stp_*print() and thus can be used together to
> generate a single packet.
> 4) It avoids the known bug mentioned in #9
> 5) It is consistent with current plain ascii scheme
>
> As for stpd, I added two new functions: get_sequence_id() and get_packet_len().
>
> Thanks in advance for your comments.

I noticed in the new code that you removed the sequence id, but unless I'm
missing something, you still need a sequence for each event (or you should be
able to use the timestamp you're already logging instead, see below);
otherwise you could have event a at time 0 logged on cpu0, event b at time 1
on cpu1, and event c at time 2 on cpu0, but appearing in the output as acb or
bac. The sequence number generated by _stp_flush() would give you the ordering
between events if there was only one event per flush, but since you're
buffering them up, the sequence is correct only relative to the events on that
cpu; it doesn't tell you the ordering between events on different cpus. But
anyway, you're logging a timestamp, so you should be able to use that for
ordering, e.g.
we should be able to generalize the merge in stpd and do something similar to
the compare() function used with qsort():

static inline int __compare(long a, long b)
{
	if (a < b)
		return -1;
	else if (a > b)
		return 1;
	return 0;
}

#define SEC_OFFSET (sizeof(u_int8_t))
#define USEC_OFFSET (SEC_OFFSET + sizeof(long))

int compare(const void *event_a, const void *event_b)
{
	long a = *(long *)((const char *)event_a + SEC_OFFSET);
	long b = *(long *)((const char *)event_b + SEC_OFFSET);

	if (a != b)
		return __compare(a, b);
	a = *(long *)((const char *)event_a + USEC_OFFSET);
	b = *(long *)((const char *)event_b + USEC_OFFSET);
	return __compare(a, b);
}

As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:

lket_trace(something1);
lket_trace(something2);

instead of

log(something1);
lket_trace(something2);

i.e. it seems to me that the ascii and binary streams should be viewed as
logically separate streams even though they actually go over the same 'wire',
and since you can log ascii as well over the binary stream, why do you need to
correlate with the ascii stream?

(In reply to comment #31)
> I noticed in the new code that you removed the sequence id, but it seems to me
> unless I'm missing something you still need a sequence for each event (or you
> should be able to use the timestamp you're already logging instead, see below),
> otherwise you could have event a at time 0 logged on cpu0, event b at time 1 on
> cpu1 and event c at time 2 on cpu 0, but appearing in the output as acb or bac.
> The sequence number generated by _stp_flush() would give you the ordering
> between events if there was only 1 event per flush, but since you're buffering
> them up the sequence is correct relative to the events on that cpu, but it
> doesn't tell you the ordering between events on different cpus. But anyway,
> you're logging a timestamp, so you should be able to use that for ordering, e.g.
> we should be able to generalize the merge in stpd and do something similar to
> the compare() function in qsort:

Correct, one of the purposes of the timestamp is the sequencing of events.
Sequencing is something we would do in user space.

> As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:
>
> lket_trace(something1);
> lket_trace(something2);
>
> instead of
>
> log(something1);
> lket_trace(something2);
>
> i.e. it seems to me that the ascii and binary streams should be viewed as
> logically separate streams even though they actually go over the same 'wire',
> and since you can log ascii as well over the binary stream, why do you need to
> correlate with the ascii stream?

Correct again... I'm not sure why Gui Jian views this as a bug. I do not see a
need to correlate binary and ascii streams for tracing purposes.

(In reply to comment #31)
> As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:
> lket_trace(something1);
> lket_trace(something2);
>
> instead of
>
> log(something1);
> lket_trace(something2);
>
> i.e. it seems to me that the ascii and binary streams should be viewed as
> logically separate streams even though they actually go over the same 'wire',
> and since you can log ascii as well over the binary stream, why do you need to
> correlate with the ascii stream?

I think if the binary trace interface can be integrated into the runtime and
provided to the user just like _stp_*printf(), we should allow the user to use
them freely. For the user's convenience, we should try to minimize the
constraints on where, how, and in what order to use it. The user may have no
idea about stp_*flush() or the subtle difference between _stp_*printf() and
bin_*printf(). It is reasonable for him to expect that what he gets is what he
codes :-) This is why I said it is a bug.
Anyway, if we don't need to correlate the binary stream with the ascii stream
at all, or don't expect the user to use them in similar ways, I agree it
should not be considered a bug.

(In reply to comment #31)
> I noticed in the new code that you removed the sequence id, but it seems to me
> unless I'm missing something you still need a sequence for each event (or you
> should be able to use the timestamp you're already logging instead, see below),
> otherwise you could have event a at time 0 logged on cpu0, event b at time 1 on
> cpu1 and event c at time 2 on cpu 0, but appearing in the output as acb or bac.
> The sequence number generated by _stp_flush() would give you the ordering
> between events if there was only 1 event per flush, but since you're buffering
> them up the sequence is correct relative to the events on that cpu, but it
> doesn't tell you the ordering between events on different cpus.

The original bin_*write() will generate a binary packet each time it is
called, because it forces a binary packet header. But the design of
bin_*printf() is different: it only writes data into the stp-print-buffer and
lets _stp_*flush() handle the sequence, as you mentioned. This is to be
consistent with _stp_*printf() and to get rid of the restriction to relayfs
mode.

> > you're logging a timestamp, so you should be able to use that for ordering, e.g.
> we should be able to generalize the merge in stpd and do something similar to
> the compare() function in qsort

For LKET, it is good to use the timestamp for merging, and the sequence
information is redundant. But in general, taking data space into
consideration, merging by timestamp is expensive (8 bytes on ppc32 or 16 bytes
on ppc64). So I agree the comparison should be modularized, but more schemes
should be considered.

I think one important question is: what do we want the binary trace interface
to be:
something like _stp_*sprintf() except that it stores data in binary format, or
something totally controlled by the user to store data as they want? I think
this is also the main difference between bin_*printf() and bin_*write().

(In reply to comment #33)
> (In reply to comment #31)
> > As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:
> > lket_trace(something1);
> > lket_trace(something2);
> >
> > instead of
> >
> > log(something1);
> > lket_trace(something2);
> >
> > i.e. it seems to me that the ascii and binary streams should be viewed as
> > logically separate streams even though they actually go over the same 'wire',
> > and since you can log ascii as well over the binary stream, why do you need to
> > correlate with the ascii stream?
>
> I think if the binary trace interface can be integrated into runtime and
> provided to the user just like _stp_*printf(), we should allow the user to use
> them freely. For the user's convenience, we should try minimize the constraints
> about where, how and in what order to use it. The user may has no idea about
> stp_*flush() or the subtle difference between _stp_*printf() and bin_*printf().
> It reasonable for him to require what he gets is what he codes :-) This is why I
> said it is a bug.
> Anyway, if we don't need to correlate binary stream with ascii stream at all or
> don't assume the user to use them in similar ways, I agree it should not be
> considered as a bug.

Just my opinion, but as a user, when I think of printf() I think of data going
to a console, and when I think of log() or trace(), I think of binary data
which I probably don't want to go to the console. If I want to trace to the
console, I'd just use printf(). So maybe there should be a clear difference
made to the user between the two. log(), though, seems like it may be
misnamed, i.e. it sounds like a binary function but it's actually not.
The printf() part of bin_printf() is ok for me because it's an analogue of
printf(), but since it's prefixed with bin_ it's clear to me what it's for.

(In reply to comment #34)
> (In reply to comment #31)
> > I noticed in the new code that you removed the sequence id, but it seems to me
> > unless I'm missing something you still need a sequence for each event (or you
> > should be able to use the timestamp you're already logging instead, see below),
> > otherwise you could have event a at time 0 logged on cpu0, event b at time 1 on
> > cpu1 and event c at time 2 on cpu 0, but appearing in the output as acb or bac.
> > The sequence number generated by _stp_flush() would give you the ordering
> > between events if there was only 1 event per flush, but since you're buffering
> > them up the sequence is correct relative to the events on that cpu, but it
> > doesn't tell you the ordering between events on different cpus.
>
> The original bin_*write() will generate a binary packet each time it is called,
> because it forces a binary packet header. But the design of bin_*printf() is
> different. It only writes data into the stp-print-buffer and lets _stp_*flush()
> handle the sequence as you mentioned. It is to be consistent with
> _stp_*printf() and to get rid of the limit to relayfs mode.
>
> > > you're logging a timestamp, so you should be able to use that for ordering, e.g.
> > we should be able to generalize the merge in stpd and do something similar to
> > the compare() function in qsort
>
> For LKET, it is good to use timestamp for merging and sequence information is
> redundant. But in general, taking data space into consideration, it is luxurious
> to merge by timestamp (8 bytes for ppc32 or 16 bytes for ppc64). So I agree the
> comparison is modularized but more schemes should be considered.
I agree, it should be up to the application to log whatever it needs to do its
merge, and the compare function should also be tailored for the application,
or were you thinking of something else?

> I think one important question is: what we want the binary trace interface to
> be? something like _stp_*sprintf() except that it stores data in binary format,
> or something totally controlled by the user to store data as they want? I think
> this is also the main difference between bin_*printf() and bin_*write().

I thought it was supposed to be something that would give the user complete
control over a trace. At least, if it's not too hard, why not make it as
generic and flexible as possible? Now that the header is gone, it's almost
there; the only other thing that needs to be removed is the sequence number
being generated by _stp_print_flush(), since it's redundant if you're tracing
your own sequences or timestamps.

For stpd, an int still needs to be in that same position in order to coexist
with the ascii stream, so maybe for a binary event you can define an
_stp_bin_print_flush() that doesn't log the sequence counter, add a %b to the
beginning of your event and log a 0 there in place of the sequence counter (0
is never used as a sequence counter in stpd), and store your own sequence or
timestamp in your event as you already do. So your code stays basically the
same and you don't have to log a redundant sequence counter, you get the
transport-independence, and other applications that don't have to coexist with
an ascii stream can call bin_printf() and won't have to log the sequence
counter or anything other than what they want. You could then also get rid of
the binary/ascii flag, since you can use the 0 sequence counter for that.

Maybe there's a better way, or maybe it's not worth the effort, but it would
be nice to have a trace interface that allows applications to log only the
data they want and nothing more.
(In reply to comment #35)
> Just my opinion, but as a user when I think of printf() I think of data going to
> a console, and when I think of log() or trace(), I think of binary data which I
> probably don't want to go to the console. If I want to trace to the console,
> I'd just use printf(). So maybe there should be a clear difference made to the

This is my opinion as well. I personally don't think there is any use in
printing binary data to the console. I guess one could make stpd able to
decode the data and print it to the console in formatted fashion, but what
would be the point of that? That's what printf() is for.

> user between the two. log(), though, seems like it may be misnamed i.e. it
> sounds like a binary function but it's actually not. The printf() part of
> bin_printf() is ok for me because it's an analogue of printf() but since it's
> prefixed with bin_ it's clear to me what it's for.

In my opinion the whole bin_printf() is a bit misleading, since to me the
"print" part implies that we are using printable characters. I like using
write() or bin_write() because this implies that the data is just being
written to a buffer without any form of formatting involved.

I've been trying to catch up with all the comments.

I don't like the idea of having a "binary output" and an "ascii output". There
should be one output, and a way to always write console messages. We mostly
have that now.

With procfs, if you write binary data (including mixed binary and ascii), you
want it to go to a file or to another application which will display it. You
can redirect to a file with "stap foo.stp > out" or "stap -o out foo.stp", or
pipe it to another app.

Once we have binary output, I agree that we can remove the timestamp and
interleaving code for relayfs. The output would be the per-cpu files. Nothing
would get displayed on the console from the bulk data in relayfs, so
redirecting the output would be meaningless for relayfs.
Right now, we have two functions, warn() and error(), that always write to the
console. Actually, stpd writes their messages to stderr. We could easily make
log() do the same, without signalling an error. Or call it print_console()?

log() is obsolete and should be deprecated or redefined to do something useful.

So in the end, I would still like to see things as simple as possible, with
only the following:

print()
printf() (which recognizes %b binary extensions)
print_console() - always short ASCII text messages. Should not be used for
data from the script output.

(In reply to comment #38)
> I've been trying to catch up with all the comments.
>
> I don't like the idea of having a "binary output" and an "ascii output". There
> should be one output. And a way to always write console messages. We mostly have
> that now.
>
> With procfs, if you write binary data (including mixed binary and ascii), you
> want it to go to a file or to another application which will display it. You can
> redirect to a file with "stap foo.stp > out" or "stap -o out foo.stp" Or pipe it
> to another app.
>
> Once we have binary output, I agree that we can remove the timestamp and
> interleaving code for relayfs. The output would be the per-cpu files.
> Nothing would get displayed on the console from the bulk data in relayfs.
> So redirecting the output would be meaningless for relayfs.
>
> Right now, we have two functions warn() and error() that always write to the
> console. Actually stpd writes their messages to stderr. We could easily make
> log() do the same, without signalling an error. Or call it print_console()?
>
> log() is obsolete and should be deprecated or redefined to do something useful.
>
> So in the end, I would still like to see things as simple as possible with only
> the following
>
> print()
> printf() (which recognizes %b binary extensions)
> print_console() - always short ASCII text messages. Should not be used for data
> from the script output.
It sounds good to me - if I understand correctly, any print() or printf() data
for a script invoked as "stap foo.stp" goes to the console or can be
redirected to a file, while with "stap -b foo.stp" it goes to per-cpu files,
and any print_console() data or warnings basically goes through the OOB
channel and isn't part of the data.

The one question I have is that if the script is run with -b, it needs to use
timestamps or a sequence for ordering, and right now there's code in
_stp_print_flush() that writes one, so everything will work fine if that's
there; but if we wanted to remove it, scripts that don't do their own
timestamping and are run with -b won't be ordered correctly. So it seems to me
we'd need to either not worry about it and mandate the timestamp for -b, or by
default have the timestamp generated but provide an alternative function
that's exactly the same but doesn't write the timestamp, for scripts that want
to manage it themselves, e.g. __printf().

Subject: Re: binary tracing

On Tue, 2006-03-14 at 21:56 +0000, zanussi at us dot ibm dot com wrote:

> It sounds good to me - if I understand correctly, any print() or printf() data
> for a script invoked as "stap foo.stp" goes to the console or can be redirected
> to a file, while with "stap -b foo.stp" it goes to per-cpu files, and any
> print_console() data or warnings basically goes through the OOB channel and
> isn't part of the data.

Yes.

> The one question I have is that if the script is run
> with -b, it needs to use timestamps or a sequence for ordering, and right now,
> there's the code in _stp_print_flush() that writes one, so everything will work
> fine if that's there, but if we wanted to remove it, scripts that don't do their
> own timestamping and are run with -b won't be ordered correctly.
> So it seems to
> me we'd need to either not worry about it and mandate the timestamp for -b, or
> by default have the timestamp generated but provide an alternative function
> that's exactly the same, but doesn't write the timestamp, for scripts that want
> to manage it themselves, e.g. __printf().

I assume relayfs users are probably using systemtap as part of a larger tool,
and they want the files in a particular format. So I don't think we should
enforce any sequence number or timestamp. In that case, the output would be a
group of files, one per cpu. There would be no automatic combining into a
single file. Perhaps "stap -b -o xyzzy foo.stp" would create files
xyzzy.[cpu_number].

Is there a case for automatically (as an option or default) generating
timestamps or sequence numbers? I put that code in there to do it, but that
was when we were trying to make relayfs do everything. It seems to me that if
you want sequential output, you would use procfs. But I rarely use relayfs, so
I may be misunderstanding the issue.

(In reply to comment #40)
> Subject: Re: binary tracing
>
> On Tue, 2006-03-14 at 21:56 +0000, zanussi at us dot ibm dot com wrote:
>
> > It sounds good to me - if I understand correctly, any print() or printf() data
> > for a script invoked as "stap foo.stp" goes to the console or can be redirected
> > to a file, while with "stap -b foo.stp" it goes to per-cpu files, and any
> > print_console() data or warnings basically goes through the OOB channel and
> > isn't part of the data.
>
> Yes.
>
> > The one question I have is that if the script is run
> > with -b, it needs to use timestamps or a sequence for ordering, and right now,
> > there's the code in _stp_print_flush() that writes one, so everything will work
> > fine if that's there, but if we wanted to remove it, scripts that don't do their
> > own timestamping and are run with -b won't be ordered correctly.
> > So it seems to
> > me we'd need to either not worry about it and mandate the timestamp for -b, or
> > by default have the timestamp generated but provide an alternative function
> > that's exactly the same, but doesn't write the timestamp, for scripts that want
> > to manage it themselves, e.g. __printf().
>
> I assume relayfs users are probably using systemtap as part of a larger
> tool and they want the files in a particular format. So I don't think we
> should enforce any sequence number or timestamp. In that case, the
> output would be a group of files, one per cpu. There would be no
> automatic combining that into a single file. Perhaps "stap -b -o xyzzy
> foo.stp" would create files xyzzy.[cpu_number].
>
> Is there a case for automatically (as an option or default)
> generating timestamps or sequence numbers? I put that code in there to
> do it but that was when we were trying to make relayfs do everything. It
> seems to me that if you want sequential output, you would use procfs.

I think so. If you use the -b option with a script that doesn't print its own
binary timestamps, it should still work (as it does now).

Created attachment 936 [details]
prototype generic binary tracer using generic binary tracing interface
To test out the prototype binary trace interface, I wrote a prototype
generic binary tracer that uses it, which I've attached here. It's
based on Jian Gui's latest bin_write.stp, but at his and Jose's
suggestion, I've split things up and moved the runtime part to
runtime/bin_write.c and renamed it, etc.
Basically, the trace format it implements uses the first byte to
identify the event, and to start off, there's only one event, which is
the 'define new event' event, used to define new events during the
trace. read() and open() are instrumented here as examples.
Event types are defined within the probe (in begin()), actual probe
hits log the matching binary events in their handlers, and it all gets
generically sorted out at the end of the run in stpd. This version
replaces the merge() in stpd with a function called print_binary()
that also merges the output but uses the event descriptions to
translate the binary events into generic text versions that are then
displayed to the console. Of course, the output is generic and
therefore not pretty (see below for example), but it shouldn't be too
hard to add 'format strings' to event descriptions that would be used
in place of the generically formatted print function. Also, it should
be easy for anyone to write their own replacement for print_binary.c,
which wouldn't need to print at all e.g. import_into_db_and_query.c
(the generic descriptions should make this easy) or
postprocess_and_graph.c.
It also makes changes to the runtime, removing the need to manipulate
the relayfs buffers in bin_write - basically what it does is use
relay_reserve() to get 'buffers' that replace the _stp_pbuf static
buffers.
Again, it's only been minimally tested and probably breaks non -b
tracing, and there are most likely other problems; this is all just
prototype code meant for demonstration and discussion purposes, so
buyer beware.
Some thoughts on the interface after having gone through the exercise
of actually logging binary data using it...
- keeping data positions straight when dealing with long format
strings is really a pain, but I guess no different from regular
printf in that respect. I can't really think of a better way,
though, other than having a way to break up the logging for an
event into separate calls.
- defining event descriptions is also a pain; there should be some
higher-level way of doing it e.g. define_event("event name"),
add_field("field name", type). Of course this is a format-specific
thing, so not really on-topic here.
- tailoring the printed trace output could be accomplished by
supplying a format string to the event description as mentioned
above. Wanting to do this at all is an application-specific thing
though i.e. only a generically-printing back-end would want it, so
again, it's probably off-topic for a generic trace interface.
- there doesn't seem to be a way to log variable-length data, other
than strings. To log non-string variable-length data e.g. arrays or
the contents of target variables, there could be some function like
bin_log(object). This could in turn be defined as e.g
__bin_log(object, length). This would log a length followed by
'length' number of bytes from the object. If an object already has
a length associated with it e.g. a systemtap array, bin_log(array)
could be used and internally translated to __bin_log(array,
array.len), which would log the count of elements in the array
followed by the array elements. To log the contents of a target
struct, you could say for example bin_log(*$file) if $file referred
to a struct file *, which would map to the same thing as
__bin_log($file, sizeof(*$file)); I assume the dwarf info would make
that possible. To make all this accessible to printf() we could add
another format specifier, say '%o'. If the object passed in has
length information associated with it, '%o' would be able to
automatically log the object contents. In cases where there is no
associated length for an object, or you want to log a subset, maybe
there could be a function called make_print_object(object, len).
This would create an object that could then be passed to printf()
and have the necessary length information associated with it.
- For the binary vs. console stream, maybe it makes sense to add an
fprintf() and define printf() and print_console() in terms of that,
so everything stays the same except that binary data could be made
to go to a separate file if wanted, by using fprintf(). printf()
would remain as it is but would be defined as fprintf(stdout),
print_console() would internally be fprintf(stderr) and binary data
could be made to go to file by fprintf(stdfile). There wouldn't be
any corresponding fopen(); stdfile would be the only file stream
available and that would refer to normal stap file output.
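The length-prefixed logging suggested above for __bin_log(object, length) can be sketched in userspace C; the function name and the 32-bit length prefix are assumptions drawn from the description, not an existing runtime API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical __bin_log() sketch: write a 32-bit length followed by
 * 'len' raw bytes of the object, so a reader can decode or skip
 * variable-length data.  Returns the number of bytes emitted.
 * Native byte order is assumed. */
static size_t bin_log_object(uint8_t *buf, const void *obj, uint32_t len)
{
	memcpy(buf, &len, sizeof(len));		/* length prefix */
	memcpy(buf + sizeof(len), obj, len);	/* raw object bytes */
	return sizeof(len) + len;
}
```

A reader walks the stream by reading the prefix, then consuming (or skipping) exactly that many bytes, which is what makes arrays and dumped target structs decodable without a fixed-size field.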
Example output:
<0> read: id: 1, timeval: 1142896381.115226, tgid: 13495, parent pid: 13408,
pid: 13495, name: read, fd: 3, count: 8192,
<0> read: id: 1, timeval: 1142896381.115233, tgid: 13495, parent pid: 13408,
pid: 13495, name: read, fd: 3, count: 8192,
<0> open: id: 2, timeval: 1142896381.875678, tgid: 2739, parent pid: 2256, pid:
2739, filename: /dev/hdc, execname: hald-addon-stor,
<0> read: id: 1, timeval: 1142896383.876072, tgid: 3532, parent pid: 3529, pid:
3532, name: read, fd: 3, count: 16384,
<1> read: id: 1, timeval: 1142896383.876199, tgid: 13495, parent pid: 13408,
pid: 13495, name: read, fd: 3, count: 8192,
<0> open: id: 2, timeval: 1142896383.879811, tgid: 2739, parent pid: 2256, pid:
2739, filename: /dev/hdc, execname: hald-addon-stor,
events_processed: 8
Created attachment 937 [details]
somehow bin_write.c didn't get into the prototype patch - here's a new patch
Oops, somehow bin_write.c didn't get included, here's an updated tarball with
the right patch.
This prototype works on my i686/2.6.15.6 with systemtap-20060318.

But when I used _stp_*print() in my script, the system always crashed. I found
the reason is that the _stp_pbuf[cpu] pointers are not initialized before they
are used in _stp_sprintf(), as they are in bin_swrite().

I think the following small patch should be integrated into this prototype. If
I am wrong about something, please let me know.

diff -Nur src-20060318.bin_write/runtime/runtime.h src-modified/runtime/runtime.h
--- src-20060318.bin_write/runtime/runtime.h	2006-03-23 01:38:24.000000000 +0800
+++ src-modified/runtime/runtime.h	2006-03-23 01:42:27.000000000 +0800
@@ -59,8 +59,8 @@
 #endif /* RELAYFS */

 #include "print.c"
-#include "string.c"
 #include "bin_write.c"
+#include "string.c"
 #include "arith.c"
 #include "copy.c"
 #include "sym.h"
diff -Nur src-20060318.bin_write/runtime/string.c src-modified/runtime/string.c
--- src-20060318.bin_write/runtime/string.c	2005-11-01 03:23:21.000000000 +0800
+++ src-modified/runtime/string.c	2006-03-23 01:42:27.000000000 +0800
@@ -59,9 +59,12 @@
 	int num;
 	va_list args;
 	if (str == _stp_stdout) {
+		char *buf;
+		int size;
 		int cpu = smp_processor_id();
-		char *buf = &_stp_pbuf[cpu][STP_PRINT_BUF_START] + _stp_pbuf_len[cpu];
-		int size = STP_PRINT_BUF_LEN - _stp_pbuf_len[cpu] + 1;
+		check_null_buffer(cpu);
+		buf = &_stp_pbuf[cpu][STP_PRINT_BUF_START] + _stp_pbuf_len[cpu];
+		size = STP_PRINT_BUF_LEN - _stp_pbuf_len[cpu] + 1;
 		va_start(args, fmt);
 		num = vsnprintf(buf, size, fmt, args);
 		va_end(args);

(In reply to comment #44)
> This prototype works in my i686/2.6.15.6 with systemtap-20060318.
>
> But when I used _stp_*print() in my script, system always crashed.
> I found the reason is that _stp_pbuf[cpu] pointers are not
> initialized before they are used in _stp_sprintf() as in bin_swrite().
>
> I think the following small patch should be integrated into this
> prototype. If I am wrong with something, please let me know.
> [patch snipped]

Yeah, that should work for now. The buffers should be set up when the
transport is initialized anyway, so check_null_buffer() should go away; I'll
do that in the next version. For this demo I only modified what needed to be
modified, so I didn't go through the rest of the runtime making sure
everything else was right.

I compared the overhead of bin_write vs. _stp_printf. Here are the results.
The values are the average time per getuid() call in the loop, as shown in
test_getuid.c.
Results on ppc64/2.6.9-34.EL:

  No stap:                    333 ns
  Empty probe:               1492 ns
  100 char binary write:     3067 ns
  100 char ascii printf:     2849 ns
  5 integers binary write:   2045 ns
  5 integers ascii printf:   5077 ns
  10 integers binary write:  2451 ns
  10 integers ascii printf:  8661 ns

Results on i686/2.6.15.6:

  No stap:                    570 ns
  Empty probe:               2258 ns
  100 char binary write:     3486 ns
  100 char ascii printf:     2774 ns
  5 integers binary write:   2427 ns
  5 integers ascii printf:   3447 ns
  10 integers binary write:  2984 ns
  10 integers ascii printf:  4522 ns

As we expected, the binary trace interface is more efficient than
_stp_printf when writing integers, and the advantage grows as the number of
integer arguments increases. But bin_write is not as good as _stp_printf
when writing only characters, which is a little strange.

I have tried to do this with the new benchmark framework
(http://sourceware.org/ml/systemtap/2006-q1/msg00790.html), but still have
problems getting it to work with the current runtime patch. Another problem
I encountered is that the system always crashes when I increase LOOPS in
test_getuid.c, e.g. to 100000LL. I am still working on these.

Here are the stp script and the C file I used to do the comparison. I use
another shell script to run "stap -g -b -DTESTNUM=n getuid.stp -c
./test_getuid" iteratively.
/* getuid.stp */
function trace_func() %{
#if (TESTNUM==1)
	bin_write("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890");
#elif (TESTNUM==2)
	_stp_printf("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890");
#elif (TESTNUM==3)
	bin_write("%1b%2b%4b%d%8b", 0x11LL, 0x2233LL, 0x44556677LL,
		  0x44556677LL, 0x1122334455667788LL);
#elif (TESTNUM==4)
	_stp_printf("%d%d%d%d%lld", 0x11LL, 0x2233LL, 0x44556677LL,
		    0x44556677LL, 0x1122334455667788LL);
#elif (TESTNUM==5)
	bin_write("%1b%2b%4b%d%8b%1b%2b%4b%d%8b",
		  0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL,
		  0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL);
#elif (TESTNUM==6)
	_stp_printf("%d%d%d%d%lld%d%d%d%d%lld",
		    0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL,
		    0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL);
#else
	// empty probe
	;
#endif
%}

probe kernel.function("sys_getuid") { trace_func() }

/* test_getuid.c */
#include <stdio.h>	/* needed for printf() */
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

#define LOOPS 10000LL

int main(void)
{
	struct timeval tv[2];
	int i;
	double time;

	gettimeofday(&tv[0], NULL);
	for (i = 0; i < LOOPS; i++)
		getuid();
	gettimeofday(&tv[1], NULL);

	time = ((tv[1].tv_sec - tv[0].tv_sec) * 1000000.0 +
		(tv[1].tv_usec - tv[0].tv_usec)) * 1000 / LOOPS;
	printf("%d ns\n", (int)time);
	return 0;
}

See http://sources.redhat.com/ml/systemtap/2006-q2/msg00084.html

I've checked in a new printf() to implement the features discussed in this
PR. If I missed anything, let me know. When writing lots of numbers, the
speed increase is huge. There are still some optimizations that could be
done, including Tom's proposed translator optimization to inline some
printfs.

Nothing left to do here. There is bug 3217 for compiled binary tracing as a
further enhancement.