This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: consistent printf race conditions
Scripts below. Unfortunately, part of what I'm using this script for is figuring out what the workload is, so I can't provide you with a reliable way to replicate it. I'll try to include some of the surrounding output on the offchance that might help. FWIW, this is SystemTap 1.3.
-----
probe syscall.close.return {
uid = uid()
track_uid = uid == 42 // normally does more checks like this
if ((execname() != "stap" && execname() != "salt-minion" && execname() != "stap-io" && track_uid) && $return == 0) {
// checks if pid and fd are in an associative array
if (!is_fd_blacklisted(pid(), $fd)) {
printf("{\"execname\":\"%s\",\"fd\":%d,\"op\":\"%s\",\"pid\":%d,\"ppid\":%d,\"return\":%d,\"timestamp\":%d,\"uid
\":%d}\n",clean_string(execname()),$fd,clean_string("CLOSE"),pid(),ppid(),$return,gettimeofday_ms(),uid())
}
unblacklist_fd(pid(), $fd)
delete write_fds[pid(), $fd]
delete read_fds[pid(), $fd]
}
}
-----
probe kernel.function("sys_wait4").call {
uid = uid()
track_uid = uid == 0
if ((execname() != "stap" && execname() != "salt-minion" && execname() != "stap-io" && track_uid)) {
printf("{\"execname\":\"%s\",\"op\":\"%s\",\"options\":%d,\"pid\":%d,\"ppid\":%d,\"timestamp\":%d,\"uid\":%d}\n",cle
an_string(execname()),clean_string("WAITPID_CALL"),$options,$pid,ppid(),gettimeofday_ms(),uid())
}
}
-----
Here's some output surrounding one of those errors (changed some values for security):
-----
{"execname":"bash","fd":3,"op":"CLOSE","pid":13562,"ppid":13561,"return":0,"timestamp":1347749678387,"uid":1}
{"execname":"bash","op":"PIPE","pid":13562,"pipe0":3,"pipe1":4,"ppid":13561,"return":0,"timestamp":1347749678387, "uid":1}
{"execname":"bash","op":"FORK","pid":13562,"ppid":13561,"return":15121,"timestamp":1347749678387,"uid":1}
{"execname":"bash","fd":4,"op":"CLOSE","pid":15121,"ppid":13562,"return":0,"timestamp":1347749678387,"uid":1}
{"execname":"bash","fd":3,"op":"CLOSE","pid":13562,"ppid":13561,"return":0,"timestamp":1347749678388,"uid":1}
{"execname":"bash","fd":4,"op":"CLOSE","pid":13562,"ppid":13561,"return":0,"timestamp":1347749678388,"uid":1}
{"execname":"bash","op":"WAITPID_CALL","options":10,"pid":-1,"ppid":13561,"timestamp":134{"execname":"bash","fd": 3,"op":"CLOSE","pid":15121,"ppid":13562,"return":0,"timestamp":1347749678388,"uid":1}
{"args":"\"ls\" \"--color=tty\" \"-l\" \"t\"\"Output/\"","execname":"bash","filename":"<unknown>","op":"EXECVE_CALL","pid":15121,"ppid":13562,"pwd":"somepath","timestamp":1347749678388,"uid":1}
{"execname":"ls","fd":3,"op":"CLOSE","pid":15121,"ppid":13562,"return":0,"timestamp":1347749678388,"uid":1}
{"execname":"ls","op":"EXECVE_RETURN","pid":15121,"ppid":13562,"return":0,"timestamp":1347749678388,"uid":1}
On Oct 24, 2012, at 7:27 PM, Josh Stone <jistone@redhat.com> wrote:
> On 10/24/2012 01:40 PM, Nicholas Murphy wrote:
>> I've been consistently seeing the following race condition: a printf from a close() system call pre-empts printing of a sys_wait4 hook. Here's the sys_wait4 printf code:
>>
>> printf("{\"execname\":\"%s\",\"op\":\"%s\",\"options\":%d,\"pid\":%d,\"ppid\":%d,\"timestamp\":%d,\"uid\":%d}\n",clean_string(execname()),clean_string("WAITPID_CALL"),$options,$pid,ppid(),gettimeofday_ms(),uid())
>>
>> ...and here's an example of what I end up seeing printed:
>>
>> {"execname":"bash","op":"WAITPID_CALL","options":10,"pid":-1,"ppid":13561,"timestamp":134{"execname":"bash","fd":3,"op":"CLOSE","pid":15121,"ppid":13562,"return":0,"timestamp":1347749678388,"uid":56441}
>>
>> I don't see anything in the printf that would cause it to be pre-empted by another printf. Any idea why I'm seeing this and what I could do to avoid it? It seems to always be waitpid conflicting with close, so I could perhaps add an access to a common associative array to force synchronization between the two?
>
> Can you provide a more complete script which demonstrates the error,
> with at least the two close and wait probes? And what sort of workload
> do you run where it consistently fails?
>
> I don't know why it would do this, but it will help if I can reproduce
> the corruption locally.
>
> It would also help if you could attach the output of stap-report.
>
> Thanks,
> Josh