Summary: | c->busy can be non-atomic. | | |
---|---|---|---|
Product: | systemtap | Reporter: | Masami Hiramatsu <mhiramat> |
Component: | translator | Assignee: | Unassigned <systemtap> |
Status: | RESOLVED WONTFIX | | |
Severity: | normal | CC: | fche |
Priority: | P2 | | |
Version: | unspecified | | |
Target Milestone: | --- | | |
Host: | | Target: | |
Build: | | Last reconfirmed: | |
Attachments: | [PATCH] make busy nonatomic | | |
| [PATCH] make busy nonatomic with mb/wmb | | |
Description
Masami Hiramatsu
2008-10-01 20:38:00 UTC
> We can put in a synchronize_sched() before or after or even into the
> busy-testing loop. I see no reason to *remove* the busy-testing loop though.

Hmm, right. I just think we can remove atomic_inc/dec(c->busy) from
kprobe-based handlers (and other handlers which don't cause context
switching).

I would prefer to keep c->busy, since it also functions as a
reentrancy-prevention mechanism, not just a shutdown-synchronization one.

(In reply to comment #3)
> I would prefer to keep c->busy, since it also functions as a
> reentrancy-prevention mechanism, not just a shutdown-synchronization
> one.

Oops, indeed. BTW, kprobes itself has a reentrancy-checking routine;
however, other probes (markers/timers/etc.) don't have that (and they
also don't exclude each other). So this should be deferred until
reentrant probing is supported...

(In reply to comment #3)
> I would prefer to keep c->busy, since it also functions as a
> reentrancy-prevention mechanism, not just a shutdown-synchronization
> one.

By the way, for reentrancy prevention we don't need an atomic operation,
because the reentrance always occurs on the same processor...

> By the way, for reentrancy prevention we don't need an atomic operation,
> because the reentrance always occurs on the same processor...

Yes, but a nonatomic increment would not be protected against
interrupts/preemption.
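A minimal sketch of the shutdown-synchronization role of c->busy discussed
above, in 2008-era kernel C; the struct layout and names (contexts,
shutdown_wait) are illustrative assumptions, not the actual systemtap
runtime identifiers:
---
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <asm/atomic.h>

struct context {
	atomic_t busy;			/* claimed while a handler runs */
	/* ... per-probe scratch state (c->fields) ... */
};

static DEFINE_PER_CPU(struct context, contexts);

static void shutdown_wait(void)
{
	int cpu;

	/* The busy-testing loop: wait for every per-CPU context to drain. */
	for_each_possible_cpu(cpu) {
		struct context *c = &per_cpu(contexts, cpu);

		while (atomic_read(&c->busy))
			cpu_relax();
	}

	/* Let any handler that raced past the check above finish. */
	synchronize_sched();
}
---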
hmm, if we use a per-cpu variable for the nesting flag, I think interrupts
don't need to check it atomically (as far as I know, preemption is disabled
there...). The operation can be separated into 4 phases:
---
1. read flag
2. write flag=1 if (flag == 0)
3. do some work
4. write flag=0
---
If an interrupt occurs between 2 and 4:
---
1. read flag
2. write flag=1 if (flag == 0)
     >> 1. read flag
        2. return due to flag == 1 <<
3. do some work
     >> 1. read flag
        2. return due to flag == 1 <<
4. write flag=0
---
In this case, the flag is operated on correctly.

If an interrupt occurs between 1 and 2:
---
1. read flag
     >> 1. read flag
        2. write flag=1 if (flag == 0)
        3. do some work
        4. write flag=0 <<
2. write flag=1 if (flag == 0)
3. do some work
4. write flag=0
---
Even in this case, the flag is operated on correctly, because this case is
the same as if the interrupt had happened right before 1.

Hmm, however, oprofile showed this did not improve performance much on
ia64...

0.7.2 without this change:
---
000000000000cc20 9672 1.0968 stap_780d862f8ed31f54314ad730113ad90d_401.ko stap_780d862f8ed31f54314ad730113ad90d_401 enter_kprobe_probe
  000000000000cc20 589 6.0897
  000000000000cc40 48 0.4963
  000000000000cc58 36 0.3722
  000000000000cc60 4 0.0414
  000000000000cc64 38 0.3929
  000000000000cc70 272 2.8122
  000000000000cc80 176 1.8197
  000000000000cc84 217 2.2436
  000000000000cc90 39 0.4032
  000000000000cc98 47 0.4859
  000000000000cca4 40 0.4136
  000000000000ccb0 39 0.4032
  000000000000ccb8 60 0.6203
  000000000000ccc4 35 0.3619
  000000000000ccd0 38 0.3929
  000000000000d174 7869 81.3586
  000000000000d178 49 0.5066
  000000000000d180 43 0.4446
  000000000000d188 33 0.3412
---
git, with this change:
---
000000000000ce80 8448 0.9590 stap_9654f52ad4d73402acf5522fa20ffae3_401.ko stap_9654f52ad4d73402acf5522fa20ffae3_401 enter_kprobe_probe
  000000000000ce80 540 6.3920
  000000000000cea0 39 0.4616
  000000000000ceb8 40 0.4735
  000000000000cec4 48 0.5682
  000000000000ced0 230 2.7225
  000000000000cee0 161 1.9058
  000000000000cee4 200 2.3674
  000000000000cef0 37 0.4380
  000000000000cef8 44 0.5208
  000000000000cf04 34 0.4025
  000000000000cf10 31 0.3670
  000000000000cf18 42 0.4972
  000000000000cf24 36 0.4261
  000000000000cf30 43 0.5090
  000000000000d3d4 6801 80.5043
  000000000000d3d8 53 0.6274
  000000000000d3e0 33 0.3906
  000000000000d3e8 35 0.4143
  000000000000db20 1 0.0118
---
So, this issue is not so important...

Created attachment 3782 [details]
[PATCH] make busy nonatomic
This patch changes c->busy to nonatomic, according to the logic above.
Previously, I could not find a measurable difference because the machine I
used was not large enough. Maybe, if we test it on a large enough SMP
machine, it will show a difference.
Please test it.
Thanks,
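To make the per-CPU interleaving argument above concrete, the non-atomic
variant might look roughly like the sketch below; the layout and names are
assumptions for illustration, not the code in attachment 3782:
---
#include <linux/percpu.h>

struct context {
	int busy;		/* plain int: the non-atomic variant */
	/* ... c->fields ... */
};

static DEFINE_PER_CPU(struct context, contexts);

static void enter_probe(void)
{
	/* Preemption is disabled in probe handlers, so the only racer is
	 * a local interrupt, and plain loads/stores suffice per-CPU. */
	struct context *c = &__get_cpu_var(contexts);

	if (c->busy)		/* phase 1: read flag */
		return;		/* reentered from an interrupt: bail out */
	c->busy = 1;		/* phase 2: write flag = 1 */

	/* phase 3: run the probe body, touching c->fields */

	c->busy = 0;		/* phase 4: write flag = 0 */
}
---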
(In reply to comment #9)
> Created an attachment (id=3782)
> [PATCH] make busy nonatomic

It should be fine to make this non-atomic, but I think it needs barriers to
preserve memory order. Otherwise the compiler/architecture could reorder the
writes into the context pointer. It might turn into:

1. write some c->fields
     >>> interrupt with a probe -> step #1 is overwritten
2. write c->busy = 1
3. write more c->fields, run the probe, etc.

(In reply to comment #10)
> It should be fine to make this non-atomic, but I think it needs barriers to
> preserve memory order. Otherwise the compiler/architecture could reorder
> the writes into the context pointer.

Oops, right. Thank you for pointing it out! I'll update it.

Created attachment 3784 [details]
[PATCH] make busy nonatomic with mb/wmb
Here, I added wmb() and mb() to protect the updates to the context members.
Thank you,
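A sketch of the barrier placement this describes (illustrative, not a copy
of attachment 3784), again assuming a plain-int busy field:
---
struct context {
	int busy;
	/* ... c->fields ... */
};

static void enter_probe_mb(struct context *c)
{
	if (c->busy)
		return;
	c->busy = 1;
	wmb();		/* flag is visibly set before any c->fields write */

	/* ... write c->fields and run the probe body ... */

	mb();		/* all c->fields writes complete before release */
	c->busy = 0;
}
---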
(In reply to comment #12)
> Created an attachment (id=3784)
> [PATCH] make busy nonatomic with mb/wmb

Thanks, looks good to me.

If we can't find any measurement that shows a meaningful improvement due to
this patch, I'd rather not take the risk that we're forgetting some barrier
or other, and leave the code as is.

(In reply to comment #14)
> If we can't find any measurement that shows a meaningful improvement
> due to this patch, I'd rather not take the risk that we're forgetting
> some barrier or other, and leave the code as is.

I don't think that atomic operations will guarantee memory order either, so
we're probably already missing barriers...

> I don't think that atomic operations will guarantee memory order either, so
> we're probably already missing barriers...
For the "busy" flag, I don't think ordering issues arise if we continue to
use the atomic.h API. Concurrent reads/writes are SMP-synchronized - that's
the whole point. Note that we only write c->fields if the atomic_inc_return
returned the proper value.
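For reference, that atomic scheme looks roughly like the sketch below; the
smp_mb__before_atomic_dec() on the release side is the fix suggested later
in this thread, and the layout is an assumption for illustration:
---
struct context {
	atomic_t busy;
	/* ... c->fields ... */
};

static void enter_probe_atomic(struct context *c)
{
	/* atomic_inc_return() acts as a full memory barrier, so the claim
	 * is ordered before any c->fields write. */
	if (atomic_inc_return(&c->busy) != 1) {
		atomic_dec(&c->busy);	/* already claimed: back off */
		return;
	}

	/* ... write c->fields and run the probe body ... */

	smp_mb__before_atomic_dec();	/* order c->fields before release */
	atomic_dec(&c->busy);
}
---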
(In reply to comment #14)
> If we can't find any measurement that shows a meaningful improvement
> due to this patch, I'd rather not take the risk that we're forgetting
> some barrier or other, and leave the code as is.

Sure, I agree. I hope someone who has a bigger SMP machine tests this patch
and finds meaningful results. BTW, I think we can check the busy flags
safely with smp_call_function.

My apologies if I'm over-thinking this, but I really do think we have a
subtle bug here...

(In reply to comment #16)
> For the "busy" flag, I don't think ordering issues arise if we continue to
> use the atomic.h API. Concurrent reads/writes are SMP-synchronized - that's
> the whole point.

Yes, but only regarding that particular variable. It says nothing about the
ordering wrt neighboring reads/writes.

> Note that we only write c->fields if the atomic_inc_return
> returned the proper value.

Ah, ok, atomic_ops.txt does document that atomic operations which return a
value will act as a full memory barrier both before and after. The
atomic_dec()s may still be an issue though, which we could fix either by
making them atomic_dec_return(), or by inserting an
smp_mb__before_atomic_dec().

(In reply to comment #17)
> Sure, I agree. I hope someone who has a bigger SMP machine tests this patch
> and finds meaningful results.

Unfortunately, I don't have any such machines. But anyway, you could also
consider using local_t, which keeps the same barrier semantics as atomic_t
without the cross-CPU bus locks.

> BTW, I think we can check the busy flags safely with smp_call_function.

It may be safe, but why would you want to do this? I see no advantage over
the simple loop that we have now...

(In reply to comment #18)
> The atomic_dec()s may still be an issue though, which we could fix either
> by making them atomic_dec_return(), or by inserting an
> smp_mb__before_atomic_dec().

Hmm, actually I guess there aren't any context writes in the probe exit path
that would be hurt by the decrements being reordered and an interrupt
jumping in. So maybe this is ok as-is.

Another possibility is to choose something like c->probe_point to protect
against reentrancy instead, and update that field with compare-exchange.

(In reply to comment #20)
> Another possibility is to choose something like c->probe_point to protect
> against reentrancy instead, and update that field with compare-exchange.

Only if no probe can possibly reenter itself.

(In reply to comment #21)
> Only if no probe can possibly reenter itself.

Why is that? The probe would just cmpxchg against 0 -- if it returns
anything non-0, then somebody is already in there. This works even if that
somebody is the same probe.

> Why is that? The probe would just cmpxchg against 0
Ah yes, you're right.
Probably not worth it; we can reevaluate if performance differences are
observed.
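For concreteness, the compare-exchange scheme from the last few comments
might look like this minimal sketch; it assumes a c->probe_point pointer
field that is NULL when the context is free (illustrative only, never
committed):
---
struct context {
	const char *probe_point;	/* NULL <=> slot is free */
	/* ... c->fields ... */
};

static int try_enter_probe(struct context *c, const char *pp)
{
	/* cmpxchg() acts as a full barrier and fails for any non-NULL
	 * holder, including the same probe reentering itself. */
	if (cmpxchg(&c->probe_point, NULL, pp) != NULL)
		return 0;	/* somebody is already in there: skip */

	/* ... run the probe body ... */

	c->probe_point = NULL;	/* release (a barrier may be wanted here) */
	return 1;
}
---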