Summary: | c->busy can be non-atomic. | | |
---|---|---|---|
Product: | systemtap | Reporter: | Masami Hiramatsu <mhiramat> |
Component: | translator | Assignee: | Unassigned <systemtap> |
Status: | RESOLVED WONTFIX | | |
Severity: | normal | CC: | fche |
Priority: | P2 | | |
Version: | unspecified | | |
Target Milestone: | --- | | |
Host: | | Target: | |
Build: | | Last reconfirmed: | |
Attachments: | [PATCH] make busy nonatomic | | |
| [PATCH] make busy nonatomic with mb/wmb | | |
Description
Masami Hiramatsu
2008-10-01 20:38:00 UTC
> We can put in a synchronize_sched() before or after or even into the
> busy-testing loop. I see no reason to *remove* the busy-testing loop though.

Hmm, right. I just think we can remove atomic_inc/dec(c->busy) from
kprobe-based handlers (and other handlers which don't cause context
switching).

I would prefer to keep c->busy, since it also functions as a
reentrancy-prevention mechanism, not just a shutdown-synchronization one.

(In reply to comment #3)
> I would prefer to keep c->busy, since it also functions as a
> reentrancy-prevention mechanism, not just a shutdown-synchronization
> one.

Oops, indeed. BTW, kprobes itself has a reentrancy-checking routine;
however, other probes (markers/timers/etc.) don't have that (and they
also don't exclude each other). So this should be deferred until
reentrant probing is supported...

(In reply to comment #3)
> I would prefer to keep c->busy, since it also functions as a
> reentrancy-prevention mechanism, not just a shutdown-synchronization
> one.

By the way, for reentrancy prevention we don't need an atomic operation,
because the reentrance always occurs on the same processor...

> By the way, for reentrancy prevention we don't need an atomic operation,
> because the reentrance always occurs on the same processor...

Yes, but a nonatomic increment would not be protected against
interrupts/preemption.
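A minimal sketch of the shutdown-synchronization role of c->busy discussed
above, in 2008-era kernel C; the struct layout and names (contexts,
shutdown_wait) are illustrative assumptions, not the actual systemtap
runtime identifiers:
---
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <asm/atomic.h>

struct context {
	atomic_t busy;			/* claimed while a handler runs */
	/* ... per-probe scratch state (c->fields) ... */
};

static DEFINE_PER_CPU(struct context, contexts);

static void shutdown_wait(void)
{
	int cpu;

	/* The busy-testing loop: wait for every per-CPU context to drain. */
	for_each_possible_cpu(cpu) {
		struct context *c = &per_cpu(contexts, cpu);

		while (atomic_read(&c->busy))
			cpu_relax();
	}

	/* Let any handler that raced past the check above finish. */
	synchronize_sched();
}
---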
hmm, if we use a per-cpu variable for the nesting flag, I think interrupts
don't need to check it atomically (as far as I know, preemption is disabled
there...). The operation can be separated into 4 phases:
---
1. read flag
2. write flag=1 if (flag == 0)
3. do some work
4. write flag=0
---
If an interrupt occurs between 2 and 4:
---
1. read flag
2. write flag=1 if (flag == 0)
     >> 1. read flag
        2. return due to flag == 1 <<
3. do some work
     >> 1. read flag
        2. return due to flag == 1 <<
4. write flag=0
---
In this case, the flag is operated on correctly.

If an interrupt occurs between 1 and 2:
---
1. read flag
     >> 1. read flag
        2. write flag=1 if (flag == 0)
        3. do some work
        4. write flag=0 <<
2. write flag=1 if (flag == 0)
3. do some work
4. write flag=0
---
Even in this case, the flag is operated on correctly, because this case is
the same as if the interrupt had happened right before 1.

Hmm, however, oprofile showed this did not improve performance much on
ia64...

0.7.2 without this change:
---
000000000000cc20 9672 1.0968 stap_780d862f8ed31f54314ad730113ad90d_401.ko stap_780d862f8ed31f54314ad730113ad90d_401 enter_kprobe_probe
  000000000000cc20 589 6.0897
  000000000000cc40 48 0.4963
  000000000000cc58 36 0.3722
  000000000000cc60 4 0.0414
  000000000000cc64 38 0.3929
  000000000000cc70 272 2.8122
  000000000000cc80 176 1.8197
  000000000000cc84 217 2.2436
  000000000000cc90 39 0.4032
  000000000000cc98 47 0.4859
  000000000000cca4 40 0.4136
  000000000000ccb0 39 0.4032
  000000000000ccb8 60 0.6203
  000000000000ccc4 35 0.3619
  000000000000ccd0 38 0.3929
  000000000000d174 7869 81.3586
  000000000000d178 49 0.5066
  000000000000d180 43 0.4446
  000000000000d188 33 0.3412
---
git, with this change:
---
000000000000ce80 8448 0.9590 stap_9654f52ad4d73402acf5522fa20ffae3_401.ko stap_9654f52ad4d73402acf5522fa20ffae3_401 enter_kprobe_probe
  000000000000ce80 540 6.3920
  000000000000cea0 39 0.4616
  000000000000ceb8 40 0.4735
  000000000000cec4 48 0.5682
  000000000000ced0 230 2.7225
  000000000000cee0 161 1.9058
  000000000000cee4 200 2.3674
  000000000000cef0 37 0.4380
  000000000000cef8 44 0.5208
  000000000000cf04 34 0.4025
  000000000000cf10 31 0.3670
  000000000000cf18 42 0.4972
  000000000000cf24 36 0.4261
  000000000000cf30 43 0.5090
  000000000000d3d4 6801 80.5043
  000000000000d3d8 53 0.6274
  000000000000d3e0 33 0.3906
  000000000000d3e8 35 0.4143
  000000000000db20 1 0.0118
---
So, this issue is not so important...

Created attachment 3782 [details]
[PATCH] make busy nonatomic
This patch changes c->busy to nonatomic, according to the logic above.
Previously, I could not find a measurable difference because the machine I
used was not large enough. Maybe, if we test it on a large enough SMP
machine, it will show a difference.
Please test it.
Thanks,
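To make the per-CPU interleaving argument above concrete, the non-atomic
variant might look roughly like the sketch below; the layout and names are
assumptions for illustration, not the code in attachment 3782:
---
#include <linux/percpu.h>

struct context {
	int busy;		/* plain int: the non-atomic variant */
	/* ... c->fields ... */
};

static DEFINE_PER_CPU(struct context, contexts);

static void enter_probe(void)
{
	/* Preemption is disabled in probe handlers, so the only racer is
	 * a local interrupt, and plain loads/stores suffice per-CPU. */
	struct context *c = &__get_cpu_var(contexts);

	if (c->busy)		/* phase 1: read flag */
		return;		/* reentered from an interrupt: bail out */
	c->busy = 1;		/* phase 2: write flag = 1 */

	/* phase 3: run the probe body, touching c->fields */

	c->busy = 0;		/* phase 4: write flag = 0 */
}
---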
(In reply to comment #9)
> Created an attachment (id=3782)
> [PATCH] make busy nonatomic

It should be fine to make this non-atomic, but I think it needs barriers to
preserve memory order. Otherwise the compiler/architecture could reorder the
writes into the context pointer. It might turn into:

1. write some c->fields
     >>> interrupt with a probe -> step #1 is overwritten
2. write c->busy = 1
3. write more c->fields, run the probe, etc.

(In reply to comment #10)
> It should be fine to make this non-atomic, but I think it needs barriers to
> preserve memory order. Otherwise the compiler/architecture could reorder
> the writes into the context pointer.

Oops, right. Thank you for pointing it out! I'll update it.

Created attachment 3784 [details]
[PATCH] make busy nonatomic with mb/wmb
Here, I added wmb() and mb() to protect the updates to the context members.
Thank you,
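A sketch of the barrier placement this describes (illustrative, not a copy
of attachment 3784), again assuming a plain-int busy field:
---
struct context {
	int busy;
	/* ... c->fields ... */
};

static void enter_probe_mb(struct context *c)
{
	if (c->busy)
		return;
	c->busy = 1;
	wmb();		/* flag is visibly set before any c->fields write */

	/* ... write c->fields and run the probe body ... */

	mb();		/* all c->fields writes complete before release */
	c->busy = 0;
}
---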
(In reply to comment #12)
> Created an attachment (id=3784)
> [PATCH] make busy nonatomic with mb/wmb

Thanks, looks good to me.

If we can't find any measurement that shows a meaningful improvement due to
this patch, I'd rather not take the risk that we're forgetting some barrier
or other, and leave the code as is.

(In reply to comment #14)
> If we can't find any measurement that shows a meaningful improvement
> due to this patch, I'd rather not take the risk that we're forgetting
> some barrier or other, and leave the code as is.

I don't think that atomic operations will guarantee memory order either, so
we're probably already missing barriers...

> I don't think that atomic operations will guarantee memory order either, so
> we're probably already missing barriers...
For the "busy" flag, I don't think ordering issues arise if we continue to
use the atomic.h API. Concurrent reads/writes are SMP-synchronized - that's
the whole point. Note that we only write c->fields if the atomic_inc_return
returned the proper value.
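For reference, that atomic scheme looks roughly like the sketch below; the
smp_mb__before_atomic_dec() on the release side is the fix suggested later
in this thread, and the layout is an assumption for illustration:
---
struct context {
	atomic_t busy;
	/* ... c->fields ... */
};

static void enter_probe_atomic(struct context *c)
{
	/* atomic_inc_return() acts as a full memory barrier, so the claim
	 * is ordered before any c->fields write. */
	if (atomic_inc_return(&c->busy) != 1) {
		atomic_dec(&c->busy);	/* already claimed: back off */
		return;
	}

	/* ... write c->fields and run the probe body ... */

	smp_mb__before_atomic_dec();	/* order c->fields before release */
	atomic_dec(&c->busy);
}
---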
(In reply to comment #14)
> If we can't find any measurement that shows a meaningful improvement
> due to this patch, I'd rather not take the risk that we're forgetting
> some barrier or other, and leave the code as is.

Sure, I agree. I hope someone who has a bigger SMP machine tests this patch
and finds meaningful results. BTW, I think we can check the busy flags
safely with smp_call_function.

My apologies if I'm over-thinking this, but I really do think we have a
subtle bug here...

(In reply to comment #16)
> For the "busy" flag, I don't think ordering issues arise if we continue to
> use the atomic.h API. Concurrent reads/writes are SMP-synchronized - that's
> the whole point.

Yes, but only regarding that particular variable. It says nothing about the
ordering wrt neighboring reads/writes.

> Note that we only write c->fields if the atomic_inc_return
> returned the proper value.

Ah, ok, atomic_ops.txt does document that atomic operations which return a
value will act as a full memory barrier both before and after. The
atomic_dec()s may still be an issue though, which we could fix either by
making them atomic_dec_return(), or by inserting an
smp_mb__before_atomic_dec().

(In reply to comment #17)
> Sure, I agree. I hope someone who has a bigger SMP machine tests this patch
> and finds meaningful results.

Unfortunately, I don't have any such machines. But anyway, you could also
consider using local_t, which keeps the same barrier semantics as atomic_t
without the cross-CPU bus locks.

> BTW, I think we can check the busy flags safely with smp_call_function.

It may be safe, but why would you want to do this? I see no advantage over
the simple loop that we have now...

(In reply to comment #18)
> The atomic_dec()s may still be an issue though, which we could fix either
> by making them atomic_dec_return(), or by inserting an
> smp_mb__before_atomic_dec().

Hmm, actually I guess there aren't any context writes in the probe exit path
that would be hurt by the decrements being reordered and an interrupt
jumping in. So maybe this is ok as-is.

Another possibility is to choose something like c->probe_point to protect
against reentrancy instead, and update that field with compare-exchange.

(In reply to comment #20)
> Another possibility is to choose something like c->probe_point to protect
> against reentrancy instead, and update that field with compare-exchange.

Only if no probe can possibly reenter itself.

(In reply to comment #21)
> Only if no probe can possibly reenter itself.

Why is that? The probe would just cmpxchg against 0 -- if it returns
anything non-0, then somebody is already in there. This works even if that
somebody is the same probe.

> Why is that? The probe would just cmpxchg against 0
Ah yes, you're right.
Probably not worth it; we can reevaluate if performance differences are
observed.
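For concreteness, the compare-exchange scheme from the last few comments
might look like this minimal sketch; it assumes a c->probe_point pointer
field that is NULL when the context is free (illustrative only, never
committed):
---
struct context {
	const char *probe_point;	/* NULL <=> slot is free */
	/* ... c->fields ... */
};

static int try_enter_probe(struct context *c, const char *pp)
{
	/* cmpxchg() acts as a full barrier and fails for any non-NULL
	 * holder, including the same probe reentering itself. */
	if (cmpxchg(&c->probe_point, NULL, pp) != NULL)
		return 0;	/* somebody is already in there: skip */

	/* ... run the probe body ... */

	c->probe_point = NULL;	/* release (a barrier may be wanted here) */
	return 1;
}
---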