This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.
> If we allocate a whole L1 cache line for each single-step scratch area,
> as you suggest below, is this still a performance concern?

I expect that would address any SMP issue, and is certainly an obviously right thing to do. That is not what Will and I are really concerned about. I'm just talking about the hit to icache or whatever other internal processor hooey from rewriting the same spot, and executing a spot that was only just written (and so by definition never partially decoded into some part of the CPU), that sort of thing. The x86 doesn't require explicit icache flushes with big red warning labels on them every time you poke a location and then want to execute it, like most other processors do--but that doesn't mean it mightn't be costly to do so.

I seem to have lost the message I'm sure there was in the thread where you (I think it was you) asked about testing scenarios to compare the performance. That is what we should get right to, instead of just us pontificating about how it might be (I sure don't actually know anything about the chips' performance issues at this level).

There are some obvious torture tests that seem to me like they would demonstrate a bottleneck on executing just-modified code if there is one. For example, write a tight loop that you run for a whole lot of iterations so as to usefully time it. Insert several probes at instructions inside the loop, doing all the insertions just once at the beginning. The probes needn't do anything but return, just be there to cause the kprobes single-step machinery to work (and multiple probes to demonstrate the constant reuse of the copy slot). Then run the loop a lot, sampling the cycle counter before and after. Do this with the current code and with the new one that uses a single buffer (repeat each run a lot and average, etc).

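The timing half of that test (sample a counter before and after the loop, repeat, average) could be sketched roughly as below. This is a user-space illustration only, not kprobes code: it inserts no probes, and it assumes `clock_gettime(CLOCK_MONOTONIC)` as a portable stand-in for the cycle counter. The names `now_ns`, `tight_loop`, and `average_ns` are hypothetical, made up for this sketch.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. Assumption: used here as a portable
 * stand-in for the CPU cycle counter mentioned in the message. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical workload: the tight loop whose instructions would carry
 * the probes in the real kernel test. Returns the sum 0 + 1 + ... + (n-1).
 * The accumulator is volatile so the compiler keeps the loop. */
static uint64_t tight_loop(uint64_t iterations)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 0; i < iterations; i++)
        acc += i;
    return acc;
}

/* Run the workload `runs` times, timing each run separately, and return
 * the average elapsed nanoseconds per run. */
static double average_ns(uint64_t (*workload)(uint64_t),
                         uint64_t iterations, unsigned runs)
{
    assert(runs > 0);
    uint64_t total = 0;
    for (unsigned r = 0; r < runs; r++) {
        uint64_t start = now_ns();
        (void)workload(iterations);
        total += now_ns() - start;
    }
    return (double)total / runs;
}
```

A comparison would call something like `average_ns(tight_loop, 100000000, 20)` once per kernel configuration (current per-probe slots versus the single shared buffer) and compare the two averages. The loop would have to live in the kernel for probes to be placed on it, so this only shows the measurement pattern.
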
You might or might not want to correct for other differences like cache-alignment of the instruction copies (in your new plan, the one spot will be aligned, whereas in the current code most of the slots will wind up misaligned).

If overwriting a single copy location performs better, then great. If it performs just as well, then it's still preferable for its smaller kernel memory footprint. But if it turns out to perform less well, I think we should stick with the current scheme. (I honestly don't see any real problems in having allocation take place at probe insertion/removal time.)

Thanks,
Roland