As noted in bug #10839 comment #3, using kmalloc for each individual kretprobe_instance wastes a lot of kernel memory. In my measurement, each 40-byte kmalloc request actually reserves 136 bytes. Having a large number of kretprobes with large maxactive values makes this waste add up quickly.
Some historical perspective: We originally allocated all the instances for a particular kretprobe in a single block. kmallocing the instances individually solved a problem we encountered when the kretprobe was unregistered while some instances were still active: we have to release the kretprobe object when the user unregisters it. One alternative is to treat the instances as a pool separate from their kretprobe -- in which case each instance would probably need a pointer to its pool as well as its rp pointer. (The rp pointer is promised by the API.) It might also be good to explore the (probably heretical) idea of allocating the instances only as needed (GFP_ATOMIC, since we can't block). I also worked on a prototype where a group of kretprobes shared a common pool -- the idea being that of all the functions you're probing, only a limited set would be active at any one time.
Created attachment 4344 [details] Be smarter in allocating kretprobe instances All along, kretprobes would allocate NR_CPUS kretprobe_instance structures. Use a more logical num_possible_cpus() instead. If this patch is acceptable, I'll post it upstream. Jim, Masami?
This patch should work; a simple test on various machines shows:

power5 with 8 processors: NR_CPUS = 128, num_online_cpus = 8, num_possible_cpus = 14
x86_64 with 8 processors: NR_CPUS = 255, num_online_cpus = 8, num_possible_cpus = 8
F12 KVM instance: NR_CPUS = 32, num_online_cpus = 2, num_possible_cpus = 16
Looks good; num_online_cpus is also plausible as we don't handle hotplugged cpus particularly well.
Well, I thought of num_online_cpus(), but since the per_cpu infrastructure prefers num_possible_cpus(), I used it.
(In reply to comment #2)
> Created an attachment (id=4344)
> Be smarter in allocating kretprobe instances
>
> All along, kretprobes would allocate NR_CPU kretprobe_instance structures.
> Use a more logical num_possible_cpus() instead.
>
> If this patch is acceptable, I'll post it upstream.
>
> Jim, Masami?

It looks good to me, thanks!
Subject: Re: kretprobes waste a lot of memory on kretprobe_instances

Quoting ananth at in dot ibm dot com <sourceware-bugzilla@sourceware.org>:
> Created an attachment (id=4344)
> --> (http://sourceware.org/bugzilla/attachment.cgi?id=4344&action=view)
> Be smarter in allocating kretprobe instances
>
> All along, kretprobes would allocate NR_CPU kretprobe_instance structures.
> Use a more logical num_possible_cpus() instead.
>
> If this patch is acceptable, I'll post it upstream.
>
> Jim, Masami?

Looks good to me. While you're in that code... the default maxactive for non-preemptable, non-SMP systems is (still) 1. That's always felt pretty uncomfortable. But that's a different discussion, and presumably a different patch.
http://lkml.org/lkml/2009/10/30/76
Patch now in -mm
Ananth, there were actually two aspects of this that I hoped to see fixed. One is the sheer number of kretprobe_instances, which you have addressed. The other is the waste due to the allocation method, where each instance is taking >3x the amount of memory it needs. Even with a more sane maxactive, this is still painful when you have thousands of kretprobes.
Josh, I am not sure I understand the 3x increase. If you don't allocate char data[0], I don't see why the allocation needs to be 3x the size. However, if the point is that one may not need to preallocate that much memory, that's intelligence we'd have to build in, and it's not straightforward -- especially since there is no panacea: some probed paths are very hot while others may never be hit. Did you have any ideas on that front?
(In reply to comment #11)
> Josh, I am not sure I understand the 3x increase. If you don't allocate char
> data[0], I don't see why the allocation needs to be 3x the size.

It's because kmalloc() pads the request up to the allocation pool's unit size.

> Did you have any ideas on that front?

One possibility may be to allocate all <maxactive> instances in one kmalloc, subdividing the returned space by hand.
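A minimal sketch of that single-allocation idea, in userspace C with malloc standing in for one kmalloc (the `ri_pool` and `instance` names are made up for illustration; a real kretprobe version would also need the unregister-while-active handling Jim describes in comment #1):

```c
#include <assert.h>
#include <stdlib.h>

struct instance {
    struct instance *next;   /* free-list link while unused */
    char payload[32];        /* stands in for saved regs/data */
};

struct ri_pool {             /* hypothetical pool structure */
    void *block;             /* the single allocation */
    struct instance *free;   /* free list threaded through it */
};

static int pool_init(struct ri_pool *p, unsigned int maxactive)
{
    p->block = malloc(maxactive * sizeof(struct instance));
    if (!p->block)
        return -1;
    p->free = NULL;
    /* Subdivide the block by hand and thread a free list. */
    for (unsigned int i = 0; i < maxactive; i++) {
        struct instance *ri = (struct instance *)p->block + i;
        ri->next = p->free;
        p->free = ri;
    }
    return 0;
}

static struct instance *pool_get(struct ri_pool *p)
{
    struct instance *ri = p->free;
    if (ri)
        p->free = ri->next;
    return ri;          /* NULL: all maxactive are in flight */
}

static void pool_put(struct ri_pool *p, struct instance *ri)
{
    ri->next = p->free;
    p->free = ri;
}
```

This pays the allocator's rounding overhead once per kretprobe instead of once per instance, at the cost of reintroducing the lifetime problem that motivated the per-instance kmallocs in the first place.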
Re the kmalloc unit size, it's a kernel slab allocator 'feature'; re bulk allocation, see Jim's comment above as to why that was done away with. For now, I think the best solution is to be prudent with MAXACTIVE. In most 'Enterprise' cases we don't have CONFIG_PREEMPT, and hence the 'wastage' isn't significant, given the general shift to 64-bit platforms.