As noted in bug #10839 comment #3, using kmalloc for each individual kretprobe_instance wastes a lot of kernel memory. In my measurement, each 40-byte kmalloc request actually reserves 136 bytes. Having a large number of kretprobes with large maxactive values makes this waste add up quickly.
Some historical perspective: We originally allocated all the instances for a particular kretprobe in a single block. kmallocing the instances individually solved a problem we encountered when the kretprobe was unregistered while some instances were still active: we have to release the kretprobe object when the user unregisters it. One alternative is to treat the instances as a pool separate from their kretprobe -- in which case each instance would probably need a pointer to its pool as well as its rp pointer. (The rp pointer is promised by the API.) It might also be good to explore the (probably heretical) idea of allocating the instances only as needed (GFP_ATOMIC, since we can't block). I also worked on a prototype where a group of kretprobes shared a common pool -- the idea being that of all the functions you're probing, only a limited set would be active at any one time.
Created attachment 4344 [details] Be smarter in allocating kretprobe instances All along, kretprobes would allocate NR_CPUS kretprobe_instance structures. Use a more logical num_possible_cpus() instead. If this patch is acceptable, I'll post it upstream. Jim, Masami?
This patch should work; a simple test on various machines shows:

power5 with 8 processors: NR_CPUS = 128, num_online_cpus = 8, num_possible_cpus = 14
x86_64 with 8 processors: NR_CPUS = 255, num_online_cpus = 8, num_possible_cpus = 8
F12 KVM instance: NR_CPUS = 32, num_online_cpus = 2, num_possible_cpus = 16
Looks good; num_online_cpus is also plausible as we don't handle hotplugged cpus particularly well.
Well, I thought of num_online_cpus(), but since the per_cpu infrastructure prefers num_possible_cpus(), I used it.
(In reply to comment #2)
> Created an attachment (id=4344)
> Be smarter in allocating kretprobe instances
>
> All along, kretprobes would allocate NR_CPU kretprobe_instance structures.
> Use a more logical num_possible_cpus() instead.
>
> If this patch is acceptable, I'll post it upstream.
>
> Jim, Masami?

It looks good to me, thanks!
Subject: Re: kretprobes waste a lot of memory on kretprobe_instances

Quoting ananth at in dot ibm dot com <sourceware-bugzilla@sourceware.org>:
> Created an attachment (id=4344)
> --> (http://sourceware.org/bugzilla/attachment.cgi?id=4344&action=view)
> Be smarter in allocating kretprobe instances
>
> All along, kretprobes would allocate NR_CPU kretprobe_instance structures.
> Use a more logical num_possible_cpus() instead.
>
> If this patch is acceptable, I'll post it upstream.
>
> Jim, Masami?

Looks good to me. While you're in that code... the default maxactive for non-preemptable, non-SMP systems is (still) 1. That's always felt pretty uncomfortable. But that's a different discussion, and presumably a different patch.
http://lkml.org/lkml/2009/10/30/76
Patch now in -mm
Ananth, there were actually two aspects of this that I hoped to see fixed. One is the sheer number of kretprobe_instances, which you have addressed. The other is the waste due to the allocation method, where each instance is taking >3x the amount of memory it needs. Even with a more sane maxactive, this is still painful when you have thousands of kretprobes.
Josh, I am not sure I understand the 3x increase. If you don't allocate char data[0], I don't see why the allocation needs to be 3x the size. However, if the point is that one may not need to preallocate that much memory, that's intelligence we'd have to build in, and it's not straightforward -- especially since there is no panacea: some probed paths are very hot while others may never be hit. Did you have any ideas on that front?
(In reply to comment #11)
> Josh, I am not sure I understand the 3x increase. If you don't allocate char
> data[0], I don't see why the allocation needs to be 3x the size.

It's because kmalloc() pads the request up to the allocation pool's unit size.

> Did you have any ideas on that front?

One possibility may be to allocate all <maxactive> instances in one kmalloc, subdividing the returned space by hand.
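A minimal sketch of that single-allocation idea, in userspace C with malloc standing in for one kmalloc (the `ri_pool` and `instance` names are made up for illustration; a real kretprobe version would also need the unregister-while-active handling Jim describes in comment #1):

```c
#include <assert.h>
#include <stdlib.h>

struct instance {
    struct instance *next;   /* free-list link while unused */
    char payload[32];        /* stands in for saved regs/data */
};

struct ri_pool {             /* hypothetical pool structure */
    void *block;             /* the single allocation */
    struct instance *free;   /* free list threaded through it */
};

static int pool_init(struct ri_pool *p, unsigned int maxactive)
{
    p->block = malloc(maxactive * sizeof(struct instance));
    if (!p->block)
        return -1;
    p->free = NULL;
    /* Subdivide the block by hand and thread a free list. */
    for (unsigned int i = 0; i < maxactive; i++) {
        struct instance *ri = (struct instance *)p->block + i;
        ri->next = p->free;
        p->free = ri;
    }
    return 0;
}

static struct instance *pool_get(struct ri_pool *p)
{
    struct instance *ri = p->free;
    if (ri)
        p->free = ri->next;
    return ri;          /* NULL: all maxactive are in flight */
}

static void pool_put(struct ri_pool *p, struct instance *ri)
{
    ri->next = p->free;
    p->free = ri;
}
```

This pays the allocator's rounding overhead once per kretprobe instead of once per instance, at the cost of reintroducing the lifetime problem that motivated the per-instance kmallocs in the first place.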
Re the kmalloc unit size, it's a kernel slab allocator 'feature'; re bulk allocation, see Jim's comment above as to why that was done away with. For now, I think the best solution is to be prudent with MAXACTIVE. In most 'Enterprise' cases we don't have CONFIG_PREEMPT, and hence the 'wastage' isn't significant, given the general shift to 64-bit platforms.