This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On 24 Jan 2015 22:01, Maarten Bosmans wrote:

> When the current malloc implementation uses mmap to directly fulfil an
> allocation request, it returns an address that is always aligned to a
> page boundary + 16 bytes. When multiple arrays allocated this way are
> accessed in an interleaved fashion, as in the following example code,
> performance is suboptimal due to cache conflicts.
>
>     #include <stdint.h>
>     #include <stdlib.h>
>
>     static void mmap_alignment_test(unsigned n_arr, size_t length) {
>         /* allocate [n_arr] arrays of [length] 16-bit integers */
>         int16_t *arr[n_arr];
>         for (unsigned a = 0; a < n_arr; a++) {
>             arr[a] = malloc(length * sizeof(int16_t));
>         }
>         /* fill the arrays, interleaving writes to each array */
>         for (size_t i = 0; i < length; i++) {
>             for (unsigned a = 0; a < n_arr; a++) {
>                 arr[a][i] = (int16_t)i;
>             }
>         }
>         /* release the arrays again */
>         for (unsigned a = 0; a < n_arr; a++) {
>             free(arr[a]);
>         }
>     }
>
> The performance impact can be seen in this graph[1], where the results
> are shown for executing this code with n_arr=1 to 20 and length=50000.
> By default glibc satisfies these small (100 kB) requests from its heap,
> but by setting MALLOC_MMAP_THRESHOLD_ to a suitably small value they
> can be forced to come directly from the mmap system call. You can see
> quite clearly that the code was run on a CPU with an 8-way associative
> cache, as eight arrays is the point where the similarly aligned
> mmapped arrays start conflicting.
>
> My proposal is to use the extra (unused) space that we get from mmap
> anyway (because it is page-aligned) to add an offset to the returned
> pointer. This would improve the performance of this example test case
> when the arrays are large enough to be mmapped directly.
>
> I would like to get some feedback on whether glibc developers think
> this is a worthwhile goal to pursue before I start working on a patch.

while i'm not against making programs work faster when possible, i'm
not sure your example here is a good one. it seems like you're
purposefully writing (imo) bad code that ignores the realities of cpu
caches. iow, if your program is at this level of optimization, maybe
your time would be better spent reading:
    http://www.akkadia.org/drepper/cpumemory.pdf

especially when you start talking about creating artificially bad
scenarios by turning down the MALLOC_MMAP_THRESHOLD_ knob. forcing
lots of allocations to come from direct mmaps will put pressure on the
system and can be even worse for performance than cache-hostile code
like you've shown here.

it might help your case if you had a real-world example that didn't
specifically do both of those things ...
-mike
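
For anyone trying to reproduce the setup described above: the
MALLOC_MMAP_THRESHOLD_ environment variable can also be lowered from
inside the program via glibc's documented mallopt() interface. This is
a minimal sketch; the threshold of 4096 bytes is an arbitrary
illustrative choice, and the printed offset is expected (not
guaranteed) to be 16 on 64-bit glibc, matching the "page boundary + 16
bytes" observation in the original mail.

    #include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc) */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* serve allocations above 4 KiB by direct mmap, equivalent to
           running the process with MALLOC_MMAP_THRESHOLD_=4096 */
        mallopt(M_MMAP_THRESHOLD, 4096);

        int16_t *p = malloc(50000 * sizeof(int16_t)); /* ~100 kB -> mmapped */
        if (p == NULL)
            return 1;
        /* show where the returned pointer sits within its page */
        printf("offset within page: %zu bytes\n",
               (size_t)((uintptr_t)p % 4096));
        free(p);
        return 0;
    }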
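
To make the 8-way observation concrete: if every array starts at the
same offset within a page, element i of each array maps to the same
cache set, and interleaving writes to more arrays than the cache has
ways evicts live lines on every pass. A small sketch of the set-index
arithmetic follows; the 32 KiB / 8-way / 64-byte-line geometry is an
assumed example, not something stated in the thread.

    #include <stdint.h>
    #include <stdio.h>

    /* assumed L1 geometry: 32 KiB, 8-way, 64-byte lines -> 64 sets */
    #define LINE_SIZE 64
    #define N_SETS    64

    static unsigned set_index(uintptr_t addr) {
        return (unsigned)((addr / LINE_SIZE) % N_SETS);
    }

    int main(void) {
        /* two hypothetical mmapped arrays, each 16 bytes past a page start */
        uintptr_t a = 0x10000 + 16, b = 0x30000 + 16;
        /* identical set indices: with more than 8 such arrays interleaved,
           each write evicts a line another array still needs */
        printf("set(a[0]) = %u, set(b[0]) = %u\n",
               set_index(a), set_index(b));
        return 0;
    }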
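
The proposal itself is never spelled out in code in this thread.
Purely as an illustration of the idea, a user-level wrapper along the
following lines could stagger otherwise identically aligned
allocations by a rotating cache-line offset; staggered_alloc and
staggered_free are hypothetical names, the rotation counter is not
thread-safe, and an actual malloc change would instead place the
offset inside the slack of the pages mmap already returns rather than
over-allocating.

    #include <stdlib.h>

    #define CACHE_LINE 64
    #define PAGE_SIZE  4096
    #define N_OFFSETS  (PAGE_SIZE / CACHE_LINE)  /* 64 rotating offsets */

    /* Hypothetical sketch, not glibc code: hand back pointers whose
       offset within a page varies from one allocation to the next. */
    static void *staggered_alloc(size_t size) {
        static unsigned slot = 0;  /* illustration only: not thread-safe */
        size_t off = (size_t)(slot++ % N_OFFSETS) * CACHE_LINE;
        /* over-allocate so any offset fits and the base pointer can be
           stashed just below the address handed back to the caller */
        unsigned char *base = malloc(size + PAGE_SIZE + CACHE_LINE);
        if (base == NULL)
            return NULL;
        unsigned char *user = base + CACHE_LINE + off; /* stays 16-byte aligned */
        ((void **)user)[-1] = base;  /* remembered for staggered_free() */
        return user;
    }

    static void staggered_free(void *p) {
        if (p != NULL)
            free(((void **)p)[-1]);
    }

With arrays allocated through such a wrapper, consecutive allocations
land in different cache sets, so an interleaved fill like the test
above no longer thrashes a single set.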