This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

malloc - cache locality

On Tue, Dec 10, 2013 at 04:16:59PM +0100, Torvald Riegel wrote:
> On Tue, 2013-12-10 at 00:04 -0500, Carlos O'Donell wrote:
> > * Add new and required features.
> >   - Add support for "have but don't need" kernel pages via vrange.
> I think it would be useful if that would be driven or at least
> accompanied by a design discussion and documentation.  IOW, a wiki page
> as frequently suggested by Carlos :)
> Patches are fine and eventually what really matters, but there is a vast
> body of literature out there for example, and I would be surprised if
> many of us would really have a exhaustive overview of it.  Thus, an
> overview or collection of relevant existing work could be helpful
> content in this wiki page, even if just to have a central place where we
> can combine our knowledge about the state of the art.  For example, I
> know that the ISMM conference has had lots of allocator stuff over the
> years, but I wouldn't even know which of that is relevant for glibc's
> allocator (IOW, on the short list).
> Furthermore, the code we end up with in glibc is just one option we
> picked.  If well-documented, it will say what it does and why, but it
> won't document all the negative results that lead to us adopting this
> solution eventually.  But we need to document the negative results too,
> or we're prone to forgetting about them, and then we can't reconstruct
> our decision process anymore.  This is especially true given that we'll
> be dealing with a lot of performance-dependent trade-offs here, and
> those will change over time (eg, when hardware changes).
> To illustrate that, let me give a quick attempt at a table of contents:
> * Literature overview (eg, the short list). Links to relevant libc-alpha
> discussions.
> * The workload model that we use
>   * Allocation patterns (e.g., frequent allocation of small objects,
>     long-lived vs. short-lived objects, ...)
>   * Number of threads, allocation where in the threads, etc.
>   * Access patterns of the program to the memory (eg, lots locality in
>     when objects are allocated compared to when they are used, etc.).
>     IOW, how access patterns relate to the allocation patterns.

This is good list, to stay on topic rather than part of one
hundred-message tread we should try to stay on topic.

We should start with most important aspects first, which is a effect of
cache locality which probably impacts performance more than other
effects combined.

A quick summary is that if we want to do something about it we need to
change memory layout. This has a prerequisite of handling
malloc_set_state issue.

As starting point look at tcmalloc and jemalloc design documentation.

I did a search for 'allocator cache effects'. This found following three
articles from around 1990. These try to measure cache effect and compare
with DLmalloc. These are still relevant as dlmalloc algorithm did not
change much.

These have in common that they archieved speedup by getting basics rigth:

- Do not waste cache space of small objects with mostly useless malloc headers.

- The FIFO is worst order from caching perspective. If you have n chunks
but only n-1 pages then you will always miss. In any other order you
have at least one hit.

- Separating areas with small and large allocations. If you allocate
small object between two large ones it will likely cause fragmentation.
Converse is unlikely (gaps tend not be large enough) but if this happens
it could occupy cache that small requests may use better.
Also when I last checked oprofile consolidation took around 8% of time spend in malloc+free.

Now more questionable ones:

- Allocation should not span more cache lines than necessary. These may
increase fragmentation in some cases. Exact tradeoff depends
distribution how many objects are likely referenced. If there are few
often allocated objects then alignment is benefical. When all objects
have similar probability that they are referenced a packed
representation is better.

A tcmalloc somewhat solves fragmentation by having blocks where only allowed sizes are t
and 2^k - t. A t-allocations there are 2^k aligned and 2^k - t ones are shifted by t.

- Prefetching and invalidation. Here we need to make measurements how likely allocated chunks
still reside in L1 cache. For malloc it makes sense to prefetch start of
chunk. This should be done for chunk that is next to line, especially
when we save information there. For small chunks prefetch will likely
finish by next malloc unless chunk is in main memory.

We could use same logic in opposite direction, as with prefetching there
is little difference if small freed chunks are in L1 or L3 cache we
could use instruction to move cache lines that were entirely freed to L3
cache if there was such instruction.

Then there is question when cache invalidation pays off. Freed memory
needs to be in cache only when there is a reasonable chance that it will
be reused soon and eviction can cause freed cache line to unnecessarily
write its context to main memory. Do you have study on this?

These were uncontroversial points, now we need to split sizes to several classes.

a) Requests upto half of cache line. Here we want to maximize a number of chunks that are 
in same cache line as previous chunk. Using address-order implicitly archieves this goal.
Other possibility is divide these into say groups of 32, have bitmap for each group
and when allocating use a group until there is nothing left, then switch to another one.

b) Small requests - these have high turnaround so we need to handle these quickly.
There is also problem that its easy to measure and when this is too slow users start
wrapping these in memory pools.

c) medium requests - for these fragmentation starts to become problem, as I wrote earlier a 
exact best-fit could be found very quickly with appropriate data
structure, question is if it is indeed best as there could be a 1% larger one which is entirely
in cache while best is in main memory. So far a best course looks to
split to chunks to generations, which also could be criterium for cache invalidation

d) large requests - After certain point its unlikely that a memory will
fit into core cache so trying to be thread-local does not make much
sense. It depends if we want to use a shared area for requests of such
sizes or not.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]