This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Possible inline malloc alternatives: bitmap


On Fri, Mar 02, 2018 at 12:21:27PM -0800, Paul Eggert wrote:
> On 03/02/2018 08:27 AM, Carlos O'Donell wrote:
> >lifetime information is not at all
> >available in the current API, and we need novel research into ways to mitigate
> >this issue
> 
> We do need research here. And it's not just lifetime info; it's also
> cache pollution. I looked for relevant recent work and found
> NightWatch:
> 
> Liao X, Guo R, Jin H. Enhancing the Malloc System with Pollution
> Awareness for Better Cache Performance. IEEE Trans Parallel
> Distributed Sys. 2017-03-01. 28(3):731-45.
> http://dx.doi.org/10.1109/TPDS.2016.2587644
> 
> NightWatch doesn't do what you want (it focuses on cache-aware
> malloc). Also, it requires Linux kernel mods. Still, it may be
> useful as a source of ideas.

This is paywalled, so I didn't read it. The abstract is too vague; it
doesn't say what exactly they do, and the claimed speedup looks like a
marketing trick, as it is too large considering that it only applies
when memory is managed with many small allocations.

One problem is that it lumps together several different issues. For an
allocator, these are the issues I know of and how I plan to handle them:

1. Cache line sharing. Multiple small chunks could fit into the same
cache line, and two longer chunks could also share a cache line at
their borders.

First, eliminating border sharing is relatively easy: it suffices that
everything in the main heap has an alignment and capacity that are
multiples of the cache line size. For small sizes, use pools where all
chunks have the same capacity; this keeps border sharing limited even
when the capacity is, for example, 1 1/2 cache lines.
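
A minimal sketch of the rounding, assuming a 64-byte cache line (the
macro and function names are mine, not a proposed interface):

#include <stddef.h>

#define CACHE_LINE 64

/* Round a request up to a multiple of the cache line, so a chunk's
   end always falls on a line border and neighbours never share.  */
static inline size_t
aligned_capacity (size_t request)
{
  return (request + CACHE_LINE - 1) & ~(size_t) (CACHE_LINE - 1);
}

With this, a 100-byte request gets a 128-byte chunk: 28 bytes are
wasted at the tail, but no other chunk can sit in its last line.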

Sub-cache-line requests lead to two problems:

1a. Bad cache utilization. You want to avoid the situation where only a
small part of a cache line is frequently accessed (hot) while the rest
is rarely accessed (cold). You want hot data placed with hot data and
cold with cold.

The low-hanging fruit here is that we currently use a terrible data
structure that fills half of the cache line with cold metadata about
the chunk.
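
A sketch of the alternative this thread is about: keep the metadata out
of line as a per-page bitmap, so chunk cache lines hold only user data.
The layout and names below are illustrative, assuming pages carved into
64 equal chunks:

#include <stddef.h>
#include <stdint.h>

/* One word of metadata per page of 64 chunks; the cold bookkeeping
   lives here instead of in a header next to the user's data.  */
struct page_meta
{
  uint64_t used;   /* bit i set => chunk i is allocated  */
};

static void *
alloc_from_page (char *page, struct page_meta *m, size_t chunk_size)
{
  uint64_t free_mask = ~m->used;
  if (free_mask == 0)
    return NULL;                        /* page is full  */
  int i = __builtin_ctzll (free_mask);  /* lowest free chunk  */
  m->used |= (uint64_t) 1 << i;
  return page + (size_t) i * chunk_size;
}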

Another useful distinction here is whether data is short-lived or
long-lived. For long-lived data, complex profiling is needed to find
out whether it is hot or cold. Short-lived data is hot by definition:
it gets allocated, written to, and freed before it would be evicted
from the cache. It is easy to check whether data is short-lived with
the basic profiling described below.

A bigger advantage of knowing what is short-lived lies in space
management, as long-lived small chunks pinning down a page are one of
the main problems. When all chunks on a page are short-lived, it
suffices to stop allocating from that page for a while and you will end
up with an empty page. A longer explanation of how I plan to profile is
in [1].
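
A sketch of how such a page empties out, assuming a per-page live
counter; release_page is a hypothetical stand-in for handing the
memory back:

struct page_state
{
  unsigned live;     /* chunks currently allocated on this page  */
  int quarantined;   /* nonzero: take no new allocations from here  */
};

/* Hypothetical: hand the page back, e.g. via madvise.  */
static void
release_page (struct page_state *m)
{
  (void) m;
}

static void
free_on_page (struct page_state *m)
{
  /* With all chunks short-lived, quarantining the page lets the
     counter drain to zero and the whole page comes back for free.  */
  if (--m->live == 0 && m->quarantined)
    release_page (m);
}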

1b. Cross-talk (false sharing), which happens when two cores write to
different chunks that happen to be on the same cache line. The cores
must synchronize cache state, which degrades performance.

A per-thread cache, with chunks returned to their original thread on
free, is the best we can do without profiling. It looks like NightWatch
does that, and when a non-allocating thread writes to a chunk, it gives
that chunk a cache line of its own.
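
A sketch of returning chunks to the owning thread without locks, using
a C11 atomic push onto the owner's remote list; the structure names
are mine:

#include <stdatomic.h>
#include <stddef.h>

struct chunk { struct chunk *next; };

struct thread_cache
{
  struct chunk *local;              /* touched only by the owner  */
  _Atomic (struct chunk *) remote;  /* pushed to by other threads  */
};

/* Called by a foreign thread: push the chunk onto the owner's
   remote list with a lock-free CAS loop.  */
static void
remote_free (struct thread_cache *owner, struct chunk *c)
{
  struct chunk *head = atomic_load (&owner->remote);
  do
    c->next = head;
  while (!atomic_compare_exchange_weak (&owner->remote, &head, c));
}

/* Called by the owner: grab the whole remote list at once and
   splice it into the local free list.  */
static void
drain_remote (struct thread_cache *tc)
{
  struct chunk *c = atomic_exchange (&tc->remote, NULL);
  while (c != NULL)
    {
      struct chunk *next = c->next;
      c->next = tc->local;
      tc->local = c;
      c = next;
    }
}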

2. Cache layout of multi-cache-line chunks.
As there is no border sharing, the only two relevant factors are:
a. when the data in a chunk will first be accessed;
b. where in the cache hierarchy the chunk's cache lines are when free
is called.

Of these, b is more important, as for a you could probably assume that
the data will be written shortly after allocation.

Obviously, when a is short, you want to keep handing out hot chunks
before they become cold.

For b, one could easily find this out by profiling: when free happens,
do reads of the chunk's memory and measure how long they take.
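
A sketch of that measurement; the thresholds would be machine-specific,
and clock_gettime is used here instead of a raw cycle counter just to
keep the example portable:

#include <stddef.h>
#include <time.h>

/* Time one pass over the chunk, touching each cache line: a fast
   pass suggests the lines are still in cache, a slow one that they
   have already gone cold.  */
static long
read_latency_ns (const volatile char *p, size_t n)
{
  struct timespec a, b;
  volatile char sink = 0;
  clock_gettime (CLOCK_MONOTONIC, &a);
  for (size_t i = 0; i < n; i += 64)
    sink ^= p[i];
  clock_gettime (CLOCK_MONOTONIC, &b);
  (void) sink;
  return (b.tv_sec - a.tv_sec) * 1000000000L
         + (b.tv_nsec - a.tv_nsec);
}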

3. Spatial locality and prefetching.

For small allocations this is related to 1a: for subsequent allocations
you want to hand out chunks in the same cache line, as they are likely
to be accessed together.

Another consideration is to try to make subsequent allocations use a
contiguous memory area to maximize spatial locality. As a heuristic,
these chunks will be accessed in the same order they were allocated,
which lets the hardware prefetcher do its job.
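
A sketch of handing out consecutive requests from one contiguous run,
so allocation order matches memory order; no free handling, purely
illustrative:

#include <stddef.h>

struct run
{
  char *cur;   /* next free byte in the run  */
  char *end;   /* one past the end of the run  */
};

/* Hand out consecutive requests back to back; capacity is assumed
   already rounded to a cache-line multiple as in point 1.  */
static void *
bump_alloc (struct run *r, size_t capacity)
{
  if ((size_t) (r->end - r->cur) < capacity)
    return NULL;   /* run exhausted: caller starts a new one  */
  void *p = r->cur;
  r->cur += capacity;
  return p;
}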


[1]
For this profiling, one would reuse the inlining code to add
information about each call site; other means of identifying the call
site are significantly more time-intensive. The impact here is minimal:
for an allocation it is just computing tcb+atomic_malloc_site_offset.
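
A sketch of what the per-site fast path could look like; here __thread
and a fixed variable stand in for the real tcb+atomic_malloc_site_offset
computation, and refill_bucket is the hypothetical slow path:

#include <stddef.h>

struct site_data
{
  void *bucket;           /* free list this site allocates from  */
  unsigned alloc_count;   /* allocations seen from this site  */
};

/* Hypothetical slow path: refill the bucket, do the profiling.  */
static void *
refill_bucket (struct site_data *s)
{
  (void) s;
  return NULL;
}

/* One slot per call site; the real scheme computes the address from
   the TCB plus a static per-site offset instead of naming a variable.  */
static __thread struct site_data site_123;

static inline void *
site_alloc (struct site_data *s)
{
  s->alloc_count++;
  void *c = s->bucket;
  if (c == NULL)
    return refill_bucket (s);
  s->bucket = *(void **) c;   /* pop the singly linked free list  */
  return c;
}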

For the first few allocations, this would point to a data structure set
up to force a call to refill the bucket, where one returns chunks using
another of the planned features: an allocation-specific free callback.

This works by letting one specify that, for chunks on a given page, a
given callback will be called instead of the normal free. This has
nearly zero effect on free performance, as the page metadata is set up
to fall through to the slow path before one ever needs to check whether
there is a callback.
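
A sketch of the free side under that scheme, assuming free already
consults per-page metadata to choose between fast and slow paths;
ordinary_free is a hypothetical stand-in for the normal slow free:

#include <stddef.h>

struct page_info
{
  int force_slow_path;                  /* routes free into the slow path  */
  void (*free_callback) (void *chunk);  /* NULL => normal slow free  */
};

/* Hypothetical stand-in for the normal slow-path free.  */
static void
ordinary_free (void *chunk)
{
  (void) chunk;
}

static void
slow_free (struct page_info *m, void *chunk)
{
  /* This test runs only on the slow path, so the common fast path
     of free never pays for the callback feature.  */
  if (m->free_callback != NULL)
    m->free_callback (chunk);
  else
    ordinary_free (chunk);
}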

That callback will keep statistics on how many of the allocated chunks
were freed.

On the 32nd allocation (for example) from a given site, we change the
offset or site data so that the site allocates from the long-lived heap.
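
A sketch of the heuristic with the counting callback plugged in; the
window length and the half cutoff are placeholders for whatever the
profiling actually justifies:

#define PROFILE_ALLOCS 32   /* length of the profiling window  */

struct site_profile
{
  unsigned allocs;
  unsigned frees;
  int long_lived;   /* decided once the window closes  */
};

/* Installed as the page's free callback during the window.  */
static void
profile_free (struct site_profile *s)
{
  s->frees++;
}

/* Run on each allocation; after the window, sites whose chunks
   mostly outlive it are moved to the long-lived heap.  */
static void
classify_site (struct site_profile *s)
{
  if (++s->allocs == PROFILE_ALLOCS)
    s->long_lived = (s->frees < PROFILE_ALLOCS / 2);
}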

If the user did the equivalent of -fprofile-generate, ran some
profiling (what exactly doesn't matter much, as gcc and this both just
need the basics), and then -fprofile-use, the results would be more
accurate, and recognizing whether long-lived data is hot or cold
becomes doable.
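
That is, a workflow along the lines of the standard gcc PGO flags,
where the training run is whatever workload the user cares about:

gcc -fprofile-generate app.c -o app
./app                              # training run, writes profile data
gcc -fprofile-use app.c -o app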


