This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: The direction of malloc?
- From: Torvald Riegel <triegel at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
- Date: Wed, 18 Dec 2013 15:35:53 +0100
- Subject: Re: The direction of malloc?
- Authentication-results: sourceware.org; auth=none
- References: <20131210121622 dot GA5416 at domone dot podge> <52A75502 dot 6040500 at linux dot vnet dot ibm dot com> <20131210210541 dot GA19161 at domone dot podge> <1387213140 dot 23049 dot 8010 dot camel at triegel dot csb> <20131216212334 dot GA21284 at domone dot podge> <1387285197 dot 23049 dot 9075 dot camel at triegel dot csb> <20131217190817 dot GA32756 at domone dot podge> <1387324884 dot 23049 dot 10462 dot camel at triegel dot csb> <20131218101110 dot GB5472 at domone dot podge> <1387368668 dot 23049 dot 10661 dot camel at triegel dot csb> <20131218140501 dot GA8292 at domone dot podge>
On Wed, 2013-12-18 at 15:05 +0100, Ondřej Bílka wrote:
> On Wed, Dec 18, 2013 at 01:11:08PM +0100, Torvald Riegel wrote:
> > > > > A real implementation will be a bit faster, as dynamic TLS slows this
> > > > > down a bit.
> > > > >
> > > > > Memory system effects are not a factor here, as the allocation pattern
> > > > > is identical (stack in both cases).
> > > >
> > > > I was referring to memory system effects when the data is actually
> > > > accessed by an application. It does matter on which page allocated data
> > > > ends up (and where on a page relative to other allocations).
> > >
> > > And as it ends up at the same virtual address, because the algorithms are
> > > identical, this does not matter here.
> > You seemed to say that you want to move from concurrent code with
> > synchronization to nonconcurrent code. As far as I was able to
> > interpret what you wrote, it seemed that you wanted to move from
> > (potentially) concurrent allocation from fastbins(?) to strictly
> > per-thread allocation bins. Unless the former is concurrent and uses
> > synchronization for no reason, there can be situations in which threads
> > allocate from more areas than before.
> Not necessarily; you could add a check whether it belongs to the thread and
> use the standard free if not. I wrote a bit more about that here, mainly
> to get comments.
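Just so we are talking about the same thing, here is a minimal sketch of that
ownership check as I understand it. All names here (tchunk, tbin, tbin_alloc,
tbin_free) are made up for illustration, the sketch always falls back to
malloc and never reuses its free list, and it is not how the fastbins actually
work:

#include <stdlib.h>
#include <stddef.h>

struct tbin;                        /* per-thread bin (illustrative) */

typedef struct tchunk {
    struct tbin   *home;            /* bin the chunk was allocated from */
    struct tchunk *next;            /* free-list link */
    char data[];
} tchunk;

typedef struct tbin {
    tchunk *free_list;              /* only touched by the owning thread */
} tbin;

static __thread tbin my_bin;        /* one bin per thread */

static void *tbin_alloc(size_t n)
{
    /* Sketch only: always get memory from malloc, but tag the chunk with
       its home bin so that free can tell whether it is local.  */
    tchunk *c = malloc(sizeof *c + n);
    if (c == NULL)
        return NULL;
    c->home = &my_bin;
    return c->data;
}

static void tbin_free(void *p)
{
    tchunk *c = (tchunk *) ((char *) p - offsetof(tchunk, data));
    if (c->home == &my_bin) {
        /* Fast path: the chunk belongs to this thread's bin, so it can be
           pushed back without synchronization.  */
        c->next = my_bin.free_list;
        my_bin.free_list = c;
    } else {
        /* Allocated by another thread: take the standard free path
           (plain free() in this sketch).  */
        free(c);
    }
}

int main(void)
{
    char *p = tbin_alloc(16);
    tbin_free(p);                   /* same thread: takes the fast path */
    return 0;
}

The point is only that free needs some cheap way to map a pointer back to its
owning thread before it can take the unsynchronized path.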
> > > > The speed
> > > > of allocations is just one thing. For example, just to illustrate, if
> > > > you realloc by copying to a different arena, this can slow down programs
> > > > due to NUMA effects if those programs expect the allocation to likely
> > > > remain on the same page after a realloc; such effects can be much more
> > > > costly than a somewhat slower allocation.
> > >
> > > You cannot optimize code for the unlikely case.
> > We do NOT know what the unlikely case is, currently. This is why I
> > suggested starting with analyzing application workloads and access
> > patterns, building a model of them (i.e., informal, but at a level of
> > detail that is sufficient to actually agree on a clear set of assumptions
> > and not just handwaving), documenting it, and discussing it with the rest
> > of the community.
> > > When memory is allocated
> > > in thread A and reallocated in thread B, there could be three cases:
> > >
> > > 1) Memory is passed to thread B which primarily access it.
> > > 2) Memory is shared between A and B.
> > > 3) Memory is primarily accessed by thread A.
> > >
> > > As the effect of cases 1) and 3) is symmetrical
> > Yes, both can happen, and there might always be a trade-off, and however
> > you decide, you might decrease performance in some situations.
> > > it suffices to estimate which
> > > one is more likely, and case 1 seems the best candidate.
> > We do NOT know that. If you do, please show the evidence.
> > > Realloc definitely does move in most cases, as a common usage pattern is
> > > doubling the allocated size, and as we use best fit there is not enough room.
> > How do you know it's really a common usage pattern? And, why should it
> > not just be common but one of the most common usage patterns? What is
> > common? Which applications? And so on...
> It is about the only way to avoid a quadratic slowdown when repeatedly reallocating.
You didn't answer my questions. I asked, among other things, why it should
be the most common usage pattern. Your answer is that *if* programs are
repeatedly reallocating, then they likely want to double the allocation
size. So, why are repeated reallocations the most common usage pattern?
And why do we have to optimize for the overhead of the realloc calls
themselves, compared to how other applications might be using allocated
memory differently (e.g., compared to NUMA access overheads, or whatever
else)?
I appreciate that you have ideas for how to improve the allocator. But
at the very least, those need to be validated *thoroughly*, and that
involves looking at all applications using the allocator, not just
applications where the optimization might help.
I believe that once you do a thorough evaluation, you will get a lot
more ideas of how to perhaps improve things, and probably find ideas
that can have a higher impact on performance. Therefore, I would
suggest starting with the analysis, rather than starting with a
specific idea and going bottom-up.
> Ideally this should not be necessary, as we would preallocate twice what is
> requested in realloc, but we do not do this yet.
I doubt that this idea is worthwhile to pursue, because you're making a
very broad assumption here, namely that all uses of realloc would be part
of frequent reallocation and that the space and address space overheads
would be acceptable.
> This affects gcc, which repeatedly tries to
> extend a buffer by 8.
Maybe that's inefficient on GCC's part (although I suppose they have
(or had, at some point) a reason to do it this way), but that doesn't
mean we need to optimize towards just that in glibc's general-purpose
allocator. Do you know why GCC made this choice?
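For concreteness, here is a rough sketch of the two growth patterns being
discussed. With a fixed +8 step, every realloc that has to move can copy the
whole current buffer, so total copying grows roughly quadratically with the
final size; with doubling, the copied bytes form a geometric series and stay
within a small constant factor of the final size. This is illustrative only,
not a benchmark of the current allocator, and whether realloc actually moves
depends on the implementation:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    const size_t target = 1 << 20;   /* grow a buffer to 1 MiB */
    size_t cap;
    char *buf;
    char *tmp;

    /* Fixed-step growth, the pattern attributed to gcc above: on the
       order of target / 8 realloc calls, each of which may copy up to
       cap bytes.  */
    cap = 8;
    buf = malloc(cap);
    while (buf != NULL && cap < target) {
        cap += 8;
        tmp = realloc(buf, cap);
        if (tmp == NULL)
            break;                   /* keep the old buffer; stop growing */
        buf = tmp;
    }
    free(buf);

    /* Doubling growth: roughly log2(target) calls, and the total bytes
       copied are bounded by about 2 * target.  */
    cap = 8;
    buf = malloc(cap);
    while (buf != NULL && cap < target) {
        cap *= 2;
        tmp = realloc(buf, cap);
        if (tmp == NULL)
            break;
        buf = tmp;
    }
    free(buf);

    puts("done");
    return 0;
}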
> To test these, use the following program.
Why don't you build infrastructure to get similar traces for this and
similar relevant behavior of applications, and ways for us to store and
analyze this data with a high level of automation? That is, collect the
allocation patterns, and build analyses that feed into a model of how
applications use the allocator? Based on that, we can draw conclusions
and evaluate changes to the allocator; and it would also allow us to
test our assumptions in the future.
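To sketch what I mean by trace infrastructure: an LD_PRELOAD interposer along
the following lines could record the malloc/realloc/free stream of an
unmodified application for later analysis. This is only a sketch; the
interposed names are the standard ones, but a real tool would also cover
calloc, memalign and friends, deal with the dlsym bootstrap and recursion
issues properly, and write the trace somewhere better than stderr:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static void *(*real_malloc)(size_t);
static void *(*real_realloc)(void *, size_t);
static void  (*real_free)(void *);

static void init(void)
{
    if (real_malloc == NULL) {
        real_malloc  = dlsym(RTLD_NEXT, "malloc");
        real_realloc = dlsym(RTLD_NEXT, "realloc");
        real_free    = dlsym(RTLD_NEXT, "free");
    }
}

/* Log one event as "op old-ptr new-ptr size", formatted into a stack
   buffer and emitted with write(2) so the logging path itself does not
   go back into the allocator.  */
static void trace(const char *op, void *old, void *new, size_t size)
{
    char buf[128];
    int n = snprintf(buf, sizeof buf, "%s %p %p %zu\n", op, old, new, size);
    if (n > 0)
        write(2, buf, (size_t) n);
}

void *malloc(size_t size)
{
    init();
    void *p = real_malloc(size);
    trace("malloc", NULL, p, size);
    return p;
}

void *realloc(void *p, size_t size)
{
    init();
    void *q = real_realloc(p, size);
    trace("realloc", p, q, size);
    return q;
}

void free(void *p)
{
    init();
    trace("free", p, NULL, 0);
    real_free(p);
}

Built as a shared object (the file name alloc_trace.c is just a placeholder),
it could be used roughly like

gcc -shared -fPIC -o alloc_trace.so alloc_trace.c -ldl
LD_PRELOAD=./alloc_trace.so ./some_app 2> allocs.log

and the resulting log analyzed offline to extract allocation size, lifetime,
and reallocation patterns.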