On Thu, Mar 26, 2015 at 02:42:45PM +0100, Leonhard Holz wrote:
> Hmm.. the malloc/free code has different paths depending on requested block
> size, including different locking procedures, so I would like to keep some
> kind of size variance.
Yes, the primary distinct paths are:
1. fastbin allocation
2. regular allocation on the heap
We're only interested in these two since mmap doesn't need any
locks. In fact, the allocation sizes chosen will never even result in
an mmap, so we don't even have to think about it.
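To make that concrete, here is a minimal sketch of request sizes that should hit each of the two locked paths. The fastbin cutoff is tunable via mallopt(M_MXFAST), so the specific sizes below are assumptions based on typical 64-bit glibc defaults, not guarantees:

```c
/* Sketch: request sizes intended to exercise the two locked paths.
   Small requests are served from fastbins, larger ones from the
   regular heap bins; the exact cutoff depends on M_MXFAST. */
#include <stdlib.h>

#define FASTBIN_REQ 32    /* well under any default fastbin limit */
#define REGULAR_REQ 1024  /* large enough for the regular bin path */

int exercise_paths(void)
{
    void *fast = malloc(FASTBIN_REQ);    /* path 1: fastbin */
    void *regular = malloc(REGULAR_REQ); /* path 2: regular bins */

    if (fast == NULL || regular == NULL)
        return 1;
    free(fast);
    free(regular);
    return 0;
}
```

Neither size comes anywhere near the mmap threshold (128 KiB by default), so the mmap path stays out of the picture.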
> Also the current implementation is not that different
> from your proposed scheme; actually all threads act as producers and consumers
> that are coupled via the round-robin block array, which acts as some sort of
> queue. And a queue to connect the threads is needed anyway (?), either with
> extra data structures (which implies malloc which implies locking) or
> explicit locking as implemented.
It is similar, but the crucial difference is that you don't know for
sure how much the threads contend for arenas. You partition the
working set to give each thread a separate set of indices and sizes.
The indices may overlap (and hence cause the contention that we want
to see) but how much they overlap is not clear from just reading the test.
> So I wouldn't say that the proposed version is not improvable, but please be
> a bit more specific about what should be achieved and how the proposed way
My suggestion was to make threads either producers or consumers, not
both. It was important for the earlier benchmark to mix up allocation
sizes as well as locations because it tested performance of the
algorithm in general and not lock contention. In this case, your
primary goal would be to measure the performance impact of contention
for the arena lock, so mixing up allocation locations and sizes doesn't
add a lot of value; it only complicates the test.
An easier way would be to spawn a producer consumer pair where one
allocates and the other frees. The next case would be one producer
and multiple consumers and the final case would be multiple producers
and one consumer. Repeat for larger sets of such threads, i.e. 4
producers and consumers, 2 sets of 1 producer and 4 consumers, and so
on. In all cases, take timing values only around a single malloc or
free and nothing else.