This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Remove atomic operations from malloc.c


On 11-02-2015 09:01, Torvald Riegel wrote:
> On Wed, 2015-02-11 at 09:30 +0100, Leonhard Holz wrote:
>> The fastbin path of malloc/free features atomic operations, which were probably an 
>> attempt for lock-free code. But there is an ABA problem in malloc's
>>
>> while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
>>          != victim);
>>
>> because even if *fd == victim, victim->fd might have changed in the meantime. 
>> There is no easy way to fix this, an old comment mentions the need for a CAS2 
>> operation with tagged pointers. Consequently malloc is right now wrapped by a lock 
>> to avoid problems.
> I'll have a look at the correctness side of this soon.
>
>> Anyhow, while the atomic ops avoid a lock in free they are costly too. Especially 
>> at high contention the compare_and_exchange may fail multiple times (see 
>> http://blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms/ for a 
>> discussion). So I measured how removing atomics and installing a lock around free 
>> affects performance and it turns out (at least in my case with 2 cores), that it 
>> has no effect:
>>
>> Current implementation
>> threads time per iteration
>> 1       116.709
>> 8       705.080
>> 16      1394.74
>> 32      2923.03
>>
>> Without atomics
>> threads time per iteration
>> 1       112.541
>> 8       715.897
>> 16      1403.67
>> 32      2881.30
> If your machine has just two cores, then at the very least you should
> measure for just two threads too; a bigger number of threads is not
> putting more contention on any of the synchronization bits, there's just
> some more likelihood to having to wait for a thread that isn't running.
>
> Also, to really assess performance, this has to be benchmarked on a
> machine with more cores.  Additionally, you could argue why it should
> not make a difference, and if that's a compelling argument, we could
> follow it instead of the benchmark (which, as Will mentions, is hard to
> make representative of real-world workloads).
>
I did get into the changes itself, but at least for powerpc (POWER8/16c/128T)
I am not seeing improvements with the patch.  In fact it seems to increase
contention:

           time per iteration
nths       master     patch
1    	   51.422    75.046
8          53.077    78.507
16         57.430    89.385
32         71.206   108.359
64        114.370   172.115
128       251.731   330.924


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]