pthread_mutex_unlock takes 15x longer than pthread_mutex_lock

Akash Baid akashbaid@gmail.com
Mon Oct 2 20:28:00 GMT 2017


Thanks for the insights!

I tested with pthread_spin_lock and also reduced the critical section to a
single counter increment, but neither changed the results at all. See the
extracted results from a 10-second VTune snapshot below. I am starting to
wonder whether the difference between the unlock and lock calls can somehow
be explained by the interaction of the stack with the function calls into the
dynamic library.

By the way, I am using the default mutex, which is defined
as PTHREAD_MUTEX_NORMAL, and this is what I use to initialize them:

    pthread_spin_init(&m_spinLock, PTHREAD_PROCESS_PRIVATE);
    pthread_mutex_init(&m_mutexLock, NULL);
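
If it matters, my understanding is that the NULL attribute argument is
equivalent to explicitly requesting the default type, which glibc defines to
be the same as PTHREAD_MUTEX_NORMAL, i.e. roughly this sketch (not what the
application actually does):

    pthread_mutexattr_t attr;

    /* Sketch: explicit-attribute equivalent of the NULL init above.
     * On glibc, PTHREAD_MUTEX_DEFAULT and PTHREAD_MUTEX_NORMAL are the same. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
    pthread_mutex_init(&m_mutexLock, &attr);
    pthread_mutexattr_destroy(&attr);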


-----------------------------------------------------------
Line   Source                                    Total time
-----------------------------------------------------------
379    pthread_spin_lock(&m_spinLock);           0.052s
380    i = 1000;
381    pthread_spin_unlock(&m_spinLock);         0.592s
382    i = 2000;
383    pthread_mutex_lock(&m_mutexLock);         0.037s
384    i = 3000;
385    pthread_mutex_unlock(&m_mutexLock);       0.615s
386    i = 4000;                                 0.001s
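
For what it's worth, here is the kind of standalone cross-check I have in
mind (a hypothetical sketch, not the actual application code): it times the
lock and unlock calls separately with clock_gettime() instead of relying on
VTune's per-line attribution, and it calls both through the same kind of
cached function pointer so that any PLT / indirect-call overhead is identical
for the two:

    /*
     * Hypothetical standalone cross-check, not the actual application code:
     * time lock and unlock separately with clock_gettime() instead of
     * relying on VTune's per-line attribution.  Both calls go through the
     * same kind of cached function pointer, so any PLT / indirect-call
     * overhead is identical for lock and unlock.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t m_mutexLock = PTHREAD_MUTEX_INITIALIZER;

    static long long now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void)
    {
        int (*lock_fn)(pthread_mutex_t *) = pthread_mutex_lock;
        int (*unlock_fn)(pthread_mutex_t *) = pthread_mutex_unlock;
        const int iters = 1000000;
        long long lock_ns = 0, unlock_ns = 0;
        long long a, b, c, d;
        volatile int i = 0;
        int k;

        for (k = 0; k < iters; k++) {
            a = now_ns();
            lock_fn(&m_mutexLock);
            b = now_ns();
            i = i + 1;          /* single counter increment, as in the test above */
            c = now_ns();
            unlock_fn(&m_mutexLock);
            d = now_ns();
            lock_ns += b - a;
            unlock_ns += d - c;
        }
        /* clock_gettime() itself costs a few tens of ns per call, but that
         * overhead is the same for both measurements. */
        printf("avg lock:   %lld ns\n", lock_ns / iters);
        printf("avg unlock: %lld ns\n", unlock_ns / iters);
        return 0;
    }

It needs -lpthread (and -lrt for clock_gettime with glibc 2.13); the timing
overhead is the same on both sides of the comparison, so the lock/unlock
asymmetry should still show up if it is real.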


On Mon, Oct 2, 2017 at 3:07 AM, Ralph Loader <ralph1loader@gmail.com> wrote:

> Hi,
>
> At least when there is no contention on the lock, the important part of
> both the lock and unlock on x86 is an atomic operation to update the lock
> state.
>
> The time that an atomic operation takes on x86 is highly variable. I
> believe that this is because pending data in processor write buffers are
> flushed into the memory hierarchy by the atomic operation.
>
> I am guessing that your application does a memcpy() or similar of the
> packet before the unlock, and flushing that out of the write buffers is
> causing the timing you observe.
>
> (Incidentally, I learnt about this stuff because I too observed very similar
> behavior in a packet processing application.)
>
> One possible mitigation is to use a pthread spinlock instead of a pthread
> mutex; the former is released (on x86) by a normal write, not an atomic op.
> This will make the unlock faster (but possibly take longer for the unlock
> to be observed by other threads).
>
> Cheers,
> Ralph.
>
>
> On Mon, Oct 2, 2017 at 9:20 AM, Akash Baid <akashbaid@gmail.com> wrote:
>
>> Hi,
>>
>> I have a packet processing application which acquires and releases a
>> pthread_mutex_lock for every packet (around 200k times a second). When
>> profiling the performance through Intel VTune, I noticed that the unlock
>> operation takes about 370ns compared to only 25ns for the lock operation.
>> The numbers are averaged over millions of operations and are independent of
>> the size of the critical section.
>>
>> I suspected lock contention to be playing a role here and thus have ensured
>> that no other thread accesses this lock, but the numbers still hover around
>> the same mark. Any ideas as to whether this is expected behavior and, if
>> not, what might be going on? My understanding was that the unlock operation
>> under no contention should be free of system calls and extremely fast. In
>> fact, single-stepping through the instructions suggests that there aren't any
>> system calls, but maybe the LOCK assembly instruction inside the low-level
>> lock routine lll_unlock is taking a long time in the unlock case for some
>> reason?
>>
>> The application uses the libpthread dynamic shared library, glibc
>> version 2.13, on an x86_64 GNU/Linux VM on an Intel Xeon E5 2.5GHz
>> machine.
>>
>
>


