Need help to interpret pthread_mutex_t structure as printed by gdb

Adhemerval Zanella adhemerval.zanella@linaro.org
Tue Aug 16 20:44:00 GMT 2016


For thread race condition checks I usually run Thread Sanitizer
(both GCC and Clang support it).
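
As an illustration, TSan flags a textbook race like the one below as
soon as both threads run (a made-up snippet, not related to your code):

/* build: gcc -g -fsanitize=thread -pthread race.c -o race */
#include <pthread.h>
#include <stdio.h>

static int counter;             /* shared and unprotected: a data race */

static void *worker (void *arg)
{
  (void) arg;
  for (int i = 0; i < 100000; i++)
    counter++;                  /* racy read-modify-write */
  return NULL;
}

int main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, worker, NULL);
  pthread_create (&t2, NULL, worker, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  printf ("counter = %d\n", counter);
  return 0;
}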

But again, bug reports for runtime issues in default usage modes are
usually debugged with a *reproducible* testcase. It could be either a
snippet that stresses the issue or a more complex scenario describing
how to trigger it.
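
For instance, even something as small as the sketch below would help,
assuming (and this is only my assumption) that the mutex in question is
PTHREAD_PROCESS_SHARED and lives in memory shared between the client
and server processes:

/* Hypothetical stress snippet: two processes hammering one
   process-shared mutex in a shared mapping.
   build: gcc -g -pthread pshared-stress.c -o pshared-stress */
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main (void)
{
  /* Put one mutex in an anonymous shared mapping, inherited across fork.  */
  pthread_mutex_t *m = mmap (NULL, sizeof *m, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (m == MAP_FAILED)
    {
      perror ("mmap");
      return EXIT_FAILURE;
    }

  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
  pthread_mutex_init (m, &attr);

  pid_t pid = fork ();
  /* Parent and child both hammer the same mutex.  */
  for (long i = 0; i < 10000000; i++)
    {
      pthread_mutex_lock (m);
      pthread_mutex_unlock (m);
    }
  if (pid == 0)
    _exit (0);                  /* child done */
  waitpid (pid, NULL, 0);
  puts ("done");
  return 0;
}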

What is somewhat frustrating is a report with partial debug
information, where one must constantly assume or guess scenarios about
the originating environment.

So if you could provide a way to reproduce this issue, it would be
*really* helpful for checking what is actually happening.
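
For the record, in the dump below __data.__lock is the futex word
(for a plain mutex: 0 unlocked, 1 locked, 2 locked with waiters) and
__data.__owner is the TID of the current holder. Both can be read
directly:

(gdb) p mutex->__data.__lock
(gdb) p mutex->__data.__owner

An all-zero mutex while a thread is still blocked in __lll_lock_wait
usually means the mutex memory was overwritten or re-initialized while
the waiter was queued, or that two supposedly process-shared mappings
are not actually the same memory.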

On 16/08/2016 16:53, Olle Blomgren wrote:
> I have now run valgrind tools memcheck, helgrind and drd on both
> client and server. The latter two are intended for MT programs using a
> pthread implementation. When I run the tools on the client side, the
> hang never occurs. When I run the tools on the server the hang occurs
> as usual. Although the hang occurs there is nothing in the logs that
> suggests an error that would cause the hang.
> 
> When I use the helgrind tool it is possible to attach a GDB client
> to the valgrind GDB server and query all the mutexes helgrind knows
> about, to see whether any are held. This query (monitor info locks)
> reports 0 locked mutexes, even though at the moment I issue it the
> client is hanging on a mutex it believes is locked.
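> 
> For reference, the attach sequence was roughly the standard one
> ("./server" stands in here for the real binary):
> 
>   valgrind --tool=helgrind --vgdb=yes --vgdb-error=0 ./server
>   gdb ./server
>   (gdb) target remote | vgdb
>   (gdb) monitor info locks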
> 
> I have also run memtest86+ (from the GRUB menu) to ensure my RAM is
> physically OK, and it is.
> 
> Do you have any debug tools or any other hints on what I can try to
> crack this on my own?  If not, I would really appreciate sending my
> code to build and run on an x86_64 machine at the hands of someone
> more knowledgeable than myself.
> 
> 
> On Mon, Aug 15, 2016 at 11:45 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>> On 15/08/2016 18:14, Olle Blomgren wrote:
>>>> Could you provide the glibc version, a complete backtrace of both processes
>>>> and if possible a testcase?
>>>
>>> I use glibc-2.23.
>>> I have downloaded the source and built it with debug symbols. Full
>>> back traces follow below.
>>> Client back trace + printout of offending mutex. (This is the one that hangs)
>>> -------------------------------------------------------------------------------------------------------------------
>>> Thread 3 (Thread 0x7ff8d1fcc700 (LWP 14005)):
>>> #0  0x00007ff8d231b82d in poll () at ../sysdeps/unix/syscall-template.S:84
>>> #1  0x00007ff8d2b39434 in _daemon_monitor (attr=0x7ff8d2da8300
>>> <_mon_thr_attr>) at mam.c:1346
>>> #2  0x00007ff8d2db03e4 in start_thread (arg=0x7ff8d1fcc700) at
>>> pthread_create.c:333
>>> #3  0x00007ff8d23248ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 2 (Thread 0x7ff8d15cb700 (LWP 14006)):
>>> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
>>> #1  0x00007ff8d2db2a75 in __GI___pthread_mutex_lock
>>> (mutex=0x7ff8d16ac810) at ../nptl/pthread_mutex_lock.c:80
>>> #2  0x00007ff8d2b1f962 in lch_lock (name=919552, type=LCH_EXCL,
>>> tx=0x7ff8d3446080, ctx=0x7ff8c4001110)
>>>     at lch.c:124
>>> #3  0x00007ff8d2b40c1c in _get_page_get_entry (entry=0x7ff8d15cab80,
>>> nbr_bytes=8, ctx=0x7ff8c4001110)
>>>     at pge.c:1194
>>> #4  0x00007ff8d2b41b36 in pge_alloc_entry (entry=0x7ff8d15cab80,
>>> nbr_bytes=8, page_sz=1 '\001',
>>>     ctx=0x7ff8c4001110) at pge.c:1416
>>> #5  0x00007ff8d2b5d0bf in api_int_setup_entry
>>> (entry_arr=0x7ff8d15cacb8, page_ptr=0x7ff8d15cac88,
>>>     page_ref=0x7ff8d15cac80, lr=0x7ff8d15cad50, meta=0xf68d50,
>>> tx=0x7ff8d3446080, nbr_entries=1, sct=0,
>>>     ctx=0x7ff8c4001110) at api_int.c:47
>>> #6  0x00007ff8d2b5b410 in hope_new (tx=0x7ff8c4000cd0,
>>> elm_type=0x40b1c9 "NoColl", nv_list=0x40adf9 "tjing")
>>>     at hope.c:2495
>>> #7  0x0000000000405d6d in _new1 (attr=0x0) at api_tests.c:1579
>>> #8  0x00007ff8d2db03e4 in start_thread (arg=0x7ff8d15cb700) at
>>> pthread_create.c:333
>>> #9  0x00007ff8d23248ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 1 (Thread 0x7ff8d34e0700 (LWP 14004)):
>>> #0  0x00007ff8d2db164d in pthread_join (threadid=140706641131264,
>>> thread_return=0x0) at pthread_join.c:90
>>> #1  0x0000000000406252 in _scan_1new40000_1read_1lockmod5remove_3thr
>>> (tc=30, print_desc_only=0)
>>>     at api_tests.c:1682
>>> #2  0x000000000040a1b5 in _all_test_cases (tc=30, nbr_updates=0,
>>> start_value=0, name=0x0,
>>>     ref=18446744073709551615, print_desc_only=0) at api_tests.c:2981
>>> #3  0x000000000040a784 in main (argc=4, argv=0x7fffed178988) at api_tests.c:3084
>>> (gdb) t 2
>>> [Switching to thread 2 (Thread 0x7ff8d15cb700 (LWP 14006))]
>>> #1  0x00007ff8d2db2a75 in __GI___pthread_mutex_lock
>>> (mutex=0x7ff8d16ac810) at ../nptl/pthread_mutex_lock.c:80
>>> 80          LLL_MUTEX_LOCK (mutex);
>>> (gdb) f 1
>>> #1  0x00007ff8d2db2a75 in __GI___pthread_mutex_lock
>>> (mutex=0x7ff8d16ac810) at ../nptl/pthread_mutex_lock.c:80
>>> 80          LLL_MUTEX_LOCK (mutex);
>>> (gdb) p *mutex
>>> $2 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
>>> __kind = 0, __spins = 0, __elision = 0,
>>>     __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats
>>> 39 times>, __align = 0}
>>>
>>>
>>> Server back trace (All threads are just waiting around.)
>>> ------------------------------------------------------------------------
>>> Thread 6 (Thread 0x7f0686b45700 (LWP 14094)):
>>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1  0x00007f0687838d8b in migrator (attr=0x7f0687c4f758) at hsh_int.c:207
>>> #2  0x00007f06875fb3e4 in start_thread (arg=0x7f0686b45700) at
>>> pthread_create.c:333
>>> #3  0x00007f0686e2d8ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 5 (Thread 0x7f0684b41700 (LWP 14098)):
>>> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
>>> #1  0x00007f0687896555 in _do_checkpoint (attr=0x0) at ckp.c:1411
>>> #2  0x00007f06875fb3e4 in start_thread (arg=0x7f0684b41700) at
>>> pthread_create.c:333
>>> #3  0x00007f0686e2d8ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 4 (Thread 0x7f0685342700 (LWP 14097)):
>>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1  0x00007f0687838d8b in migrator (attr=0x7f0687c51280) at hsh_int.c:207
>>> #2  0x00007f06875fb3e4 in start_thread (arg=0x7f0685342700) at
>>> pthread_create.c:333
>>> #3  0x00007f0686e2d8ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 3 (Thread 0x7f0685b43700 (LWP 14096)):
>>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1  0x00007f06878541bc in _rm_log (attr=0x0) at tlg.c:1527
>>> #2  0x00007f06875fb3e4 in start_thread (arg=0x7f0685b43700) at
>>> pthread_create.c:333
>>> #3  0x00007f0686e2d8ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 2 (Thread 0x7f0686344700 (LWP 14095)):
>>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1  0x00007f0687853110 in flush_log_buffer (attr=0x0) at tlg.c:1385
>>> #2  0x00007f06875fb3e4 in start_thread (arg=0x7f0686344700) at
>>> pthread_create.c:333
>>> #3  0x00007f0686e2d8ad in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 1 (Thread 0x7f0687cef700 (LWP 14093)):
>>> #0  0x00007f0686e264f3 in select () at ../sysdeps/unix/syscall-template.S:84
>>> #1  0x00007f068785dcf0 in _daemon_server_loop (attr=0x7ffff2d255d0) at mam.c:730
>>> #2  0x00007f068785e1bc in mam_daemon_server_run (ctx=0x7ffff2d255d0)
>>> at mam.c:860
>>> #3  0x00007f0687870053 in dae_init (instance=0x212d180
>>> "/tmp/func_api", force=1, ctx=0x7ffff2d255d0)
>>>     at dae.c:297
>>> #4  0x00007f068785bf4a in _do_start_daemon (ctx=0x7ffff2d255d0) at clu.c:320
>>> #5  0x00007f068785c388 in clu_execute (ctx=0x7ffff2d255d0) at clu.c:449
>>> #6  0x0000000000400994 in main (argc=6, argv=0x7ffff2d25b58) at db.c:21
>>>
>>> I can see that it is always server thread #5 that interferes with the
>>> hanging client thread in the following way:
>>> The server thread acquires mutex X.
>>> The client thread tries to acquire mutex X (it sits waiting for the
>>> server thread to release it).
>>> The server thread releases mutex X and continues its work...
>>> The client thread never wakes up. It continues to hang on mutex X.
>>>
>>> I have written a test program to try to provoke the same access
>>> pattern, but I cannot make the same thing happen.
>>> I can see that the same access pattern:
>>> 1) server acquires mutex X
>>> 2) client requests mutex X (and blocks)
>>> 3) server releases mutex X
>>> 4) client acquires mutex X
>>> completes successfully numerous times before the client hangs as shown above.
>>>
>>> Can you give me hints on how to continue further or do you want me to
>>> send my code to you?
>>> As seen above the hardware is x86_64.
>>>
>>> Brs
>>> /Olle
>>
>> I think either a testcase or a way to reproduce this issue would be the
>> way to proceed, since the current log is still inconclusive.  I checked
>> the sourceware bug database for issues fixed since 2.19, but you seem to
>> have already tested with a more recent release.
>>
>> I strongly recommend that you first try a memory checker (valgrind,
>> ASan) and/or a thread checker (TSan) to see if you are doing something
>> wrong.
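>> (Concretely: rebuild with "gcc -g -fsanitize=thread" for TSan, or run
>> the unmodified binaries under "valgrind --tool=memcheck" and
>> "valgrind --tool=helgrind".)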


