Bug 20291 - Set scheduler-locking on doesn't work correctly with pthread_create
Summary: Set scheduler-locking on doesn't work correctly with pthread_create
Status: UNCONFIRMED
Alias: None
Product: gdb
Classification: Unclassified
Component: threads (show other bugs)
Version: HEAD
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-22 18:11 UTC by Martin Galvan
Modified: 2018-05-16 18:54 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Example code that triggers the bug (200 bytes, text/x-csrc)
2016-06-22 18:11 UTC, Martin Galvan
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Galvan 2016-06-22 18:11:40 UTC
Created attachment 9350 [details]
Example code that triggers the bug

In the following code:

#include <stdio.h>
#include <pthread.h>

void *thread_func(void *unused)
{
    puts("Child thread.");

    return NULL;
}

int main()
{
    pthread_t thread;

    if (pthread_create(&thread, NULL, thread_func, NULL) == 0) /* Stop here */
    {
        puts("Child thread created.");

        if (pthread_join(thread, NULL) == 0)
        {
            puts("Child thread joined.");
        }
    }

    return 0;
}

If I stop at the "Stop here" line (before calling pthread_create), issue a "set scheduler-locking on", and then a "next", the child thread will still run (and finish) before "next" finishes:

Temporary breakpoint 1, main () at example.c:15
15	    if (pthread_create(&thread, NULL, thread_func, NULL) == 0) /* Stop here */
(gdb) set scheduler-locking on
(gdb) next
[New Thread 0x7ffff77f6700 (LWP 13087)]
Child thread.
[Thread 0x7ffff77f6700 (LWP 13087) exited]
17	        puts("Child thread created.");
(gdb)

This doesn't happen with gdb 7.7.1, so I'm guessing some later commit introduced the bug. The schedlock tests run fine for the current git master, so I guess this wasn't covered by the testsuite.
Comment 1 Martin Galvan 2016-06-22 21:09:34 UTC
Using git-bisect I've traced the issue back to commit c1a747c10948e2298083179f4e8aeed8b962e2af:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git&a=commit&h=c1a747c10948e2298083179f4e8aeed8b962e2af

I'm still surprised that it didn't cause any regressions on the schedlock tests, though. I'll take a deeper look at this and report back.
Comment 2 Pedro Alves 2016-06-22 21:20:34 UTC
What I imagine is happening is that:

- Before, whenever a new thread appeared, everything would stop for the thread  event breakpoint, and then the event got reported to infrun.c.  infrun.c would then only continue the thread that was originally resumed.

- Now, new threads don't cause a infrun.c-visible stop.  linux-nat.c:resume_stopped_resumed_lwps resumes the new thread right after we handle the PTRACE_EVENT_CLONE that discovered the new thread.

The problem is that the target backend has no idea whether infrun wants new threads to be resumed or whether it wants them stopped.

I think one solution would for infrun to enable the new support for reporting thread create/exit events (target_thread_events -> TARGET_WAITKIND_THREAD_CREATED) whenever schedlock is in effect.

> I'm still surprised that it didn't cause any regressions on the schedlock 
> tests, though. I'll take a deeper look at this and report back.

I think schedlock.c doesn't spawn new threads while the meat of the test is running.  Instead, it first starts all threads in a preparation phase.
Comment 3 Pedro Alves 2016-06-23 09:11:45 UTC
Note that nothing in gdb ever tried to make sure that "set scheduler-locking on" locks new threads.  Many targets can't even trap when a new thread is created.

Actually, even when debugging against GNU/Linux gdbserver, even before that patch, I believe you'll see the child running free, because gdbserver hasn't been using libthread_db.so thread event breakpoints for a long while.  Even when it did, those thread event stops weren't reported to gdb, so the new threads would still run free.

I think it was mainly working "by chance" before.
Comment 4 Pedro Alves 2016-06-23 09:26:01 UTC
> I think it was mainly working "by chance" before.

And also because the thread event breakpoint is triggered on the _parent_ thread.
If you enable "set debug infrun 1" / "set debug lin-lwp 1" on an older gdb, you should see that the child manages to run a bit before the event breakpoint triggers.  When the breakpoint is hit, then everything is stopped, and then infrun only resumes the parent.
Comment 5 Martin Galvan 2016-06-23 14:06:50 UTC
(In reply to Pedro Alves from comment #3)
> Note that nothing in gdb ever tried to make sure that "set scheduler-locking
> on" locks new threads.  Many targets can't even trap when a new thread is
> created.
> 
> Actually, even when debugging against GNU/Linux gdbserver, even before that
> patch, I believe you'll see the child running free, because gdbserver hasn't
> been using libthread_db.so thread event breakpoints for a long while.  Even
> when it did, those thread event stops weren't reported to gdb, so the new
> threads would still run free.
> 
> I think it was mainly working "by chance" before.

In that case, should this even be considered a bug? Perhaps we could add something to the documentation so that other users won't trip on this again.
Comment 6 Pedro Alves 2016-06-23 14:16:09 UTC
> In that case, should this even be considered a bug? Perhaps we could add 
> something to the documentation so that other users won't trip on this again.

I'm not sure, haven't made up my mind about it myself.

The fact that is only worked by chance doesn't mean that we shouldn't fix it properly.

My knee-jerk reaction was that it's a bug, though I'm slightly mildly concerned
about the potential maybe-surprising hangs over pthread_create and similars.
But maybe not that concerned.  The user did ask for nothing else to be scheduled.
Comment 7 Martin Galvan 2016-06-23 17:53:41 UTC
(In reply to Pedro Alves from comment #6)
> > In that case, should this even be considered a bug? Perhaps we could add 
> > something to the documentation so that other users won't trip on this again.
> 
> I'm not sure, haven't made up my mind about it myself.
> 
> The fact that is only worked by chance doesn't mean that we shouldn't fix it
> properly.
> 
> My knee-jerk reaction was that it's a bug, though I'm slightly mildly
> concerned
> about the potential maybe-surprising hangs over pthread_create and similars.
> But maybe not that concerned.  The user did ask for nothing else to be
> scheduled.

That sounds to me like an implementation detail leaking out of glibc through gdb. If I understood correctly, the parent thread locks a mutex in __pthread_create_2_1 so that the child will block inside start_thread before it can jump to its user-provided start routine. IMHO, as a user I shouldn't have to be aware of such details when using a (seemingly) inocuous command.
Comment 8 Pedro Alves 2016-06-23 17:58:26 UTC
What do you suggest gdb should do?

Baking in awareness of glibc's internals in order to let the child run a bit is not that appealing.

Note this sort of hang can happen when you step over _any_ function that blocks on some synchronization primitive (with schedlock on).
Comment 9 Martin Galvan 2016-06-23 19:39:27 UTC
(In reply to Pedro Alves from comment #8)
> Baking in awareness of glibc's internals in order to let the child run a bit
> is not that appealing.

Agreed. That would be ugly.

> Note this sort of hang can happen when you step over _any_ function that
> blocks on some synchronization primitive (with schedlock on).

I'm going through the glibc code again and I'm kind of lost here. You said the hang would arise because the parent thread would wait forever for the child to release the lock. However, create_thread first locks pd->lock, and then calls clone (through ARCH_CLONE), passing it the glibc-internal start_thread as starting routine. start_thread, in turn, will try to lock that same lock, blocking the child. The parent then has a chance to set some more stuff on the child, and then it unlocks it so the child can go on and call its user-provided routine.

What this means is, the child waits for the parent and not the other way around. If schedlock somehow manages to prevent the child from ever being scheduled, the parent shouldn't mind (unless of course the user later calls pthread_join, but that's their problem :)).

In any case, I think you're right in that the parent might hang elsewhere. My point was that it's a bad thing if it happens in library code, but the user can reasonably be expected to be aware of this in their own code.

> What do you suggest gdb should do?

I'm not sure. I'm just starting to dwell in this part of gdb, so I'm gonna need a bit more inspecting before I can make a proper suggestion.
Comment 10 Pedro Alves 2016-06-23 19:58:25 UTC
Well, I actually said:

 "IF there's synchronization (mutex/futex/whatnot) between
 the parent/child threads inside glibc's pthread_create"

New emphasis on the IF.  I haven't looked at the code in that level of detail. 
I was merely pointing out a _potential_ problem with the use of schedlock in the new glibc printers you're working on (as discussed in the libc-alpha@).

(It's still true that the child manages to run a bit with older gdbs.  But with your observation, that ends up not making a practical difference for the issues at hand.)

> What this means is, the child waits for the parent and not the other way around.

OK, then the concern is moot.  (unless glibc changes, of course.)