14829 – sched_* functions wrongly alter thread scheduling, rather than process

Bug 14829 - sched_* functions wrongly alter thread scheduling, rather than process

Summary: sched_* functions wrongly alter thread scheduling, rather than process

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	nptl (show other bugs)
Version:	unspecified

Importance:	P2 normal
Target Milestone:	2.30
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Duplicates (1):	15088 (view as bug list)
Depends on:
Blocks:

Reported:	2012-11-11 19:31 UTC by Rich Felker
Modified:	2019-08-29 14:39 UTC (History)
CC List:	3 users (show)

See Also:	13701
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Rich Felker 2012-11-11 19:31:00 UTC

Per POSIX (XSH 2.8.4 Process Scheduling, subheading Scheduling Policies):

"For threads with system scheduling contention scope, the process scheduling attributes shall have no effect on the scheduling attributes or behavior either of the thread or an underlying kernel scheduling entity dedicated to that thread."

Linux does not seem to support process-level scheduling attributes at all; glibc/NPTL only accept system scheduling contention scope, and returns ENOTSUP when an attempt is made to create a thread with process contention scope. As such, my reading of the above text is that the sched_* functions should be mostly/entirely no-ops. Even if this weren't the case, it's certainly incorrect for them to modify the scheduling attributes of the single thread whose TID happens to be equal to the PID.

I'm unsure what the best way to resolve this issue against application expectations is, but the current behavior is non-conforming and does the wrong thing in multi-threaded applications trying to use these functions for their standard purposes.

Comment 1 Carlos O'Donell 2013-06-14 16:36:47 UTC

I agree that this a mess in Linux right now.

I agree that most of the sched_* functions should actually do nothing in glibc, principally because process level contention scheduling is not supported.

At present these functions change the scheduling for the thread leader whose tid == tgid and which is returned from the call to getpid(). I agree that this is wrong according to POSIX. Thus the functions do not operate as required by POSIX for threaded applications

These functions do operate correctly for non-threaded applications.

One solution is to version the interfaces, and have all of the new version do nothing if threads are active.

When Linux gets proper process scheduling support we can stop doing nothing and do the right sequence of operations to set the pid's (or tgid's) scheduling parameters.

Linux needs to be enhanced to do the following:
- getpid returns pid
- gettid returns tid != pid

All processes whould start off with a unique tgid that can be used as the target of these calls and that is returned via getpid, and that can't be confused with the thread leader.

That would allow these calls to identify the tgid uqiquely.

Comment 2 Rich Felker 2013-06-16 04:46:32 UTC

Carlos, there are major issues with your proposed solution. I know this one applies to musl and I believe it also applies to glibc: the set_tid_address syscall, used when initializing the main thread's TLS, returns the tid, which is also the pid; to avoid an additional wasteful getpid syscall, this value is used for filling in both the tid and pid fields in the TCB. If the kernel were changed so that the main thread's tid were different from its pid, existing code would experience subtle but serious breakage. I think such a change would be contrary to kernel policy.

I also disagree with the claim "These functions do operate correctly for non-threaded applications." The POSIX requirements for sched_* are essentially that they do nothing, even for single-threaded applications:

"For threads with system scheduling contention scope, the process scheduling attributes shall have no effect on the scheduling attributes or behavior either of the thread or an underlying kernel scheduling entity dedicated to that thread."

Since Linux only supports system scheduling contention scope, not process scheduling contention scope, the above text applies even to the main thread of a single-threaded process.

Even if the process scheduling contention scope were supported, I don't think the semantics would match the current Linux sched_* syscalls.

If it's desirable to support the Process Scheduling option of POSIX, I think Linux needs to add new syscalls (perhaps named posix_sched_*), not just adjust the target scope of the current ones based on whether the argument is a pid or a tid. Short of that, I'm not entirely sure what the appropriate course of action for glibc is. I think a good first step would be to redefine _POSIX_PRIORITY_SCHEDULING to -1, reflecting that glibc does not support the Process Scheduling option even at compile-time. With this macro defined as -1, the implementation is not required to provide sched_* functions, and if these functions do happen to exist as they do in glibc, the implementation is under no obligation to endow them with semantics matching the POSIX semantics. This would at least fix the conformance issue and make conforming applications aware that they cannot use these functions.

Comment 3 Carlos O'Donell 2013-06-17 19:34:51 UTC

(In reply to Rich Felker from comment #2)
> Carlos, there are major issues with your proposed solution. I know this one
> applies to musl and I believe it also applies to glibc: the set_tid_address
> syscall, used when initializing the main thread's TLS, returns the tid,
> which is also the pid; to avoid an additional wasteful getpid syscall, this
> value is used for filling in both the tid and pid fields in the TCB. If the
> kernel were changed so that the main thread's tid were different from its
> pid, existing code would experience subtle but serious breakage. I think
> such a change would be contrary to kernel policy.

The only uses which would break are AFAICT:

* tgkill
- Which could be enhanced to accept a tid, find the tgid, and then kill the associated group.

* rt_tgsigqueueinfo
- Same problem.

They can be solved because the expected semantics is that you call these for a thread group, and if you instead pass a tid, it is a one-way translation from thread to thread group. Even with cgroups I don't think this invariant is broken.

Did I miss anything?

> I also disagree with the claim "These functions do operate correctly for
> non-threaded applications." The POSIX requirements for sched_* are
> essentially that they do nothing, even for single-threaded applications:
> 
> "For threads with system scheduling contention scope, the process scheduling
> attributes shall have no effect on the scheduling attributes or behavior
> either of the thread or an underlying kernel scheduling entity dedicated to
> that thread."
> 
> Since Linux only supports system scheduling contention scope, not process
> scheduling contention scope, the above text applies even to the main thread
> of a single-threaded process.

Sorry, let me clarify what I meant to say. With a relatively minor fix we could support PTHREAD_SCOPE_PROCESS for a single-threaded application. The case of inter-thread scheduling degenerates to unspecified so as long as we do nothing for PTHREAD_SCOPE_SYSTEM and do exactly was we do now for PTHREAD_SCOPE_PROCESS.

Does that make sense?

> Even if the process scheduling contention scope were supported, I don't
> think the semantics would match the current Linux sched_* syscalls.

It would certainly involve kernel work. At present Linux doesn't support process scheduling contention scope.

> If it's desirable to support the Process Scheduling option of POSIX, I think
> Linux needs to add new syscalls (perhaps named posix_sched_*), not just
> adjust the target scope of the current ones based on whether the argument is
> a pid or a tid. Short of that, I'm not entirely sure what the appropriate
> course of action for glibc is. I think a good first step would be to
> redefine _POSIX_PRIORITY_SCHEDULING to -1, reflecting that glibc does not
> support the Process Scheduling option even at compile-time. With this macro
> defined as -1, the implementation is not required to provide sched_*
> functions, and if these functions do happen to exist as they do in glibc,
> the implementation is under no obligation to endow them with semantics
> matching the POSIX semantics. This would at least fix the conformance issue
> and make conforming applications aware that they cannot use these functions.

Agreed that that is the best way forward.

Comment 4 Rich Felker 2013-06-17 20:02:31 UTC

On Mon, Jun 17, 2013 at 07:34:51PM +0000, carlos at redhat dot com wrote:
> The only uses which would break are AFAICT:
> 
> * tgkill
> - Which could be enhanced to accept a tid, find the tgid, and then kill the
> associated group.
> 
> * rt_tgsigqueueinfo
> - Same problem.
> 
> They can be solved because the expected semantics is that you call these for a
> thread group, and if you instead pass a tid, it is a one-way translation from
> thread to thread group. Even with cgroups I don't think this invariant is
> broken.
> 
> Did I miss anything?

Yes. getpid() uses the cached pid from the TCB. So if set_tid_address
returned the tid, then getpid() would return the wrong pid. On the
other hand, if set_tid_address returned the pid, and all syscalls that
accept tids would also accept a pid and treat it as the tid of the
initial thread in that process, then things might not be so bad, but I
think there might be risks involved in threads having two distinct
identifiers, especially when the tid is used to mark the owner of a
synchronization object. I'm not sure that there would be problems with
this approach, but I still think it sounds risky.

Comment 5 Carlos O'Donell 2013-06-17 20:44:43 UTC

(In reply to Rich Felker from comment #4)
> On Mon, Jun 17, 2013 at 07:34:51PM +0000, carlos at redhat dot com wrote:
> > The only uses which would break are AFAICT:
> > 
> > * tgkill
> > - Which could be enhanced to accept a tid, find the tgid, and then kill the
> > associated group.
> > 
> > * rt_tgsigqueueinfo
> > - Same problem.
> > 
> > They can be solved because the expected semantics is that you call these for a
> > thread group, and if you instead pass a tid, it is a one-way translation from
> > thread to thread group. Even with cgroups I don't think this invariant is
> > broken.
> > 
> > Did I miss anything?
> 
> Yes. getpid() uses the cached pid from the TCB. So if set_tid_address
> returned the tid, then getpid() would return the wrong pid. 

It's not the wrong pid, it's the identifier of the schedulable object which is the kernel task associated with the process. It's as correct as you can get.

The kernel has always treated pid and tid as the same kind of identifier for schedulable objects, so yes, calling any syscall that accepts pid will accept tid. The Linux kernel makes no difference between the two.

The exception is tgkill and rt_tgsigqueueinfo, both which assert that the id of the thread group leader matches the id passed into the function as the tgid. This is why these two functions need slightly looser semantics.

And this specific code:
kernel/signal.c:
~~~
static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct siginfo *info)
{
        struct task_struct *p;
        int error = -ESRCH;

        rcu_read_lock();
        p = find_task_by_vpid(pid);
        if (p && (tgid <= 0 || task_tgid_vnr(p) == tgid)) {
                error = check_kill_permission(sig, info, p);
~~~
Should use:
~~~
static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct siginfo *info)
{
        struct task_struct *p, *pt;
        int error = -ESRCH;

        rcu_read_lock();
        p = find_task_by_vpid(pid);
        pt = find_task_by_vpid(tgid);
        if (p && (tgid <= 0 || task_tgid_vnr(p) == task_tgid_vnr(pt))) {
                error = check_kill_permission(sig, info, p);
~~~
Thus allowing any schedulable object to be passed in as the tgid,
and that object be used to find the thread group leader for doing
the work.

> On the other hand, if set_tid_address returned the pid, and all syscalls that
> accept tids would also accept a pid and treat it as the tid of the
> initial thread in that process, then things might not be so bad, but I
> think there might be risks involved in threads having two distinct
> identifiers, especially when the tid is used to mark the owner of a
> synchronization object. 

All kernel functions already do accept pid and tid interchangeably.

You can already have several different identifiers for the same task.

e.g.

include/linux/pid.h
~~~
enum pid_type
{
        PIDTYPE_PID,
        PIDTYPE_PGID,
        PIDTYPE_SID,
        PIDTYPE_MAX
};
~~~

> I'm not sure that there would be problems with
> this approach, but I still think it sounds risky.

AFAIK it would work, modulo the fix.

The return of getpid() would be the tid, and it would work.

However, the more I write out the design the uglier it gets.

It's probably better to just add another PIDTYPE_* value to support process scheduling and a new syscall.

Comment 6 Rich Felker 2013-06-17 21:02:25 UTC

On Mon, Jun 17, 2013 at 08:44:43PM +0000, carlos at redhat dot com wrote:
> > > Did I miss anything?
> > 
> > Yes. getpid() uses the cached pid from the TCB. So if set_tid_address
> > returned the tid, then getpid() would return the wrong pid. 
> 
> It's not the wrong pid, it's the identifier of the schedulable object which is
> the kernel task associated with the process. It's as correct as you can get.

No. If fork() returns a value X to a parent process, and the child
process sees its own pid as value Y via getpid(), then the
implementation is non-conforming.

> All kernel functions already do accept pid and tid interchangeably.
> 
> You can already have several different identifiers for the same task.

There is nothing wrong with the kernel _accepting_ different
identifiers for the same task. However standard functions which
_return_ the pid must return consistent values. The change you
proposed for the kernel would break this requirement in existing
static binaries and when using old shared libc versions from before
the kernel change. As far as I know, this goes against kernel
interface stability policy.

Comment 7 Florian Weimer 2014-06-17 12:47:02 UTC

I hate to ask this, but could we use the SETXID mechanism to paper over this missing kernel feature?

Comment 8 Rich Felker 2014-06-17 13:54:22 UTC

No. For at least a couple reasons:

1. SETXID is for a process acting on itself. sched_* can target any PID. Having glibc accept SETXID signals from other processes would impose some serious compatibility issues: the sender and receiver would have to have matching ideas of the protocol. The locking to get it right would also be difficult or impossible, and there may be permissions issues about sending the signals.

2. Changing the scheduling parameters of a process is not the same thing as changing the parameters for each thread of the process. Per POSIX, the sched_* functions are specified to be almost complete NOPs, aside from setting and retrieving values and checking permissions, etc., on systems that do not support a process scheduling scope (Linux is such a system). So making them change all threads would just replace one wrong behavior with another.

Comment 9 Florian Weimer 2018-12-07 12:31:57 UTC

This affects functions such as sched_setscheduler and sched_setparam.

(Hopefully this will make this bug easier to find.)

Comment 10 Florian Weimer 2018-12-07 12:33:07 UTC

*** Bug 15088 has been marked as a duplicate of this bug. ***

Comment 11 Sourceware Commits 2019-02-02 13:17:17 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  c70824b9a4645c0ecd049da8cfdb2c28ae7ada23 (commit)
      from  4dcbbc3b28aaeafe23e1a30db84055aa6f6fa987 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c70824b9a4645c0ecd049da8cfdb2c28ae7ada23

commit c70824b9a4645c0ecd049da8cfdb2c28ae7ada23
Author: Florian Weimer <fweimer@redhat.com>
Date:   Sat Feb 2 14:13:33 2019 +0100

    manual: Document lack of conformance of sched_* functions [BZ #14829]
    
    On Linux, we define _POSIX_PRIORITY_SCHEDULING, but functions such
    as sched_setparam and sched_setscheduler apply to individual threads,
    not processes.
    
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                |   10 +++++
 manual/resource.texi                     |   60 +++++++++++++++++-------------
 sysdeps/unix/sysv/linux/bits/posix_opt.h |    5 ++-
 3 files changed, 48 insertions(+), 27 deletions(-)

Comment 12 Florian Weimer 2019-08-29 14:39:06 UTC

This was fixed with a documentation change in 2.30.