Per POSIX (XSH 2.8.4 Process Scheduling, subheading Scheduling Policies): "For threads with system scheduling contention scope, the process scheduling attributes shall have no effect on the scheduling attributes or behavior either of the thread or an underlying kernel scheduling entity dedicated to that thread." Linux does not seem to support process-level scheduling attributes at all; glibc/NPTL only accept system scheduling contention scope, and returns ENOTSUP when an attempt is made to create a thread with process contention scope. As such, my reading of the above text is that the sched_* functions should be mostly/entirely no-ops. Even if this weren't the case, it's certainly incorrect for them to modify the scheduling attributes of the single thread whose TID happens to be equal to the PID. I'm unsure what the best way to resolve this issue against application expectations is, but the current behavior is non-conforming and does the wrong thing in multi-threaded applications trying to use these functions for their standard purposes.
I agree that this a mess in Linux right now. I agree that most of the sched_* functions should actually do nothing in glibc, principally because process level contention scheduling is not supported. At present these functions change the scheduling for the thread leader whose tid == tgid and which is returned from the call to getpid(). I agree that this is wrong according to POSIX. Thus the functions do not operate as required by POSIX for threaded applications These functions do operate correctly for non-threaded applications. One solution is to version the interfaces, and have all of the new version do nothing if threads are active. When Linux gets proper process scheduling support we can stop doing nothing and do the right sequence of operations to set the pid's (or tgid's) scheduling parameters. Linux needs to be enhanced to do the following: - getpid returns pid - gettid returns tid != pid All processes whould start off with a unique tgid that can be used as the target of these calls and that is returned via getpid, and that can't be confused with the thread leader. That would allow these calls to identify the tgid uqiquely.
Carlos, there are major issues with your proposed solution. I know this one applies to musl and I believe it also applies to glibc: the set_tid_address syscall, used when initializing the main thread's TLS, returns the tid, which is also the pid; to avoid an additional wasteful getpid syscall, this value is used for filling in both the tid and pid fields in the TCB. If the kernel were changed so that the main thread's tid were different from its pid, existing code would experience subtle but serious breakage. I think such a change would be contrary to kernel policy. I also disagree with the claim "These functions do operate correctly for non-threaded applications." The POSIX requirements for sched_* are essentially that they do nothing, even for single-threaded applications: "For threads with system scheduling contention scope, the process scheduling attributes shall have no effect on the scheduling attributes or behavior either of the thread or an underlying kernel scheduling entity dedicated to that thread." Since Linux only supports system scheduling contention scope, not process scheduling contention scope, the above text applies even to the main thread of a single-threaded process. Even if the process scheduling contention scope were supported, I don't think the semantics would match the current Linux sched_* syscalls. If it's desirable to support the Process Scheduling option of POSIX, I think Linux needs to add new syscalls (perhaps named posix_sched_*), not just adjust the target scope of the current ones based on whether the argument is a pid or a tid. Short of that, I'm not entirely sure what the appropriate course of action for glibc is. I think a good first step would be to redefine _POSIX_PRIORITY_SCHEDULING to -1, reflecting that glibc does not support the Process Scheduling option even at compile-time. With this macro defined as -1, the implementation is not required to provide sched_* functions, and if these functions do happen to exist as they do in glibc, the implementation is under no obligation to endow them with semantics matching the POSIX semantics. This would at least fix the conformance issue and make conforming applications aware that they cannot use these functions.
(In reply to Rich Felker from comment #2) > Carlos, there are major issues with your proposed solution. I know this one > applies to musl and I believe it also applies to glibc: the set_tid_address > syscall, used when initializing the main thread's TLS, returns the tid, > which is also the pid; to avoid an additional wasteful getpid syscall, this > value is used for filling in both the tid and pid fields in the TCB. If the > kernel were changed so that the main thread's tid were different from its > pid, existing code would experience subtle but serious breakage. I think > such a change would be contrary to kernel policy. The only uses which would break are AFAICT: * tgkill - Which could be enhanced to accept a tid, find the tgid, and then kill the associated group. * rt_tgsigqueueinfo - Same problem. They can be solved because the expected semantics is that you call these for a thread group, and if you instead pass a tid, it is a one-way translation from thread to thread group. Even with cgroups I don't think this invariant is broken. Did I miss anything? > I also disagree with the claim "These functions do operate correctly for > non-threaded applications." The POSIX requirements for sched_* are > essentially that they do nothing, even for single-threaded applications: > > "For threads with system scheduling contention scope, the process scheduling > attributes shall have no effect on the scheduling attributes or behavior > either of the thread or an underlying kernel scheduling entity dedicated to > that thread." > > Since Linux only supports system scheduling contention scope, not process > scheduling contention scope, the above text applies even to the main thread > of a single-threaded process. Sorry, let me clarify what I meant to say. With a relatively minor fix we could support PTHREAD_SCOPE_PROCESS for a single-threaded application. The case of inter-thread scheduling degenerates to unspecified so as long as we do nothing for PTHREAD_SCOPE_SYSTEM and do exactly was we do now for PTHREAD_SCOPE_PROCESS. Does that make sense? > Even if the process scheduling contention scope were supported, I don't > think the semantics would match the current Linux sched_* syscalls. It would certainly involve kernel work. At present Linux doesn't support process scheduling contention scope. > If it's desirable to support the Process Scheduling option of POSIX, I think > Linux needs to add new syscalls (perhaps named posix_sched_*), not just > adjust the target scope of the current ones based on whether the argument is > a pid or a tid. Short of that, I'm not entirely sure what the appropriate > course of action for glibc is. I think a good first step would be to > redefine _POSIX_PRIORITY_SCHEDULING to -1, reflecting that glibc does not > support the Process Scheduling option even at compile-time. With this macro > defined as -1, the implementation is not required to provide sched_* > functions, and if these functions do happen to exist as they do in glibc, > the implementation is under no obligation to endow them with semantics > matching the POSIX semantics. This would at least fix the conformance issue > and make conforming applications aware that they cannot use these functions. Agreed that that is the best way forward.
On Mon, Jun 17, 2013 at 07:34:51PM +0000, carlos at redhat dot com wrote: > The only uses which would break are AFAICT: > > * tgkill > - Which could be enhanced to accept a tid, find the tgid, and then kill the > associated group. > > * rt_tgsigqueueinfo > - Same problem. > > They can be solved because the expected semantics is that you call these for a > thread group, and if you instead pass a tid, it is a one-way translation from > thread to thread group. Even with cgroups I don't think this invariant is > broken. > > Did I miss anything? Yes. getpid() uses the cached pid from the TCB. So if set_tid_address returned the tid, then getpid() would return the wrong pid. On the other hand, if set_tid_address returned the pid, and all syscalls that accept tids would also accept a pid and treat it as the tid of the initial thread in that process, then things might not be so bad, but I think there might be risks involved in threads having two distinct identifiers, especially when the tid is used to mark the owner of a synchronization object. I'm not sure that there would be problems with this approach, but I still think it sounds risky.
(In reply to Rich Felker from comment #4) > On Mon, Jun 17, 2013 at 07:34:51PM +0000, carlos at redhat dot com wrote: > > The only uses which would break are AFAICT: > > > > * tgkill > > - Which could be enhanced to accept a tid, find the tgid, and then kill the > > associated group. > > > > * rt_tgsigqueueinfo > > - Same problem. > > > > They can be solved because the expected semantics is that you call these for a > > thread group, and if you instead pass a tid, it is a one-way translation from > > thread to thread group. Even with cgroups I don't think this invariant is > > broken. > > > > Did I miss anything? > > Yes. getpid() uses the cached pid from the TCB. So if set_tid_address > returned the tid, then getpid() would return the wrong pid. It's not the wrong pid, it's the identifier of the schedulable object which is the kernel task associated with the process. It's as correct as you can get. The kernel has always treated pid and tid as the same kind of identifier for schedulable objects, so yes, calling any syscall that accepts pid will accept tid. The Linux kernel makes no difference between the two. The exception is tgkill and rt_tgsigqueueinfo, both which assert that the id of the thread group leader matches the id passed into the function as the tgid. This is why these two functions need slightly looser semantics. And this specific code: kernel/signal.c: ~~~ static int do_send_specific(pid_t tgid, pid_t pid, int sig, struct siginfo *info) { struct task_struct *p; int error = -ESRCH; rcu_read_lock(); p = find_task_by_vpid(pid); if (p && (tgid <= 0 || task_tgid_vnr(p) == tgid)) { error = check_kill_permission(sig, info, p); ~~~ Should use: ~~~ static int do_send_specific(pid_t tgid, pid_t pid, int sig, struct siginfo *info) { struct task_struct *p, *pt; int error = -ESRCH; rcu_read_lock(); p = find_task_by_vpid(pid); pt = find_task_by_vpid(tgid); if (p && (tgid <= 0 || task_tgid_vnr(p) == task_tgid_vnr(pt))) { error = check_kill_permission(sig, info, p); ~~~ Thus allowing any schedulable object to be passed in as the tgid, and that object be used to find the thread group leader for doing the work. > On the other hand, if set_tid_address returned the pid, and all syscalls that > accept tids would also accept a pid and treat it as the tid of the > initial thread in that process, then things might not be so bad, but I > think there might be risks involved in threads having two distinct > identifiers, especially when the tid is used to mark the owner of a > synchronization object. All kernel functions already do accept pid and tid interchangeably. You can already have several different identifiers for the same task. e.g. include/linux/pid.h ~~~ enum pid_type { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX }; ~~~ > I'm not sure that there would be problems with > this approach, but I still think it sounds risky. AFAIK it would work, modulo the fix. The return of getpid() would be the tid, and it would work. However, the more I write out the design the uglier it gets. It's probably better to just add another PIDTYPE_* value to support process scheduling and a new syscall.
On Mon, Jun 17, 2013 at 08:44:43PM +0000, carlos at redhat dot com wrote: > > > Did I miss anything? > > > > Yes. getpid() uses the cached pid from the TCB. So if set_tid_address > > returned the tid, then getpid() would return the wrong pid. > > It's not the wrong pid, it's the identifier of the schedulable object which is > the kernel task associated with the process. It's as correct as you can get. No. If fork() returns a value X to a parent process, and the child process sees its own pid as value Y via getpid(), then the implementation is non-conforming. > All kernel functions already do accept pid and tid interchangeably. > > You can already have several different identifiers for the same task. There is nothing wrong with the kernel _accepting_ different identifiers for the same task. However standard functions which _return_ the pid must return consistent values. The change you proposed for the kernel would break this requirement in existing static binaries and when using old shared libc versions from before the kernel change. As far as I know, this goes against kernel interface stability policy.
I hate to ask this, but could we use the SETXID mechanism to paper over this missing kernel feature?
No. For at least a couple reasons: 1. SETXID is for a process acting on itself. sched_* can target any PID. Having glibc accept SETXID signals from other processes would impose some serious compatibility issues: the sender and receiver would have to have matching ideas of the protocol. The locking to get it right would also be difficult or impossible, and there may be permissions issues about sending the signals. 2. Changing the scheduling parameters of a process is not the same thing as changing the parameters for each thread of the process. Per POSIX, the sched_* functions are specified to be almost complete NOPs, aside from setting and retrieving values and checking permissions, etc., on systems that do not support a process scheduling scope (Linux is such a system). So making them change all threads would just replace one wrong behavior with another.
This affects functions such as sched_setscheduler and sched_setparam. (Hopefully this will make this bug easier to find.)
*** Bug 15088 has been marked as a duplicate of this bug. ***
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, master has been updated via c70824b9a4645c0ecd049da8cfdb2c28ae7ada23 (commit) from 4dcbbc3b28aaeafe23e1a30db84055aa6f6fa987 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c70824b9a4645c0ecd049da8cfdb2c28ae7ada23 commit c70824b9a4645c0ecd049da8cfdb2c28ae7ada23 Author: Florian Weimer <fweimer@redhat.com> Date: Sat Feb 2 14:13:33 2019 +0100 manual: Document lack of conformance of sched_* functions [BZ #14829] On Linux, we define _POSIX_PRIORITY_SCHEDULING, but functions such as sched_setparam and sched_setscheduler apply to individual threads, not processes. Reviewed-by: Carlos O'Donell <carlos@redhat.com> ----------------------------------------------------------------------- Summary of changes: ChangeLog | 10 +++++ manual/resource.texi | 60 +++++++++++++++++------------- sysdeps/unix/sysv/linux/bits/posix_opt.h | 5 ++- 3 files changed, 48 insertions(+), 27 deletions(-)
This was fixed with a documentation change in 2.30.