This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: GNU C Library <libc-alpha at sourceware dot org>, Siddhesh Poyarekar <siddhesh at redhat dot com>, Roland McGrath <roland at hack dot frob dot com>
- Date: Mon, 15 Jul 2013 13:06:06 -0400
- Subject: What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?
Community,
This is an expansion of:
http://sourceware.org/bugzilla/show_bug.cgi?id=15630
The glibc functions sched_getaffinity and sched_setaffinity
have slightly different semantics than the kernel sched_getaffinity
and sched_setaffinity functions.
The result is that if you boot on a system with more than 1024
possible cpus and use a fixed cpu_set_t with sched_getaffinity,
the call will never succeed and will always return EINVAL. This is
because the kernel API documents that it will return EINVAL if the
user memory is too small for the mask size.
Note that the glibc manual page does not document sched_getaffinity
as returning EINVAL.
However, the Linux kernel man pages project documents the interface
as returning EINVAL (because that's what it does).
Therefore some applications may expect the API to return EINVAL when
the kernel's affinity mask is larger than the size of the object
you passed in to store the mask.
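As a concrete illustration (a minimal sketch of my own, not code
from the bug report), a fixed-size cpu_set_t call looks like this;
on a kernel whose possible-cpu mask is wider than 1024 bits the
call fails with EINVAL:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Query the calling thread's affinity into a fixed 1024-bit set.
   Returns 0 on success, or the errno value on failure (EINVAL when
   the kernel's possible-cpu mask is wider than cpu_set_t). */
int get_affinity_fixed(void)
{
    cpu_set_t set;              /* fixed at 1024 bits (128 bytes) */
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == -1)
        return errno;
    return 0;
}
```

On a machine with at most 1024 possible cpus this returns 0; boot
the same binary on larger hardware and it returns EINVAL instead.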
The question for the community is:
(1) Should the call to sched_getaffinity always succeed?
or
(2) Should the call to sched_getaffinity fail if the kernel affinity
mask is larger than the passed in object?
This is not a hypothetical problem; I am already seeing users
hit it.
My opinion is that (1) is the correct way forward. The reason I
say this is that the user should not be exposed to the vagaries
of the size of the kernel affinity mask. Worse, it is non-optimal
to use the EINVAL return to scale up the size of the cpu_set_t
dynamically. Users should always be using sysconf.
Siddhesh (paraphrasing here) thinks that either (1) or (2) is
going to cause some group of users to be upset, but (2), being
the existing behaviour, is more conservative (though it involves
telling users that "Yes, using cpu_set_t might actually fail even
if you didn't expect it").
Let us talk more about the API differences and what can be done
in glibc to mitigate the problem.
The most important difference is that if you call either of
the kernel routines with a cpusetsize that is smaller than the
kernel's possible cpu mask size the kernel syscall returns EINVAL.
The kernel previously did accounting based on the configured
maximum rather than possible cpus, leading to problems if you'd
simply compiled with NR_CPUS > 1024 instead of actually booting
on a system where the low-level firmware detected > 1024
possible CPUs.
There are 3 ways to determine the correct size of the possible
cpu mask size:
(a) Read it from sysfs /sys/devices/system/cpu/online, which
has the actual number of possibly online cpus.
(b) Interpret /proc/cpuinfo or /proc/stat.
(c) Call the kernel syscall sched_getaffinity with increasingly
larger values for cpusetsize in an attempt to manually
determine the cpu mask size.
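Method (c) might be sketched like this (illustrative only, not
glibc's actual code; the function name and the 64 KiB upper bound
are my own), doubling the buffer until the raw syscall stops
failing with EINVAL:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Probe the kernel's cpu mask size by calling the raw
   sched_getaffinity syscall with doubling buffer lengths.
   On success the kernel returns the number of bytes it copied,
   i.e. the size of its possible-cpu mask.  Returns that size in
   bytes, or -1 on an unexpected error. */
long probe_kernel_mask_size(void)
{
    for (size_t len = 128; len <= 65536; len *= 2) {
        unsigned char *buf = calloc(1, len);
        if (buf == NULL)
            return -1;
        long ret = syscall(SYS_sched_getaffinity, 0, len, buf);
        free(buf);
        if (ret >= 0)
            return ret;         /* kernel mask size in bytes */
        if (errno != EINVAL)
            return -1;          /* give up on any other error */
    }
    return -1;
}
```

Note that the starting size of 128 bytes matches the 1024-bit
cpu_set_t, so the probe only ever grows past it on large systems.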
Methods (a) and (b) are already used by sysconf(_SC_NPROCESSORS_ONLN)
to determine the value to return.
Method (c) is used by sched_setaffinity to determine the size
of the kernel mask and then reject any bits which are set outside
of the mask and return EINVAL.
Method (c) is recommended by a patched RHEL man page [1] for
sched_getaffinity, but that patch has not made it upstream to
the Linux Kernel man pages project.
In solution (1) the goal is to make using a fixed cpu_set_t work
at all times, but only support the first 1024 cpus. To support
more than 1024 cpus you need to use the dynamically sized
macros and method (a) (if you want all the cpus).
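For reference, the dynamically sized interface in solution (1)
would be used roughly like this (a sketch built on the existing
CPU_ALLOC family and sysconf; count_affine_cpus is my own name):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Count the cpus in the caller's affinity mask using a cpu_set_t
   sized from the configured cpu count rather than the fixed
   1024-bit type.  Returns the count, or -1 on error. */
int count_affine_cpus(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_CONF);
    if (ncpus < 1)
        return -1;
    cpu_set_t *set = CPU_ALLOC(ncpus);
    if (set == NULL)
        return -1;
    size_t size = CPU_ALLOC_SIZE(ncpus);
    int count = -1;
    if (sched_getaffinity(0, size, set) == 0)
        count = CPU_COUNT_S(size, set);
    CPU_FREE(set);
    return count;
}
```

Because the set is sized from sysconf, the same binary keeps
working when booted on hardware with more than 1024 possible cpus.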
In order to make a fixed cpu_set_t size work all the time the
following changes need to be made to glibc:
(s1) Enhance sysconf(_SC_NPROCESSORS_ONLN) to additionally use
method (c) as a last resort to determine the number of online
cpus. In addition sysconf should cache the value for the
lifetime of the process. The code in sysconf should be the
only place we cache the value (currently we also cache it
in sched_setaffinity).
- Can be done as a distinct cleanup step.
(s2) Cleanup sched_setaffinity to call sysconf to determine
the number of online cpus and use that to check if the
incoming bitmask is valid. Additionally if possible we
should check for non-zero entries a long at a time instead
of a byte at a time.
- Can be done as a distinct cleanup step.
(s3) Fix sched_getaffinity and have it call sysconf to
determine the number of online cpus and use that to get
the kernel cpu mask affinity values, copying back the
minimum of the sizes, either user or kernel, and zeroing
the rest. This call should never fail.
- This is the real fix.
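The (s3) semantics could be sketched as follows (illustrative
only, not the proposed glibc patch; the fixed 8 KiB kernel-mask
bound is an assumption standing in for the sysconf-derived, cached
value): fetch the full kernel mask into a large enough temporary
buffer, copy back only as many bytes as the caller provided, and
zero any remainder, so the call never fails with EINVAL.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Proposed wrapper semantics: never fail because the caller's
   buffer is smaller than the kernel's possible-cpu mask.
   Returns 0 on success, -1 on error (with errno set). */
int getaffinity_never_einval(pid_t pid, size_t usersize, void *userset)
{
    size_t ksize = 8192;        /* assumed upper bound on the kernel
                                   mask size; glibc would derive and
                                   cache this via sysconf */
    unsigned char *kbuf = calloc(1, ksize);
    if (kbuf == NULL)
        return -1;
    long ret = syscall(SYS_sched_getaffinity, pid, ksize, kbuf);
    if (ret < 0) {
        free(kbuf);
        return -1;
    }
    /* Copy back the minimum of the user and kernel sizes ... */
    size_t copy = usersize < (size_t) ret ? usersize : (size_t) ret;
    memcpy(userset, kbuf, copy);
    /* ... and zero whatever tail the kernel did not fill. */
    if (usersize > copy)
        memset((unsigned char *) userset + copy, 0, usersize - copy);
    free(kbuf);
    return 0;
}
```

A caller passing a too-small buffer simply sees a truncated view
of the mask (the first 1024 cpus for a fixed cpu_set_t) instead of
an EINVAL failure.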
Static applications can't easily be fixed to work around
this problem. The only solution there is to have the kernel
stop returning EINVAL and instead do what glibc does which
is to copy only the part of the buffer that the user requested.
However, doing that would break existing glibc's which rely
on EINVAL to compute the mask size. Therefore changing the
kernel semantics are not a good solution (except on a
system-by-system basis in the extreme case where a single
static application was being supported).
Step (s3) ensures that using a fixed cpu_set_t size works
when you are booted on hardware that has more than 1024
possible cpus.
Unfortunately it breaks the recommended pattern of using
sched_getaffinity and looking for EINVAL to determine the
size of the mask, but this was never a method that glibc
documented or supported. The patched man page has the
starting buffer size of 1024, so at least such a pattern
would allow access to the first 1024 cpus. It is strongly
recommended that users use sysconf to determine the number
of possible cpus. This patch to the Linux kernel man pages
has been dropped from later versions of RHEL.
In summary:
Before doing (s3) we should make a decision on (1) vs. (2).
Cheers,
Carlos.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=974679