What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?


Community,

This is an expansion of:
http://sourceware.org/bugzilla/show_bug.cgi?id=15630

The glibc functions sched_getaffinity and sched_setaffinity 
have slightly different semantics than the kernel sched_getaffinity 
and sched_setaffinity functions.

The result is that if you boot on a system with more than 1024 
possible cpus and you use a fixed cpu_set_t with sched_getaffinity, 
the call will never succeed and will always return EINVAL. This is
because the kernel API is documented to return EINVAL when the
user buffer is too small to hold the kernel's cpu mask.
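As a concrete illustration (a minimal sketch, not taken from the
bug report), this is the pattern that fails today on such a
machine:

#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  cpu_set_t set;   /* Fixed-size mask: 1024 bits.  */
  CPU_ZERO (&set);
  /* On a kernel whose possible cpu mask is wider than 1024 bits
     this call fails with EINVAL today.  */
  if (sched_getaffinity (0, sizeof (cpu_set_t), &set) == -1)
    {
      printf ("sched_getaffinity: %s\n", strerror (errno));
      return 1;
    }
  printf ("cpus set in the first 1024 bits: %d\n", CPU_COUNT (&set));
  return 0;
}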

Note that the glibc manual page does not document sched_getaffinity
as returning EINVAL.

However, the Linux kernel man pages project documents the interface
as returning EINVAL (because that's what it does).

Therefore some applications may expect the API to return EINVAL when
the kernel's affinity mask is larger than the size of the object 
you passed in to store the mask.

The question for the community is:

(1) Should the call to sched_getaffinity always succeed?

or

(2) Should the call to sched_getaffinity fail if the kernel affinity
    mask is larger than the passed in object?

This is not a hypothetical problem; I am already seeing users 
hit it.

My opinion is that (1) is the correct way forward. The reason I
say this is because the user should not be exposed to the vagaries
of the size of the kernel affinity mask. Worse, relying on the
EINVAL return to scale up the size of the cpu_set_t dynamically
is non-optimal. Users should always be using sysconf.

Siddhesh (paraphrasing here) thinks that either (1) or (2) is
going to cause some group of users to be upset, but (2), being
existing behaviour, is more conservative (though it involves telling
users "Yes, using cpu_set_t might actually fail even if you
didn't expect it.").

Let us talk more about the API differences and what can be done
in glibc to mitigate the problem.

The most important difference is that if you call either of
the kernel routines with a cpusetsize that is smaller than the 
kernel's possible cpu mask size the kernel syscall returns EINVAL. 
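To see the difference, compare against the raw syscall (a sketch;
the claim that the raw syscall returns the number of mask bytes it
copied on success is my reading of current kernels, not something
from this thread):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int
main (void)
{
  cpu_set_t set;
  /* The raw syscall fails with EINVAL if cpusetsize is smaller
     than the kernel's cpu mask, and on success returns the number
     of bytes of mask it copied.  The glibc wrapper hides that
     byte count and returns 0 instead.  */
  long ret = syscall (SYS_sched_getaffinity, 0, sizeof (cpu_set_t), &set);
  if (ret == -1)
    printf ("raw syscall failed: errno %d%s\n", errno,
            errno == EINVAL ? " (EINVAL)" : "");
  else
    printf ("kernel copied %ld bytes of mask\n", ret);
  return 0;
}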

The kernel previously did accounting based on the configured 
maximum rather than possible cpus, leading to problems if you'd
simply compiled with NR_CPUS > 1024 instead of actually booting
on a system where the low-level firmware detected > 1024
possible CPUs.

There are 3 ways to determine the correct size of the possible
cpu mask size:

(a) Read it from sysfs /sys/devices/system/cpu/online, which 
has the actual number of possibly online cpus.

(b) Interpret /proc/cpuinfo or /proc/stat.

(c) Call the kernel syscall sched_getaffinity with increasingly
larger values for cpusetsize in an attempt to determine the cpu
mask size manually (see the sketch below).
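
A rough sketch of method (c) using the glibc dynamic cpu_set_t
macros (the starting count of 1024 and the upper bound are
arbitrary choices for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>
#include <stdlib.h>

/* Probe the kernel cpu mask size by retrying with larger buffers
   until the syscall stops failing with EINVAL.  Returns a mask
   sized to the kernel's view (caller frees it with CPU_FREE), or
   NULL on failure; *countp receives the cpu count that worked.  */
static cpu_set_t *
probe_kernel_mask (int *countp)
{
  for (int count = 1024; count <= (1 << 20); count *= 2)
    {
      cpu_set_t *set = CPU_ALLOC (count);
      if (set == NULL)
        return NULL;
      size_t size = CPU_ALLOC_SIZE (count);
      CPU_ZERO_S (size, set);
      if (sched_getaffinity (0, size, set) == 0)
        {
          *countp = count;
          return set;
        }
      CPU_FREE (set);
      if (errno != EINVAL)
        return NULL;
    }
  return NULL;
}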

Methods (a) and (b) are already used by sysconf(_SC_NPROCESSORS_ONLN)
to determine the value to return.

Method (c) is used by sched_setaffinity to determine the size 
of the kernel mask; it then rejects any bits which are set outside
of that mask and returns EINVAL.

Method (c) is recommended by a patched RHEL man page [1] for 
sched_getaffinity, but that patch has not made it upstream to 
the Linux Kernel man pages project.

In solution (1) the goal is to make using a fixed cpu_set_t work
at all times, but only support the first 1024 cpus. To support 
more than 1024 cpus you need to use the dynamically sized 
macros and method (a) (if you want all the cpus).
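
In other words, the supported pattern for more than 1024 cpus
would look roughly like this (a sketch; it assumes that
sysconf (_SC_NPROCESSORS_CONF) reports a count covering the
kernel's possible cpus):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

int
main (void)
{
  long count = sysconf (_SC_NPROCESSORS_CONF);
  if (count < 1)
    return 1;
  /* Size the mask from the configured cpu count instead of
     relying on the fixed 1024-bit cpu_set_t.  */
  cpu_set_t *set = CPU_ALLOC (count);
  if (set == NULL)
    return 1;
  size_t size = CPU_ALLOC_SIZE (count);
  CPU_ZERO_S (size, set);
  int ret = sched_getaffinity (0, size, set);
  CPU_FREE (set);
  return ret == 0 ? 0 : 1;
}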

In order to make a fixed cpu_set_t size work all the time the
following changes need to be made to glibc:

(s1) Enhance sysconf(_SC_NPROCESSORS_ONLN) to additionally use 
method (c) as a last resort to determine the number of online 
cpus. In addition sysconf should cache the value for the 
lifetime of the process. The code in sysconf should be the 
only place we cache the value (currently we also cache it 
in sched_setaffinity).
- Can be done as a distinct cleanup step.

(s2) Cleanup sched_setaffinity to call sysconf to determine 
the number of online cpus and use that to check if the 
incoming bitmask is valid. Additionally if possible we 
should check for non-zero entries a long at a time instead 
of a byte at a time.
- Can be done as a distinct cleanup step.
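
The long-at-a-time check could look something like this
(hypothetical helper, not glibc's actual internal code): reject a
request if any bit is set at or above the cpu count we trust.

#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof (unsigned long) * CHAR_BIT)

/* Return nonzero if MASK (NLONGS words) has any bit set at or
   above CPU_COUNT, scanning a word at a time.  */
static int
bits_beyond_cpu_count (const unsigned long *mask, size_t nlongs,
                       unsigned int cpu_count)
{
  size_t full = cpu_count / BITS_PER_LONG;
  unsigned int rem = cpu_count % BITS_PER_LONG;

  /* Invalid high bits in the partially valid word, if any.  */
  if (rem != 0 && full < nlongs
      && (mask[full] & ~((1UL << rem) - 1)) != 0)
    return 1;

  /* Words that lie entirely above the cpu count.  */
  for (size_t i = full + (rem != 0); i < nlongs; i++)
    if (mask[i] != 0)
      return 1;

  return 0;
}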

(s3) Fix sched_getaffinity and have it call sysconf to 
determine the number of online cpus and use that to get 
the kernel cpu mask affinity values, copying back the 
minimum of the sizes, either user or kernel, and zeroing 
the rest. This call should never fail.
- This is the real fix.
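
The copy-out step of (s3) would look something like this (a
sketch, assuming a kernel-sized buffer already filled by the
syscall):

#include <string.h>

/* Copy the smaller of the user and kernel sizes and zero any
   remaining tail of the user buffer, so the call succeeds no
   matter how large the kernel's cpu mask is.  */
static void
copy_affinity_result (void *user_buf, size_t user_size,
                      const void *kernel_buf, size_t kernel_size)
{
  size_t copy = user_size < kernel_size ? user_size : kernel_size;
  memcpy (user_buf, kernel_buf, copy);
  if (copy < user_size)
    memset ((char *) user_buf + copy, 0, user_size - copy);
}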

Static applications can't easily be fixed to work around 
this problem. The only solution there is to have the kernel 
stop returning EINVAL and instead do what glibc does which 
is to copy only the part of the buffer that the user requested. 
However, doing that would break existing glibc's which rely 
on EINVAL to compute the mask size. Therefore changing the 
kernel semantics is not a good solution (except on a 
system-by-system basis in the extreme case where a single 
static application was being supported).

Step (s3) ensures that using a fixed cpu_set_t size works 
when you are booted on hardware that has more than 1024 
possible cpus.

Unfortunately it breaks the recommended pattern of using 
sched_getaffinity and looking for EINVAL to determine the 
size of the mask, but this was never a method that glibc 
documented or supported. The patched man page uses a 
starting buffer size of 1024, so at least such a pattern 
would still allow access to the first 1024 cpus. It is 
strongly recommended that users use sysconf to determine 
the number of possible cpus. That patch to the Linux kernel 
man pages is being removed in future versions of RHEL.

In summary:

Before doing (s3) we should make a decision on (1) vs. (2).

Cheers,
Carlos.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=974679

