Bug 15630

Summary: Fix use of cpu_set_t with sched_getaffinity when booted on a system with more than 1024 possible cpus.
Product: glibc Reporter: Carlos O'Donell <carlos>
Component: libcAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED MOVED    
Severity: normal CC: drepper.fsp, fweimer
Priority: P2 Flags: fweimer: security-
Version: 2.18   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Attachments: Test case which reflects the truncation assumption (does not currently work)

Description Carlos O'Donell 2013-06-14 19:31:24 UTC
The glibc functions sched_getaffinity and sched_setaffinity have slightly different semantics than the kernel sched_getaffinity and sched_setaffinity functions.

The result is that if you boot in a system with more than 1024 possible cpus, and you use a fixed cpu_set_t with sched_getaffinity, the call will never succeed and will always return EINVAL. The glibc manual page does not document sched_getaffinity as returning EINVAL. The call to sched_getaffinity should always succeed.

This is not a hypothetical problem, I am already seeing users with this problem.

Let us talk more about the API differences and what can be done in glibc to mitigate the problem.

The most important difference is that if you call either of the kernel routines with a cpusetsize that is smaller than the kernel's possible cpu mask size the kernel routines return EINVAL. The kernel previously did accounting based on the configured maximum rather than possible cpus, leading to problems if you'd simply compiled with NR_CPUS > 1024 instead of actually booting on a system where the low-level firmware detected > 1024 possible CPUs.

There are 3 ways to determine the correct size of the possible cpu mask size:

(a) Read it from sysfs /sys/devices/system/cpu/online, which has the actual number of possibly online cpus.

(b) Interpret /proc/cpuinfo or /proc/stat.

(c) Call the kernel syscall sched_getaffinity with increasingly larger values for cpusetsize in an attempt to manually determine the cpu mask size.

Methods (a) and (b) are already used by sysconf(_SC_PROCESSORS_ONLN) to determine the value to return.

Method (c) is used by sched_setaffinity to determine the size of the kernel mask and then reject any bits which are set outside of the mask and return EINVAL.

Method (c) is recommended by a patched RHEL man page [1] for sched_getaffinity, but that patch has not made it upstream to the Linux Kernel man pages project.

The goal is therefore to make using a fixed cpu_set_t work at all times, but only support the first 1024 cpus. To support more than 1024 cpus you need to use the dynamically sized macros and method (a) (if you want all the cpus).

In order to make a fixed cpu_set_t size work all the time the following changes need to be made to glibc:

(1) Enhance sysconf(_SC_PROCESSORS_ONLN) to additionally use method (c) as a last resort to determine the number of online cpus. In addition sysconf should cache the value for the lifetime of the process. The code in sysconf should be the only place we cache the value (currently we also cache it in sched_setaffinity).

(2) Cleanup sched_setaffinity to call sysconf to determine the number of online cpus and use that to check if the incoming bitmask is valid. Additionally if possible we should check for non-zero entries a long at a time instead of a byte at a time.

(3) Fix sched_getaffinity and have it call sysconf to determine the number of online cpus and use that to get the kernel cpu mask affinity values, copying back the minimum of the sizes, either user or kernel, and zeroing the rest. This call should never fail.

Static applications can't easily be fixed to work around this problem. The only solution there is to have the kernel stop returning EINVAL and instead do what glibc does which is to copy only the part of the buffer that the user requested. However, doing that would break existing glibc's which rely on EINVAL to compute the mask size. Therefore changing the kernel semantics are not a good solution (except on a system-by-system basis in the extreme case where a single static application was being supported).

Step (3) ensures that using a fixed cpu_set_t size works when you are booted on hardware that has more than 1024 possible cpus.

Unfortunately it breaks the recommended pattern of using sched_getaffinity and looking for EINVAL to determine the size of the mask, but this was never a method that glibc documented or supported. The patched man page has the starting buffer size of 1024, so at least such a pattern would allow access to the first 1024 cpus. It is strongly recommended that users use sysconf to determine the number of possible cpus.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=974679
Comment 1 Carlos O'Donell 2013-07-23 23:12:29 UTC
Also needing fixing are pthread_getaffinity_np and pthread_setaffinty_np.
Comment 2 Florian Weimer 2015-05-18 15:55:57 UTC
Maybe addressing this should also fix this design issue with the cpu_set_t functions:

CPU_ALLOC returns a set which is capable of storing more CPUs than the user requested because the request is rounded up to the next multiple of sizeof (long) * CHAR_BIT CPUs.  This is reflected in the return value of CPU_ALLOC_SIZE.  As a result, the sched_getaffinity call in

  set = CPU_ALLOC (count);
  sched_getaffinity (0, CPU_ALLOC_SIZE (count), set);

can succeed even if count is smaller than the maximum number of relevant CPUs.  This defeats the purpose of the EINVAL error return code in case the specified CPU set is not large enough.
Comment 3 Florian Weimer 2015-05-18 18:37:30 UTC
Additional details:

Sadly, sysconf(_SC_PROCESSORS_ONLN) not the number of CPUs which is relevant to the sched_getaffinity system call.  I have system which reports _NPROCESSORS_ONLN and _NPROCESSORS_CONF as 40 (and /proc/cpuinfo and /proc/stat match that), yet calling sched_getaffinity with small arguments fails:

[pid  3420] sched_getaffinity(0, 8, 0x146b010) = -1 EINVAL (Invalid argument)
[pid  3420] sched_getaffinity(0, 16, 0x146b010) = -1 EINVAL (Invalid argument)
[pid  3420] sched_getaffinity(0, 32, {ffffffffff, 0, 0, 0}) = 32

The kernel seems to operate with a nr_cpu_ids value of 240:

kernel: setup_percpu: NR_CPUS:5120 nr_cpumask_bits:240 nr_cpu_ids:240 nr_node_ids:2
kernel:         RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=240.

nr_cpu_ids is not directly exposed to user space, I think, so the EINVAL behavior is pretty much required.

If other systems have such inflated nr_cpu_ids values, this could become an issue well before systems with 1024 hardware threads become common.

My laptop has nr_cpu_ids=8 (4 hardware threads), another server has nr_cpu_ids=144, with the same number of hardware threads.  I don't think it's prudent to assume that nr_cpu_ids value remains constant after boot, either.

So I wonder if the premise of this bug report (we can get rid of the EINVAL return value and truncate results) is correct.  There is no other way to obtain the magic constant, and even if you grab the number from somewhere, it may be racy with regards to system reconfiguration.
Comment 4 Florian Weimer 2015-05-18 18:59:23 UTC
A potential solution would accept smaller size arguments for sched_getaffinity, as long as the other bits on the kernel side a cleared.  Tools such as taskset could then restrict the CPU set to the supported CPUs for legacy applications.

This is what we do with RLIMIT_NOFILEā€”it is still 1024 on many systems to prevent issues with the select function and the default FD_SETSIZE value.  Unfortunately, it is somewhat more restrictive.
Comment 5 Florian Weimer 2015-05-18 19:14:54 UTC
Created attachment 8322 [details]
Test case which reflects the truncation assumption (does not currently work)
Comment 6 Carlos O'Donell 2015-05-20 02:54:31 UTC
(In reply to Florian Weimer from comment #2)
> Maybe addressing this should also fix this design issue with the cpu_set_t
> functions:
> 
> CPU_ALLOC returns a set which is capable of storing more CPUs than the user
> requested because the request is rounded up to the next multiple of sizeof
> (long) * CHAR_BIT CPUs.  This is reflected in the return value of
> CPU_ALLOC_SIZE.  As a result, the sched_getaffinity call in
> 
>   set = CPU_ALLOC (count);
>   sched_getaffinity (0, CPU_ALLOC_SIZE (count), set);
> 
> can succeed even if count is smaller than the maximum number of relevant
> CPUs.  This defeats the purpose of the EINVAL error return code in case the
> specified CPU set is not large enough.

Agreed. That is a flaw. The allocation should certainly be large enough to hold all possible CPUs ever (nr_cpumask_bits in kernel speak).

(In reply to Florian Weimer from comment #3)
> Additional details:
> 
> Sadly, sysconf(_SC_PROCESSORS_ONLN) not the number of CPUs which is relevant
> to the sched_getaffinity system call.  I have system which reports
> _NPROCESSORS_ONLN and _NPROCESSORS_CONF as 40 (and /proc/cpuinfo and
> /proc/stat match that), yet calling sched_getaffinity with small arguments
> fails:
> 
> [pid  3420] sched_getaffinity(0, 8, 0x146b010) = -1 EINVAL (Invalid argument)
> [pid  3420] sched_getaffinity(0, 16, 0x146b010) = -1 EINVAL (Invalid
> argument)
> [pid  3420] sched_getaffinity(0, 32, {ffffffffff, 0, 0, 0}) = 32
> 
> The kernel seems to operate with a nr_cpu_ids value of 240:
> 
> kernel: setup_percpu: NR_CPUS:5120 nr_cpumask_bits:240 nr_cpu_ids:240
> nr_node_ids:2
> kernel:         RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=240.
> 
> nr_cpu_ids is not directly exposed to user space, I think, so the EINVAL
> behavior is pretty much required.

It is not. We need to get access to nr_cpumask_bits, and likely the only way to do that is read it from proc like some of the code does today.

See this thread:
https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
Comment 7 Florian Weimer 2016-02-04 14:39:47 UTC
What is the actual problem that we are trying to fix?

We could make sched_getaffinity succeed if the affinity mask contains only zero bits outside the mask buffer supplied by the application.  But then, the only way to run an application which specifies too small a set would require preventing the application from running on those extra CPUs.  This could leave parts of the machine unused.

We cannot drop discard set bits because some concurrent algorithms need to know on which CPUs they can run.

I do not want to introduce a CPU number offset in user space (which could be used to run multiple small-affinity-set applications on all CPUs).  This is probably something for the kernel.  Container/namespace support for affinity masks may even make sense for other reasons.
Comment 8 Carlos O'Donell 2016-08-09 15:12:53 UTC
(In reply to Florian Weimer from comment #7)
> What is the actual problem that we are trying to fix?

Applications using a fixed cpu_set_t on a machine with more than 1024 CPUs should always just work, but it doesn't.
 
> We could make sched_getaffinity succeed if the affinity mask contains only
> zero bits outside the mask buffer supplied by the application.  But then,
> the only way to run an application which specifies too small a set would
> require preventing the application from running on those extra CPUs.  This
> could leave parts of the machine unused.

This is a fine compromise.

The application was not designed to run on those CPUs anyway and need to be rewritten. The point is to allow such binaries to continue to run. Otherwise this is the equivalent of having broken compatibility.

> We cannot drop discard set bits because some concurrent algorithms need to
> know on which CPUs they can run.

Agreed. We can continue to return EINVAL in this case.

> I do not want to introduce a CPU number offset in user space (which could be
> used to run multiple small-affinity-set applications on all CPUs).  This is
> probably something for the kernel.  Container/namespace support for affinity
> masks may even make sense for other reasons.

I think with the above changes we could make this work.
Comment 9 Florian Weimer 2016-08-09 15:19:18 UTC
CPU ID translation should be performed by the kernel.  I filed:

  https://bugzilla.kernel.org/show_bug.cgi?id=151821