This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Remove CPU mask size detection from setaffinity


On 05/21/2015 07:43 AM, Florian Weimer wrote:
> On 05/21/2015 01:28 PM, Florian Weimer wrote:
>> On 05/20/2015 06:24 PM, Florian Weimer wrote:
>>> The code looks quite broken to me and fails to achieve what it tries to
>>> do, as explained in the commit message.
>>
>> Another advantage is that the magic number 1024 (128 bytes of CPU bits)
>> is gone, so testing on smaller machines is possible.
>>
>> Well, I did that (144 CPU cores, detected set size 192 bits), and it
>> turns out that my pthread test case is completely broken. :-(
>>
>> Here's a new one.  It tries to check that the active CPU set returned by
>> the kernel has not been truncated, and that threads are not scheduled to
>> CPUs outside the set.
> Now actually with the patch.
> 
> -- Florian Weimer / Red Hat Product Security
> 
> 
> 0001-Remove-CPU-set-size-checking-from-sched_setaffinity-.patch
> 
> 
> From ee05f3baec0f2d723c1410cd83845e82c7971264 Mon Sep 17 00:00:00 2001
> Message-Id: <ee05f3baec0f2d723c1410cd83845e82c7971264.1432208579.git.fweimer@redhat.com>
> From: Florian Weimer <fweimer@redhat.com>
> Date: Thu, 21 May 2015 12:42:52 +0100
> Subject: [PATCH] Remove CPU set size checking from sched_setaffinity,
>  pthread_setaffinity_np
> To: libc-alpha@sourceware.org
> 
> With current kernel versions, the check does not reliably detect that
> unavailable CPUs are requested, for these reasons:
> 
> (1) The kernel will silently ignore non-allowed CPUs.
> 
> (2) Similarly, CPU bits which lack an online CPU (possible CPUs)
>     are ignored.
> 
> (3) The existing probing code assumes that the CPU mask size is a
>     power of two and at least 1024.  It neither has to be a power
>     of two, nor is 1024 the minimum possible value, so the value
>     determined is often too large, resulting in spurious false
>     negatives.
>
> The kernel will still return EINVAL if no CPU in the requested set
> remains which can run the current thread after the affinity change.
> 
> Applications which care about the exact affinity mask will have to query
> it using sched_getaffinity after setting it.

Please bear with me as we work out the exact details and semantics
of the changes.

Firstly, I'd like to see a clear definition of the problem we are
trying to solve before we remove the existing code. Not to say that
the existing code is great, but it keeps the status quo, and the
expectations of user programs are preserved. Describing the problem
also prevents others from asking "Oh, what about the >1024 cpu
problem?" and "Oh, what about the proper use of online, possible,
present CPUs and how to express them via sysconf?" etc.

Is this the problem?

"The glibc sched_setaffinity routine incorrectly attempts to compute the
 kernel cpumask_t, and therefore does more harm than good by imposing an
 incorrectly computed cpumask_t size as a limit." With "harm" explained
 in detail.

The glibc API description for sched_setaffinity says EINVAL
will be returned when the affinity bit mask contains no processors
that are currently physically on the system, or none that are
permitted to run. It does not fail if the bit mask is larger than
the kernel cpumask_t and the extra space contains zeroes. In fact
the glibc manual states that this information can even be used in
the *future* by the scheduler, one presumes, when more cpus come
online.

The present code in sysdeps/unix/sysv/linux/sched_setaffinity.c
uses sched_getaffinity to compute what it believes is the maximum
size of the kernel cpumask_t. The loop there will succeed at the
first call on most systems, since NR_CPUS[1] usually defaults to
1024 or less. If NR_CPUS were very high, the limit the kernel uses
is nr_cpu_ids, which is still the correct limit and reflects the
maximum possible cpus as detected by hardware (as opposed to the
compile-time maximum). As far as I can tell, neither the kernel nor
the glibc code will correctly compute nr_cpu_ids or NR_CPUS; glibc
instead computes the largest power-of-two size, starting with 128
bytes, that the kernel will accept as a cpumask_t. This value need
not be nr_cpu_ids, and if nr_cpu_ids is larger than 1024 it
certainly won't be, since the kernel is happy to copy back a
partial set of bits, i.e. a len in bits smaller than nr_cpu_ids.
This leads to a bug in glibc when trying to set affinity for cpu
1024 (the 1025th cpu), since the function would reject that as a
bit outside of the kernel cpu mask.


If the kernel ignores non-allowed cpus, that would seem like a bug
in the interface. Either the call succeeds according to your request
or it doesn't.

If the kernel ignores bits for offline cpus, that is also a bug:
it should record them, and when those cpus come online and become
available they should be used.

However, I have not worked on these interfaces on the kernel side,
and I would appreciate Motohiro-san's experience in this matter.

Thus I only see kernel bugs in not notifying, with an error return
value, that a forbidden cpu was specified in the mask and ignored,
or ignoring a set bit for an offline cpu.

The glibc implementation of sched_setaffinity computes nr_cpu_ids
(wrongly) once, caches it (since it can't change), and then uses
that to implement an otherwise sensible API where we reject cpu
bits set for cpus that can't exist given the booted hardware.
Additionally, the API should never fail.

The bugs I see in glibc are:
- Fix the nr_cpu_ids computation on the glibc side by simply reading
  the sysfs value of possible, or iterating sched_getaffinity
  until it returns the same value twice. This would fix the case
  on systems with lots of cpus.
- Add a new sysconf parameter, _SC_NPROCESSORS_MAX, to indicate
  possible, so that a user can read this first and use that value
  to do dynamic cpu set allocation.

Not to mention refactoring all of this code into sysconf, and having
sched_[sg]etaffinity use the refactored functions.

In summary:

Don't remove the existing code, but fix it.


Cheers,
Carlos.

[1] /sys/devices/system/cpu/kernel_max, i.e. NR_CPUS or CONFIG_NR_CPUS,
    is a static compile-time constant: the maximum number of cpus
    that can ever be detected and made online by this kernel. It
    is not the possible number of CPUs; that value is <= NR_CPUS
    and limited by the topology of the system. For example,
    possible cpus might be a maximum of 254 cpus on an 8-bit APIC
    system (two IDs reserved, one for broadcast and one for the
    io-apic). In general nr_cpu_ids <= NR_CPUS, and it is usually
    set to the possible hardware number of cpus, since that is
    lower than NR_CPUS and saves time iterating loops.

