This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?
- From: KOSAKI Motohiro <kosaki dot motohiro at gmail dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: libc-alpha <libc-alpha at sourceware dot org>
- Date: Wed, 17 Jul 2013 18:56:09 -0400
- Subject: Re: What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?
- References: <51E42BFE dot 7000301 at redhat dot com> <51E4A0BB dot 2070802 at gmail dot com> <51E4A123 dot 9070001 at gmail dot com> <51E6F3ED dot 8000502 at redhat dot com> <51E6F956 dot 5050902 at gmail dot com> <51E714DE dot 6060802 at redhat dot com>
> Thank you for the clarification.
>
> I think I see what you mean.
>
> There is no race.
>
> (1) No race in sched_getaffinity internally.
>
> Just to clarify:
>
> kernel/sched/core.c:
> ...
> 3714 SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
> 3715 unsigned long __user *, user_mask_ptr)
> ...
> 3720 if ((len * BITS_PER_BYTE) < nr_cpu_ids)
> 3721 return -EINVAL;
>
> This code causes EINVAL to be returned when the length of the userspace
> buffer is smaller than the size of the mask for ONLINE cpus.
No. nr_cpu_ids mean possible cpus. This code mean, if user buffer is less than
possible cpus, sched_getaffinity() always return EINVAL.
I agree this name is not fully clear. ;-)
> 3729 if (ret == 0) {
> 3730 size_t retlen = min_t(size_t, len, cpumask_size());
> 3731
> 3732 if (copy_to_user(user_mask_ptr, mask, retlen))
> 3733 ret = -EFAULT;
> 3734 else
> 3735 ret = retlen;
> 3736 }
>
> This code causes at most the mask of POSSIBLE cpus (cpumask_size())
> to be copied to the userspace buffer even if the userspace buffer is
> bigger. No error is returned if the userspace buffer is bigger than
> the POSSIBLE cpu mask size.
No. The implementation of cpumask_size() is here.
static inline size_t cpumask_size(void)
{
/* FIXME: Once all cpumask assignments are eliminated, this
* can be nr_cpumask_bits */
return BITS_TO_LONGS(NR_CPUS) * sizeof(long);
}
It points to compile time limitation (i.e. 4096).
min_t(size_t, len, cpumask_size()) prevents copy_to_user() read unrelated
kernel memory if len is bigger than internal cpu bitmask size.
btw, we don't call memset(0) for clearing rest buffer because backward
compatibility. gibc need to clear it if needed.
> Thus the worst case behaviour is as follows:
>
> sched_getaffinity (glibc)
> Allocate space for glibc buffer based on /sys values.
> Kernel hot plug adds a CPU and ONLINE cpus is incremented by 1 or more.
> Call kernel sched_getaffinity, result is EINVAL
> Kernel hot plug adds a CPU and ONLINE cpus is incremented by 1 or more.
> Call kernel sched_getaffinity, result is EINVAL
> ... Repeat until you reach POSSIBLE cpu limit.
> Call kernel sched_getaffinity, result is size of POSSIBLE cpu mask.
> Copy all or part of glibc buffer to user buffer.
>
> So there is no race, but possibly a long wait if the kernel
> is in the process of adding a lot of cpus quickly.
If separate buffer is acceptable, we can allocate possible cpu mask
width buffer. So,
I'm not sure why we need repeat.
> (2) No race between sched_getaffinity and sched_setaffinity.
>
> The call to sched_getaffinity always succeeds, and glibc knows the
> size of the ONLINE number of cpus (since kernel sched_getaffinity
> succeeds if you have at least enough space for ONLINE cpus).
again. it's not correct.
> You manipulate the mask and call sched_setaffinity which succeeds
> because either the length of the user buffer is less than the
> POSSIBLE cpu mask size and that is OK (the kernel copies what it
> can from userspace), or the buffer is larger than the POSSIBLE cpu
> mask size and the kernel ignores the higher bytes.
>
> e.g.
>
> kernel/sched/core.c:
> 3646 static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
> 3647 struct cpumask *new_mask)
> 3648 {
> 3649 if (len < cpumask_size())
> 3650 cpumask_clear(new_mask);
> 3651 else if (len > cpumask_size())
> 3652 len = cpumask_size();
> 3653
> 3654 return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
> 3655 }
>
> Thus there is no race between the get/set interface and it always
> works without failure.
get_user_cpu_mask() is only user for sched_setaffinity().
sched_getaffinity() has
an another rule. At least, now.
>> Example:
>>
>> online cpus: 0-1023
>> possible cpus: 0-4095
>>
>> and, if the user allocate and uses 1024bit width bitmap, sched_getaffinity() always
>> fail even though he doesn't use hot plugging. So, this is not only race matter.
>
> I agree there is no race.
>
> However, I expected the current kernel code to work for your example
> since you changed sched_getaffinity in the kernel to work with ONLINE
> cpus?
No. see above.
>
> The current glibc sched_getaffinity will return EINVAL when we have:
>
> user buffer: 1024 bits.
> online cpus: 0-4095.
> possible cpus: 0-65535.
>
> This is because of:
>
> kernel/sched/core.c:
> 3720 if ((len * BITS_PER_BYTE) < nr_cpu_ids)
> 3721 return -EINVAL;
>
> In this case the *new* sched_getaffinity in glibc would loop or
> use _SC_NPROCESSORS_MAX_NP (new constant) to determine the possible
> number of cpus before making the syscall.
>> I hope to clarify the meanings of _SC_NPROCESSORS_CONF at first. I'm still
>> unclear we need to make new _SC_NPROCESSORS_MAX_NP.
>
> glibc uses /sys/devices/system/cpu or /proc/cpuinfo for
> computing the value of _SC_NPROCESSORS_CONF.
>
> We can't use _SC_NPROCESSORS_CONF here because this constant
> is only for configured CPUs which may be brought online by
> the OS.
>
> You might have an empty socket with no configured CPU and
> that would contribute towards possible cpus, but not configured
> or online cpus.
If 'configured' is such mean, you are correct. And then, we need to fix glibc
manual for clarify.
> It really seems to me that we need _SC_NPROCESSORS_MAX_NP to
> indicate the physical maximum number of CPUs the OS can connect
> given a particular topology, and it can return -1 to indicate
> it is completely a dynamic topology in which case all we can
> do is call sched_getaffinity with higher and higher buffers
> to find the current maximum (and we can't cache it).
Right.
> Does that clarify why I think we may need _SC_NPROCESSORS_MAX_NP?
Yes, completely. My question was, current glibc manual describe
sysconf (_SC_NPROCESSORS_CONF) mean "To determine the number of
processors available to the system one can run". And, if "can" mean
hot plugging, it should
be possible cpus. It was my point.