This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?


(7/23/13 9:33 AM), Carlos O'Donell wrote:
On 07/22/2013 11:52 PM, KOSAKI Motohiro wrote:
* Used to create a minimally sized structure to track
   per-logical-CPU data.
   - As it is implemented, _SC_NPROCESSORS_CONF is a minimal
     value. Fixing it to match your expected semantics, e.g.
     making it the number of possible CPUs, is going
     to make this value potentially much larger.
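
(For concreteness, a minimal sketch of that sizing pattern; the struct and
helper below are illustrative, not taken from glibc or any real application:)

  #include <stdlib.h>
  #include <unistd.h>

  /* Illustrative per-CPU record.  */
  struct cpu_stats { unsigned long events; };

  struct cpu_stats *
  alloc_per_cpu_stats (long *ncpus)
  {
    long n = sysconf (_SC_NPROCESSORS_CONF);
    if (n < 1)
      return NULL;
    *ncpus = n;
    /* One slot per configured logical CPU.  */
    return calloc (n, sizeof (struct cpu_stats));
  }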

Hmm.

This didn't cross my mind. I think this case should use (c), because otherwise
your application may crash when CPU hotplug occurs.

It may crash only if you don't keep the data consistent, e.g. by using a CPU
count that doesn't match the allocated structure size.

Right.

However, I think I'm starting to agree with Roland; read my thoughts
below.

Practically, we have no way to learn about CPU hot-add _synchronously_, and
there are several race windows if you are polling for changes to the
_SC_NPROCESSORS_CONF value.

With any dynamic system, either notification-based or polling, you are going
to have some period of time where the userspace data structure is out of sync
with the real hardware state, and it takes time for the state to become
consistent. Userspace code would have to be written to adjust for this
(ignoring the POSIX requirement that these constants never change).

However, if userspace code has to adjust for changing _SC_NPROCESSORS_CONF,
assuming it isn't equal to possible CPUs, then we have exactly the same
problem we had with sched_getaffinity returning EINVAL. Userspace code must
be written in a loop to look at _SC_NPROCESSORS_CONF and increase storage
size if the number of configured CPUs grows larger.

So while we saved the user writing looping code for sched_getaffinity,
we still have the same looping code one level higher.
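
For reference, the sched_getaffinity looping code being discussed looks
roughly like this, using glibc's dynamic CPU_ALLOC interface (a sketch only;
the starting guess of 128 and the doubling policy are arbitrary choices, not
anything glibc prescribes):

  #define _GNU_SOURCE
  #include <errno.h>
  #include <sched.h>
  #include <stdlib.h>

  /* Probe the kernel's affinity mask size by growing the set until
     sched_getaffinity stops failing with EINVAL.  */
  cpu_set_t *
  get_affinity_set (size_t *setsize)
  {
    size_t count = 128;                 /* initial guess */
    for (;;)
      {
        cpu_set_t *set = CPU_ALLOC (count);
        if (set == NULL)
          return NULL;
        *setsize = CPU_ALLOC_SIZE (count);
        if (sched_getaffinity (0, *setsize, set) == 0)
          return set;                   /* kernel mask fits in this set */
        CPU_FREE (set);
        if (errno != EINVAL)
          return NULL;                  /* real failure */
        count *= 2;                     /* set too small: grow and retry */
      }
  }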

I think that perhaps Roland is right here and that _SC_NPROCESSORS_CONF
should just be changed to match "possible" CPUs for Linux.

Actually, in the kernel the possible-CPUs structure was mainly created to
represent per-CPU data efficiently.

OK, so the reality is that possible shouldn't be orders of magnitude
larger than configured or online CPUs?

It depends on the firmware. Sorry, we can't promise anything. There are two
worst cases. 1) The firmware is buggy and wrongly reports plenty of CPUs.
I can't say anything about that, but note that I have not seen such a
mistake reported on LKML. 2) A customer buys a hotplug-aware high-end machine
but only buys a few CPUs. In this case the firmware reports "hey, I have a
lot of room for expansion!". This is also unavoidable. But, as far as I have
observed, such high-end machines are very costly, so customers don't tend to
make that mistake.

From the application's point of view, we have a way to mitigate the above
problems. Using mmap() instead of malloc()+memset() helps reduce the wasted
memory, because Linux allocates the actual memory at first touch. So if some
per-CPU area is never accessed, no memory is allocated for it at all. Yes,
this is not perfect, because a page fault is a page-sized operation, so we
can still waste up to a page of memory. However, at least the Linux kernel
uses a similar technique, and that level of memory usage is considered
acceptable by a wide range of customers.
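
A sketch of that approach, assuming an anonymous private mapping (the struct
and function names are illustrative):

  #include <stddef.h>
  #include <sys/mman.h>

  /* Illustrative per-CPU record, as above.  */
  struct cpu_stats { unsigned long events; };

  /* Reserve one slot per possible CPU.  With an anonymous private mapping
     the pages are zero-filled and physical memory is only allocated when a
     slot is first written, so slots for CPUs that never come online cost at
     most the page they share.  */
  struct cpu_stats *
  alloc_per_cpu_lazy (long possible_cpus)
  {
    size_t len = (size_t) possible_cpus * sizeof (struct cpu_stats);
    void *p = mmap (NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
  }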

The remaining problem is mlockall(). If you use mlockall(), all of the data
is allocated at mmap() time instead of at first touch, so the wasted memory
increases. That is userland-specific, and there is nothing the kernel can do
about it. But I guess such applications prefer wasted memory over
unpredictable latency.
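
For completeness, the call in question (a sketch; whether an application uses
MCL_FUTURE at all is its own trade-off):

  #include <sys/mman.h>

  /* With MCL_FUTURE, pages of later mmap() calls are faulted in and pinned
     up front, so the first-touch saving above no longer applies: the whole
     per-CPU array is committed at mmap() time.  */
  int
  pin_all_memory (void)
  {
    return mlockall (MCL_CURRENT | MCL_FUTURE);
  }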

This is my concern. Had this crossed your mind?
