This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
- From: Mathieu Desnoyers <mathieu dot desnoyers at efficios dot com>
- To: Linus Torvalds <torvalds at linux-foundation dot org>
- Cc: Andy Lutomirski <luto at amacapital dot net>, Ben Maurer <bmaurer at fb dot com>, Ingo Molnar <mingo at redhat dot com>, libc-alpha <libc-alpha at sourceware dot org>, Andrew Morton <akpm at linux-foundation dot org>, linux-api <linux-api at vger dot kernel dot org>, Ondřej Bílka <neleai at seznam dot cz>, rostedt <rostedt at goodmis dot org>, "Paul E. McKenney" <paulmck at linux dot vnet dot ibm dot com>, Florian Weimer <fweimer at redhat dot com>, Josh Triplett <josh at joshtriplett dot org>, Lai Jiangshan <laijs at cn dot fujitsu dot com>, Paul Turner <pjt at google dot com>, Andrew Hunter <ahh at google dot com>, Peter Zijlstra <peterz at infradead dot org>
- Date: Tue, 21 Jul 2015 00:25:00 +0000 (UTC)
- Subject: Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
- Authentication-results: sourceware.org; auth=none
- References: <1436724386-30909-1-git-send-email-mathieu dot desnoyers at efficios dot com> <55ACB2DC dot 5010503 at redhat dot com> <CALCETrV9Vp5UUOb3e_R5tphyE-urBgTwQR2pFWUOOFnHqWXHKQ at mail dot gmail dot com> <55AD14A4 dot 6030101 at redhat dot com> <CALCETrUx6wFxmz+9TyW5bNgaMN0q180G8y9YOyq_D41sdhFaRQ at mail dot gmail dot com> <CA+55aFzMJkzydXb7uVv1iSUnp=539d43ghQaonGdzMoF7QLZBA at mail dot gmail dot com> <CALCETrUZ8vB30rdmeoV4JKPUsRnVPvoxXRJ47CEFud2aSF2=Ew at mail dot gmail dot com> <CA+55aFwLZLeeN7UN82dyt=emQcNBc8qZPJAw5iqtAbBwFA7FPQ at mail dot gmail dot com>
----- On Jul 20, 2015, at 6:39 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
> On Mon, Jul 20, 2015 at 2:09 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Annoying problem one: the segment base field is only 32 bits in the GDT.
>
> Ok. So if we go this way, we'd make the rule be something like "the
> segment base is the CPU number shifted up by the page size", and then
> you'd have to add some magic offset that we'd declare as the "per-cpu
> page offset".
>
>>> - user space can just load the segment selector in %gs
>>
>> IIRC this is very expensive -- 40 cycles or so. At this point
>> userspace might as well just use a real lock cmpxchg.
>
> So cmpxchg may be as many cycles, but
>
> (a) you can choose to load the segment just once, and do several
> operations with it
>
> (b) often - but admittedly not always - the real cost of a
> non-cpu-local local and cmpxchg tends to be the cacheline ping-pong,
> not the CPU cycles.
>
> so I agree, loading a segment isn't free. But it's not *that*
> expensive, and you could always decide to keep the segment loaded and
> just do
>
> - read segment selector
> - if NUL segment, reload it.
>
> although that only works if you own the segment entirely and can keep
> it as the percpu segment (ie obviously not the Wine case, for
> example).
>
>> Does it solve the Wine problem? If Wine uses gs for something and
>> calls a function that does this, Wine still goes boom, right?
>
> So the advantage of just making a global segment descriptor available
> is that it's not *that* expensive to just save/restore segments. So
> either wine could do it, or any library users would do it.
>
> But anyway, I'm not sure this is a good idea. The advantage of it is
> that the kernel support really is _very_ minimal.

Considering that we'd want this feature on at least ARM and PowerPC
32/64 as well, and that the gs segment selector approach clashes with
existing apps (Wine), I'm not sure that implementing a gs segment
selector based approach to CPU number caching would reduce overall
complexity if its performance ends up similar to that of portable
approaches.
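
To make the rule quoted above concrete ("the segment base is the CPU
number shifted up by the page size", plus a fixed per-cpu page offset),
here is a minimal sketch in C. It is only an illustration: the 4 KiB
page size, the PERCPU_PAGE_OFFSET value and both function names are
assumptions of mine, not anything the kernel defines. It also shows
where the 32-bit GDT base limit bites.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SHIFT 12u
/* Hypothetical "per-cpu page offset"; the value is purely illustrative. */
#define PERCPU_PAGE_OFFSET UINT64_C(0x7f0000000000)

/*
 * Segment base under the quoted rule: CPU number shifted up by the page
 * size. Returns 0 and stores the base on success, or -1 when the result
 * no longer fits the 32-bit base field of a GDT descriptor.
 */
static int percpu_segment_base(uint32_t cpu, uint32_t *base)
{
    uint64_t wide = (uint64_t)cpu << PAGE_SHIFT;

    if (wide > UINT32_MAX)
        return -1;
    *base = (uint32_t)wide;
    return 0;
}

/* Linear address reached through the segment: base plus the fixed offset. */
static uint64_t percpu_page_address(uint32_t base)
{
    assert(base % (1u << PAGE_SHIFT) == 0); /* base is page-aligned */
    return PERCPU_PAGE_OFFSET + base;
}
```

With these assumed values, CPU 3 would land at PERCPU_PAGE_OFFSET +
0x3000, and the scheme runs out of base bits at 2^20 CPUs.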

I'm perfectly fine with architecture-specific tweaks that lead to
fast-path speedups, but if we have to bite the bullet and implement
an approach based on TLS and registering a memory area at thread start
through a system call on other architectures anyway, it might end up
being less complex to add a new system call on x86 too, especially if
fast path overhead is similar.
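
As a rough user-space sketch of the TLS-based approach described above
(in C, Linux only): each thread owns a small cache slot, registers it
once, and the fast path is then a plain TLS load. The names cpu_cache,
register_cpu_cache() and read_cpu_cache() are hypothetical. Since the
registration system call is only a proposal at this point, the sketch
stands in for it by seeding the slot with the existing getcpu system
call; nothing here keeps the value current across migrations, which is
exactly the part the kernel would have to provide.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Per-thread cache slot; -1 means "not registered yet". */
static __thread volatile int32_t cpu_cache = -1;

/*
 * Hypothetical stand-in for the proposed registration system call.
 * The real call would hand &cpu_cache to the kernel so that it can
 * update the slot; this sketch only seeds it once via getcpu.
 */
static int register_cpu_cache(void)
{
    unsigned int cpu = 0;

    if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
        return -1;
    cpu_cache = (int32_t)cpu;
    return 0;
}

/* Fast path: one TLS load; the slow path registers on first use. */
static int32_t read_cpu_cache(void)
{
    if (cpu_cache < 0 && register_cpu_cache() != 0)
        return -1;
    assert(cpu_cache >= 0);
    return cpu_cache;
}
```

A caller would use read_cpu_cache() to pick a per-cpu data slot and
must still tolerate being migrated right after the load, so the value
is a hint rather than a guarantee.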

But I'm inclined to think that some aspect of the question eludes me,
especially given the amount of interest generated by the gs segment
selector approach. What am I missing?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com