This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)

From: Ondřej Bílka <neleai at seznam dot cz>
To: Mathieu Desnoyers <mathieu dot desnoyers at efficios dot com>
Cc: Linus Torvalds <torvalds at linux-foundation dot org>, Andy Lutomirski <luto at amacapital dot net>, Ben Maurer <bmaurer at fb dot com>, Ingo Molnar <mingo at redhat dot com>, libc-alpha <libc-alpha at sourceware dot org>, Andrew Morton <akpm at linux-foundation dot org>, linux-api <linux-api at vger dot kernel dot org>, rostedt <rostedt at goodmis dot org>, "Paul E. McKenney" <paulmck at linux dot vnet dot ibm dot com>, Florian Weimer <fweimer at redhat dot com>, Josh Triplett <josh at joshtriplett dot org>, Lai Jiangshan <laijs at cn dot fujitsu dot com>, Paul Turner <pjt at google dot com>, Andrew Hunter <ahh at google dot com>, Peter Zijlstra <peterz at infradead dot org>
Date: Tue, 21 Jul 2015 20:00:51 +0200
Subject: Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
Authentication-results: sourceware.org; auth=none
References: <1436724386-30909-1-git-send-email-mathieu dot desnoyers at efficios dot com> <CA+55aFzMJkzydXb7uVv1iSUnp=539d43ghQaonGdzMoF7QLZBA at mail dot gmail dot com> <CALCETrUZ8vB30rdmeoV4JKPUsRnVPvoxXRJ47CEFud2aSF2=Ew at mail dot gmail dot com> <CA+55aFwLZLeeN7UN82dyt=emQcNBc8qZPJAw5iqtAbBwFA7FPQ at mail dot gmail dot com> <2010227315 dot 699 dot 1437438300542 dot JavaMail dot zimbra at efficios dot com> <20150721073053 dot GA14716 at domone> <894137397 dot 137 dot 1437483493715 dot JavaMail dot zimbra at efficios dot com> <20150721151613 dot GA12856 at domone> <1350114812 dot 1035 dot 1437500726799 dot JavaMail dot zimbra at efficios dot com>

On Tue, Jul 21, 2015 at 05:45:26PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 21, 2015, at 11:16 AM, Ondřej Bílka neleai@seznam.cz wrote:
> 
> > On Tue, Jul 21, 2015 at 12:58:13PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jul 21, 2015, at 3:30 AM, Ondřej Bílka neleai@seznam.cz wrote:
> >> 
> >> > On Tue, Jul 21, 2015 at 12:25:00AM +0000, Mathieu Desnoyers wrote:
> >> >> >> Does it solve the Wine problem?  If Wine uses gs for something and
> >> >> >> calls a function that does this, Wine still goes boom, right?
> >> >> > 
> >> >> > So the advantage of just making a global segment descriptor available
> >> >> > is that it's not *that* expensive to just save/restore segments. So
> >> >> > either wine could do it, or any library users would do it.
> >> >> > 
> >> >> > But anyway, I'm not sure this is a good idea. The advantage of it is
> >> >> > that the kernel support really is _very_ minimal.
> >> >> 
> >> >> Considering that we'd at least also want this feature on ARM and
> >> >> PowerPC 32/64, and that the gs segment selector approach clashes with
> >> >> existing apps (wine), I'm not sure that implementing a gs segment
> >> >> selector based approach to cpu number caching would lead to an overall
> >> >> decrease in complexity if it leads to performance similar to those of
> >> >> portable approaches.
> >> >> 
> >> >> I'm perfectly fine with architecture-specific tweaks that lead to
> >> >> fast-path speedups, but if we have to bite the bullet and implement
> >> >> an approach based on TLS and registering a memory area at thread start
> >> >> through a system call on other architectures anyway, it might end up
> >> >> being less complex to add a new system call on x86 too, especially if
> >> >> fast path overhead is similar.
> >> >> 
> >> >> But I'm inclined to think that some aspect of the question eludes me,
> >> >> especially given the amount of interest generated by the gs-segment
> >> >> selector approach. What am I missing ?
> >> >> 
> >> > As I wrote before, you don't have to bite the bullet. It
> >> > suffices to create a 128k-element array holding the cpu for each tid,
> >> > make that an mmapable file, and userspace could get the cpu with nearly
> >> > the same performance without hacks.
> >> 
> >> I don't see how this would be acceptable on memory-constrained embedded
> >> systems. They have multiple cores, and performance requirements, so
> >> having a fast getcpu would be useful there (e.g. telecom industry),
> >> but they clearly cannot afford a 512kB table per process just for that.
> >> 
> > Which just means that you need a more complicated api and implementation
> > for that, but the idea stays the same. You would need syscalls
> > register/deregister_cpuid_idx that would give you an index to use instead
> > of the tid. The kernel would need to handle that many ids could be
> > registered for each thread, and resize the mmaped file in the syscalls.
> 
> I feel we're talking past each other here. What I propose is to implement
> a system call that registers a TLS area. It can be invoked at thread start.
> The kernel can then keep the current CPU number within that registered
> area up-to-date. This system call does not care how the TLS is implemented
> underneath.
> 
> My understanding is that you are suggesting a way to speed up TLS accesses
> by creating a table indexed by TID. Although it might lead to interesting
> speed ups useful when reading the TLS, I don't see how your proposal is
> useful in addressing the problem of caching the current CPU number (other
> than possibly speeding up TLS accesses).
> 
> Or am I missing something fundamental to your proposal ?
>
No, I am still talking about getting the cpu number. My first proposal was
that the kernel allocates a table of current cpu numbers indexed by tid. A
process could mmap that table and get the cpu with cpu_tid_table[tid]. When
you said that size is a problem, I replied that you need to be more careful:
instead of the tid, you use a different id that you get from, say,
register_cpucache, store it in a tls variable, and get the cpu with
cpu_cid_table[cid]. That reduces the space used to only the threads that
use this.
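
A rough sketch of the cid variant, only to make the lookup concrete. Nothing
here is an existing kernel interface: register_cpucache and
deregister_cpucache are ordinary functions standing in for the hypothetical
syscalls, cpu_cid_table is a plain array standing in for the table the
kernel would expose through mmap, and fake_kernel_migrate plays the role of
the scheduler updating a slot. The slot count and helper names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size; a real table would be sized and resized by the kernel. */
#define CPU_CID_TABLE_SLOTS 64

/* Stand-in for the kernel-maintained table that userspace would mmap. */
static volatile int32_t cpu_cid_table[CPU_CID_TABLE_SLOTS];
static uint8_t cid_in_use[CPU_CID_TABLE_SLOTS];

/* Each thread keeps its own cid in a tls variable; -1 means unregistered. */
static _Thread_local int my_cid = -1;

/* Stand-in for the hypothetical register_cpucache syscall: hand out a free
 * slot. Not thread-safe here; a kernel would serialize this itself. */
static int register_cpucache(void)
{
    for (int cid = 0; cid < CPU_CID_TABLE_SLOTS; cid++) {
        if (!cid_in_use[cid]) {
            cid_in_use[cid] = 1;
            cpu_cid_table[cid] = 0;
            my_cid = cid;
            return cid;
        }
    }
    return -1; /* table full */
}

/* Stand-in for the hypothetical deregister call: release this thread's slot. */
static void deregister_cpucache(void)
{
    if (my_cid >= 0) {
        cid_in_use[my_cid] = 0;
        my_cid = -1;
    }
}

/* Simulates the kernel writing the new cpu number when a thread migrates. */
static void fake_kernel_migrate(int cid, int32_t cpu)
{
    assert(cid >= 0 && cid < CPU_CID_TABLE_SLOTS);
    cpu_cid_table[cid] = cpu;
}

/* Fast path: one tls load plus one table load, no syscall. */
static inline int32_t cached_getcpu(void)
{
    assert(my_cid >= 0);
    return cpu_cid_table[my_cid];
}
```

A thread would call register_cpucache once at start and then read
cached_getcpu() wherever it needs the cpu number. Only registered threads
occupy a slot, which is where the space saving over the tid-indexed table
comes from.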

The tls speedup was a side remark: once you implement a per-cpu page, you
could also use it to speed up tls. Fast tls access and getting the tid are
equivalent problems, as you could easily implement one with the other.

Follow-Ups:
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers

References:
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Andy Lutomirski
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Linus Torvalds
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Ondřej Bílka
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Ondřej Bílka
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers
