This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)

From: Mathieu Desnoyers <mathieu dot desnoyers at efficios dot com>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: Linus Torvalds <torvalds at linux-foundation dot org>, Andy Lutomirski <luto at amacapital dot net>, Ben Maurer <bmaurer at fb dot com>, Ingo Molnar <mingo at redhat dot com>, libc-alpha <libc-alpha at sourceware dot org>, Andrew Morton <akpm at linux-foundation dot org>, linux-api <linux-api at vger dot kernel dot org>, rostedt <rostedt at goodmis dot org>, "Paul E. McKenney" <paulmck at linux dot vnet dot ibm dot com>, Florian Weimer <fweimer at redhat dot com>, Josh Triplett <josh at joshtriplett dot org>, Lai Jiangshan <laijs at cn dot fujitsu dot com>, Paul Turner <pjt at google dot com>, Andrew Hunter <ahh at google dot com>, Peter Zijlstra <peterz at infradead dot org>
Date: Tue, 21 Jul 2015 18:18:03 +0000 (UTC)
Subject: Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
Authentication-results: sourceware.org; auth=none
References: <1436724386-30909-1-git-send-email-mathieu dot desnoyers at efficios dot com> <CA+55aFwLZLeeN7UN82dyt=emQcNBc8qZPJAw5iqtAbBwFA7FPQ at mail dot gmail dot com> <2010227315 dot 699 dot 1437438300542 dot JavaMail dot zimbra at efficios dot com> <20150721073053 dot GA14716 at domone> <894137397 dot 137 dot 1437483493715 dot JavaMail dot zimbra at efficios dot com> <20150721151613 dot GA12856 at domone> <1350114812 dot 1035 dot 1437500726799 dot JavaMail dot zimbra at efficios dot com> <20150721180051 dot GA24053 at domone>

----- On Jul 21, 2015, at 2:00 PM, Ondřej Bílka neleai@seznam.cz wrote:

> On Tue, Jul 21, 2015 at 05:45:26PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jul 21, 2015, at 11:16 AM, Ondřej Bílka neleai@seznam.cz wrote:
>> 
>> > On Tue, Jul 21, 2015 at 12:58:13PM +0000, Mathieu Desnoyers wrote:
>> >> ----- On Jul 21, 2015, at 3:30 AM, Ondřej Bílka neleai@seznam.cz wrote:
>> >> 
>> >> > On Tue, Jul 21, 2015 at 12:25:00AM +0000, Mathieu Desnoyers wrote:
>> >> >> >> Does it solve the Wine problem?  If Wine uses gs for something and
>> >> >> >> calls a function that does this, Wine still goes boom, right?
>> >> >> > 
>> >> >> > So the advantage of just making a global segment descriptor available
>> >> >> > is that it's not *that* expensive to just save/restore segments. So
>> >> >> > either wine could do it, or any library users would do it.
>> >> >> > 
>> >> >> > But anyway, I'm not sure this is a good idea. The advantage of it is
>> >> >> > that the kernel support really is _very_ minimal.
>> >> >> 
>> >> >> Considering that we'd at least also want this feature on ARM and
>> >> >> PowerPC 32/64, and that the gs segment selector approach clashes with
>> >> >> existing apps (wine), I'm not sure that implementing a gs segment
>> >> >> selector based approach to cpu number caching would lead to an overall
>> >> >> decrease in complexity if it leads to performance similar to that of
>> >> >> portable approaches.
>> >> >> 
>> >> >> I'm perfectly fine with architecture-specific tweaks that lead to
>> >> >> fast-path speedups, but if we have to bite the bullet and implement
>> >> >> an approach based on TLS and registering a memory area at thread start
>> >> >> through a system call on other architectures anyway, it might end up
>> >> >> being less complex to add a new system call on x86 too, especially if
>> >> >> fast path overhead is similar.
>> >> >> 
>> >> >> But I'm inclined to think that some aspect of the question eludes me,
>> >> >> especially given the amount of interest generated by the gs-segment
>> >> >> selector approach. What am I missing ?
>> >> >> 
>> >> > As I wrote before, you don't have to bite the bullet. It suffices
>> >> > to create a 128k-element array holding the cpu for each tid and
>> >> > expose it as an mmapable file; userspace could then get the cpu
>> >> > with nearly the same performance, without hacks.
>> >> 
>> >> I don't see how this would be acceptable on memory-constrained embedded
>> >> systems. They have multiple cores, and performance requirements, so
>> >> having a fast getcpu would be useful there (e.g. telecom industry),
>> >> but they clearly cannot afford a 512kB table per process just for that.
>> >> 
>> > Which just means that you need a more complicated API and
>> > implementation for that, but the idea stays the same. You would need
>> > syscalls register/deregister_cpuid_idx that would give you an index
>> > to use instead of the tid. The kernel would need to handle the fact
>> > that many ids could be registered for each thread, and resize the
>> > mmaped file in those syscalls.
>> 
>> I feel we're talking past each other here. What I propose is to implement
>> a system call that registers a TLS area. It can be invoked at thread start.
>> The kernel can then keep the current CPU number within that registered
>> area up-to-date. This system call does not care how the TLS is implemented
>> underneath.
>> 
>> My understanding is that you are suggesting a way to speed up TLS accesses
>> by creating a table indexed by TID. Although it might lead to interesting
>> speed ups useful when reading the TLS, I don't see how your proposal is
>> useful in addressing the problem of caching the current CPU number (other
>> than possibly speeding up TLS accesses).
>> 
>> Or am I missing something fundamental to your proposal ?
>>
> No, I am still talking about getting the cpu number. My first proposal is
> that the kernel allocates a table of current cpu numbers indexed by tid. A
> process could mmap that table and get the cpu with cpu_tid_table[tid]. When
> you said that size is a problem, I replied that you need to be more careful:
> instead of the tid, you use a different id that you get from, say,
> register_cpucache, store it in a tls variable, and get the cpu with
> cpu_cid_table[cid]. That limits the space used to the threads that use
> this.
> 
> The tls speedup was a side remark: if you implemented a per-cpu page, you
> could then speed up tls. Tls access speed and getting the tid are
> equivalent, as you could easily implement one with the other.

Thanks for the clarification. That raises a fundamental question:
what is the upside of a dedicated array of current cpu number values
over a TLS variable? The main downside I see with the array is false
sharing, caused by many threads' current cpu number variables sitting
on the same cache line. That looks like an overall performance loss.
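
To make the false-sharing concern concrete, here is a minimal sketch of
the arithmetic. It assumes 4-byte entries (consistent with the 128k
entries / 512kB figures mentioned earlier in the thread) and a 64-byte
cache line, which is typical on x86 but not guaranteed elsewhere. All
names below are hypothetical and only for illustration; none of them
come from the proposed patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (typical on x86, may differ elsewhere). */
#define CACHE_LINE_SIZE 64u

/* Hypothetical padded entry: one cpu number per cache line. */
struct padded_cpu_entry {
	int32_t cpu;
	char pad[CACHE_LINE_SIZE - sizeof(int32_t)];
};

/* Number of packed 4-byte cpu numbers that share one cache line. */
static size_t packed_entries_per_line(void)
{
	return CACHE_LINE_SIZE / sizeof(int32_t);
}

/* Cache line index holding entry idx of a packed int32_t table. */
static size_t packed_line_of(size_t idx)
{
	return (idx * sizeof(int32_t)) / CACHE_LINE_SIZE;
}

/* Cache line index holding entry idx of a padded table. */
static size_t padded_line_of(size_t idx)
{
	return (idx * sizeof(struct padded_cpu_entry)) / CACHE_LINE_SIZE;
}

/* Memory footprint of a packed table with n entries. */
static size_t packed_table_bytes(size_t n)
{
	return n * sizeof(int32_t);
}

/* Memory footprint of a padded table with n entries. */
static size_t padded_table_bytes(size_t n)
{
	return n * sizeof(struct padded_cpu_entry);
}
```

With a packed table, neighbouring ids land on the same line, so an
update of one thread's cpu number on migration invalidates the line for
readers running on other cpus. Padding each entry to a full cache line
avoids that, but multiplies the table size by 16 under these
assumptions, which makes the memory footprint concern worse.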

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Follow-Ups:
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Ondřej Bílka

References:
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Linus Torvalds
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Ondřej Bílka
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Ondřej Bílka
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Mathieu Desnoyers
- Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
  - From: Ondřej Bílka
