This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 1/2] Single thread optimization for malloc atomics
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: munroesj at us dot ibm dot com
- Cc: Rich Felker <dalias at libc dot org>, Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, "GNU C. Library" <libc-alpha at sourceware dot org>
- Date: Thu, 1 May 2014 00:04:54 +0200
- Subject: Re: [PATCH 1/2] Single thread optimization for malloc atomics
- Authentication-results: sourceware.org; auth=none
- References: <53610133 dot 3070908 at linux dot vnet dot ibm dot com> <20140430141845 dot GA6882 at domone dot podge> <20140430160618 dot GJ26358 at brightrain dot aerifal dot cx> <1398888765 dot 16559 dot 15 dot camel at spokane1 dot rchland dot ibm dot com>
On Wed, Apr 30, 2014 at 03:12:45PM -0500, Steven Munroe wrote:
> On Wed, 2014-04-30 at 12:06 -0400, Rich Felker wrote:
> > On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> > > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > > > This patch adds a single-thread optimization for malloc atomic usage to
> > > > first check if process is single-thread (ST) and if so use normal
> > > > load/store instead of atomic instructions.
> > > >
> > > How fast is tls on power? When we add a per-thread cache as I suggested
> > > then it would have most of time same performance as singlethread, with
> > > overhead one tls variable access per malloc call.
> >
> > Extremely fast: the TLS address is simply kept in a general-purpose
> > register.
> >
> Depends on the TLS access model.
>
> General Dynamic TLS Model requires a dynamic up-call to __tls_get_addr().
> So slow.
>
> If you can get to the Local Exec or Initial Exec form (where the dtv
> slot or TLS offset can be known at static link time) it can be a simple
> inline computation.
>
> As we are talking about a dynamic library (libc.so) here, you have to
> set this up carefully.
In malloc we already use initial exec; see the tsd_getspecific macro in
malloc/arena.c.
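
For illustration only, here is a minimal sketch of how a variable can be
forced into the initial-exec model with GCC. The variable and function
names are hypothetical and are not the ones glibc uses; only the
tls_model attribute itself is a real GCC feature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread slot (not glibc's actual variable).  The
   attribute asks GCC for the initial-exec model, so an access is a
   thread-pointer-relative load rather than a call to __tls_get_addr().  */
static __thread void *cached_arena
    __attribute__ ((tls_model ("initial-exec")));

/* Store a non-null pointer in the calling thread's slot.  */
void
set_cached_arena (void *arena)
{
  assert (arena != NULL);
  cached_arena = arena;
}

/* Return the calling thread's slot; NULL until the thread sets it.  */
void *
get_cached_arena (void)
{
  return cached_arena;
}
```

The trade-off is that initial-exec variables in a shared library take
space from the static TLS block, which is limited for libraries loaded
later with dlopen.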
By the way, the general slowness of TLS is caused mostly by an inefficient
implementation. If you do not mind adding a pointer variable p in IE form
to each binary, you could emulate TLS by dereferencing p and doing two array
lookups (except for the first access, which triggers a branch that calls
something).
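
A rough sketch of that emulation, under assumptions of my own: p points
to a per-thread table indexed by module, each entry points to that
module's block, and the second lookup is the offset within the block.
MAX_MODULES, MODULE_TLS_SIZE, tls_slow_path and emulated_tls_addr are
made-up names, and the fixed sizes only keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_MODULES 8        /* hypothetical limit, for the sketch only */
#define MODULE_TLS_SIZE 64   /* hypothetical per-module block size */

/* The pointer p in IE form: one cheap access reaches the per-thread table.  */
static __thread void **p __attribute__ ((tls_model ("initial-exec")));

/* Slow path, taken on first access: allocate the table and/or the
   module's block.  Nothing is freed at thread exit in this sketch.  */
static void **
tls_slow_path (size_t module)
{
  if (p == NULL)
    {
      p = calloc (MAX_MODULES, sizeof *p);
      if (p == NULL)
        abort ();
    }
  if (p[module] == NULL)
    {
      p[module] = calloc (1, MODULE_TLS_SIZE);
      if (p[module] == NULL)
        abort ();
    }
  return p;
}

/* Fast path: read p, then two array lookups (table[module], then
   block[offset]).  The branch is taken only before initialization.  */
void *
emulated_tls_addr (size_t module, size_t offset)
{
  assert (module < MAX_MODULES && offset < MODULE_TLS_SIZE);
  void **table = p;
  if (table == NULL || table[module] == NULL)
    table = tls_slow_path (module);
  char *block = table[module];
  return &block[offset];
}
```

After the first call for a given module, every later call in the same
thread stays on the fast path and returns the same address.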