This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.

Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*

From: Oleg Endo <oleg dot endo at t-online dot de>
To: Rich Felker <dalias at libc dot org>
Cc: binutils at sourceware dot org
Date: Sat, 03 Oct 2015 01:56:02 +0900
Subject: Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
Authentication-results: sourceware.org; auth=none
References: <20150929235801 dot GA8408 at brightrain dot aerifal dot cx> <1443612038 dot 2509 dot 140 dot camel at t-online dot de> <20150930142533 dot GC8645 at brightrain dot aerifal dot cx> <20150930143555 dot GD8645 at brightrain dot aerifal dot cx> <1443627005 dot 2509 dot 189 dot camel at t-online dot de> <20150930183810 dot GE8645 at brightrain dot aerifal dot cx> <1443715139 dot 2031 dot 134 dot camel at t-online dot de> <20151001164630 dot GI8645 at brightrain dot aerifal dot cx>

On Thu, 2015-10-01 at 12:46 -0400, Rich Felker wrote:

> > Sorry, I don't understand the relationship of function call overhead and
> > gUSA vs. memory synchronizing instructions ... The "problem" with real
> > function calls (vs specialized libcalls) is that more registers need to
> > be saved/restored than the function actually clobbers.
> 
> Yes, I was just abstracting the time cost of those clobbers. Of course
> there is some code-size issue too that I failed to address.

One of the "properties" (at least according to my understanding) of
atomics is that on average contention probability is low and thus the
code will be a straight read-modify-write sequence.  For a gUSA sequence
like
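	mova	1f,r0		! r0 = end address of the atomic sequence
	mov	r15,r1		! save the real stack pointer in r1
	.align 2
	mov	#(0f-1f),r15	! r15 = negative sequence length: enter gUSA region
0:	mov.l	@r4,r2		! read
	add	#1,r2		! modify
	mov.l	r2,@r4		! write
1:	mov	r1,r15		! restore the stack pointer: leave gUSA region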

We get around 5 cycles on SH4 (the SH4A LLCS version is 4 cycles).  So
it's not much slower than a non-atomic read-modify-write sequence.

If you pipe it through function calls:
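	mov.l	.L3,r0		! load the callee address from the literal pool
	mov	#5,r6		! third argument: memory order (5 = __ATOMIC_SEQ_CST)
	jmp	@r0		! tail call
	mov	#1,r5		! delay slot: second argument, the addend
.L4:
	.align 2
.L3:
	.long	___atomic_fetch_add_4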

That's about 4 cycles for the function call alone, without any reg
saves/restores.  Plus increased probability of an icache miss, yada yada.
At best, this is twice as slow as inlined.
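
At the C level the two variants compared above can be sketched roughly
as follows.  This is only an illustration, assuming GCC's __atomic
builtins; the names fetch_add_inline and fetch_add_out_of_line are made
up, and which instructions the inline variant expands to depends on the
target and the -matomic-model setting, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inline variant: the compiler is free to expand the
   builtin at the call site (e.g. into a short read-modify-write
   sequence, depending on the target). */
static inline int32_t fetch_add_inline(int32_t *p, int32_t v)
{
    assert(p != 0);
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

/* Hypothetical out-of-line variant, standing in for a libcall such as
   __atomic_fetch_add_4.  noinline forces a real call at every use. */
__attribute__((noinline))
int32_t fetch_add_out_of_line(int32_t *p, int32_t v)
{
    assert(p != 0);
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}
```

Both return the old value and leave the same result in memory; the only
difference is whether a call is paid on every use.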

> I think even something like a loop of (pthread) lock/x++/unlock with
> no contention would be an interesting case to check to see if it's
> even plausible that this matters to real-world performance for
> high-level primitives.

Atomics are also used as stand-alone low-level primitives for building
various things like std::shared_ptr.  For those the impact is bigger
than for high-level primitives.
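
As a rough idea of what "building things on atomics" means here, a
minimal reference-counting sketch in C, assuming GCC's __atomic
builtins.  The struct and the rc_* names are hypothetical and this is
not how libstdc++ implements std::shared_ptr; it only shows that every
copy and every destruction costs one atomic read-modify-write.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted box. */
struct rc_box {
    int refs;   /* number of owners */
    int value;  /* payload */
};

/* Allocate a box with a single owner; returns NULL on allocation failure. */
struct rc_box *rc_new(int value)
{
    struct rc_box *b = malloc(sizeof *b);
    if (b) {
        b->refs = 1;
        b->value = value;
    }
    return b;
}

/* Take an additional reference: one atomic increment. */
void rc_retain(struct rc_box *b)
{
    assert(b != NULL);
    __atomic_fetch_add(&b->refs, 1, __ATOMIC_RELAXED);
}

/* Drop a reference: one atomic decrement.  Returns 1 if this call
   dropped the last reference and freed the box, 0 otherwise. */
int rc_release(struct rc_box *b)
{
    assert(b != NULL);
    if (__atomic_fetch_sub(&b->refs, 1, __ATOMIC_ACQ_REL) == 1) {
        free(b);
        return 1;
    }
    return 0;
}
```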

> I'm still really skeptical that all of this multiversioning stuff
> (e.g. building multiple versions of individual functions rather than
> just calling atomic functions or 'function fragments' with multiple
> backends) is premature optimization and a lot of
> complexity/infrastructure for little or no gain.

The runtime performance issues aside, I was trying to come up with
something that works for more than just the atomic problem at hand.
Yes, it's more complex.  Yes, there might be no immediate gain.  It
might pay off later, though.
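
For comparison, the simpler "atomic functions with multiple backends"
approach from the quoted text could look roughly like the sketch below:
one entry point, with the backend picked once at startup.  Everything
here is hypothetical (the cpu_model enum, the backend names, the
selection logic), and both backends use GCC's __atomic builtins purely
as stand-ins for model-specific code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*fetch_add_fn)(uint32_t *, uint32_t);

/* Hypothetical backend A: the compiler builtin. */
static uint32_t fetch_add_builtin(uint32_t *p, uint32_t v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

/* Hypothetical backend B: a compare-and-swap retry loop. */
static uint32_t fetch_add_cas(uint32_t *p, uint32_t v)
{
    uint32_t old = __atomic_load_n(p, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(p, &old, old + v, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
        ;  /* 'old' was refreshed by the failed exchange; retry */
    return old;
}

/* Made-up CPU models, for illustration only. */
enum cpu_model { CPU_MODEL_A, CPU_MODEL_B };

static fetch_add_fn fetch_add_impl = fetch_add_builtin;

/* Pick the backend once, e.g. at program startup. */
void atomics_select_backend(enum cpu_model m)
{
    fetch_add_impl = (m == CPU_MODEL_B) ? fetch_add_cas : fetch_add_builtin;
}

/* Single entry point used by callers; costs an indirect call per use. */
uint32_t my_fetch_add(uint32_t *p, uint32_t v)
{
    assert(p != 0);
    return fetch_add_impl(p, v);
}
```

The indirect call on every operation is exactly the overhead discussed
further up, which is what per-function multiversioning tries to avoid.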

>  (And if the concern
> is size, the runtime code to switching out code with [dynamic-]linker
> like mechanisms is probably much larger than any savings you could
> get.) 

Could be, could be not.  I guess we won't know until it's there (at
least on paper).

> It also seems to require serious forethought about possible
> targets the binary could be moved to rather than just working by
> default.

Reasonable default settings can be configured into the toolchain
that you give your users.  Then you can configure it to output code for
some lowest common denominator target by default.

> The ideal solution in my mind is that applications that really care
> (or think they should care) about size or performance benefits of
> inlining model-specific atomics can use the right -matomic-model
> option and get marked as only working on certain cpu models, and
> otherwise (the majority of) apps where it's not going to matter just
> work by default on anything.

Sometimes these kinds of settings also affect libraries (like libstdc++)
which users usually don't build themselves and simply rely on reasonably
optimized pre-built versions.  For a library such as libstdc++ it's
fairly difficult to predict how it will be used.

> Yes, I just raised it again because it's been a dead-end where it was
> discussed before, and I think it's the elephant in the room when
> you're considering SH ABI issues. Do you have a proposed place we
> should discuss this with the intent of actually reaching consensus on
> a resolution with anyone who may be affected by it?

Maybe this place here is not that bad actually.  It's archived and easy
to find on the net.  Moreover, I believe it all starts with the
features/options provided by the ELF, GAS and LD.  Library/kernel people
tend just to "work with what's there" ... 

> > And the other option is not to use imask but TLS atomics ...
> 
> You mean soft-tcb? Those aren't even implemented in the kernel. 

I've added them mainly with non-linux in mind.

> And the requirement for the range of the offset conflicts with the TLS
> model where those offsets belong to the application's initial-exec
> TLS. This could be avoided by putting an object in crt1.o that would
> 'automatically' get assigned the lowest offset to reserve it, but that
> sounds like an ugly hack and would add TLS sections/program headers to
> all programs.

The offset restriction can be lifted by using other GBR insns.  If
there's interest, this can be done.

> Anyway this response has already gotten really long so I'm dropping
> off here for a bit (sorry I didn't get to the end). I think it would
> be helpful to try to focus further discussion and possibly split it up
> into new topic-specific threads on the appropriate lists. Feel free to
> Cc me right away on any new SH threads you start.

In my opinion, the SH ABI compatibility problem should be addressed at a
global scope, not focusing/limiting a solution to a particular ABI
subset (like atomics) but it'll probably be a lot of work.

Currently we have multilibs which actually work OK.  The compiler should
just list more permutations for targeting different kinds of systems.
This can be used to support your current atomics implementation in musl
by providing a multilib for "atomic function calls" (BTW the file
libgcc/config/sh/linux-atomic.c in GCC might be of interest, too).

One thing that is missing is fine-grained encoding (and checking) of the
current ISA/ABI variations and features that are used in the ELF.
Initially these flags/attributes can be some hardcoded sets selected
with --isa options (or compiler's -m options).  Then GAS directives can
be added to get more fine-grained control via target function
attributes.  Later the compiler and/or GAS/LD can emit them
automatically (based on the code they produce).
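
To make the "encoding and checking" part a bit more concrete, a minimal
sketch of how such feature sets could be represented and checked.  The
bit names and values are invented for illustration; they are not real
SH ELF e_flags values, and the real thing would live in the ELF
headers/attributes and be checked by LD or the loader.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up feature bits; NOT real SH ELF flag values. */
enum {
    FEAT_FPU         = 1u << 0,
    FEAT_ATOMIC_LLCS = 1u << 1,
    FEAT_ATOMIC_GUSA = 1u << 2
};

/* An object can run on a target if every feature it requires is
   provided by that target. */
int features_compatible(uint32_t required, uint32_t provided)
{
    return (required & ~provided) == 0;
}

/* Linking two objects: the result requires the union of what either
   input requires. */
uint32_t features_merge(uint32_t a, uint32_t b)
{
    return a | b;
}
```

With something like this, an object needing FEAT_FPU | FEAT_ATOMIC_GUSA
would be rejected on a target that only provides FEAT_ATOMIC_GUSA,
instead of failing at runtime.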

The other thing is that the current SH ABIs are incompatible with each
other and there are actually a couple of deficiencies.  Some of them are
mentioned here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56592

The -fpu vs -nofpu problem can be solved as it's done on ARM with a
soft-float calling convention (passing all args in GP regs).  Maybe it'd
make sense to define a new ABI for compatibility.

Cheers,
Oleg

Follow-Ups:
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Rich Felker

References:
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Oleg Endo
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Rich Felker
