This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*

From: Rich Felker <dalias at libc dot org>
To: Oleg Endo <oleg dot endo at t-online dot de>
Cc: binutils at sourceware dot org
Date: Fri, 2 Oct 2015 13:52:23 -0400
Subject: Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
Authentication-results: sourceware.org; auth=none
References: <20150929235801 dot GA8408 at brightrain dot aerifal dot cx> <1443612038 dot 2509 dot 140 dot camel at t-online dot de> <20150930142533 dot GC8645 at brightrain dot aerifal dot cx> <20150930143555 dot GD8645 at brightrain dot aerifal dot cx> <1443627005 dot 2509 dot 189 dot camel at t-online dot de> <20150930183810 dot GE8645 at brightrain dot aerifal dot cx> <1443715139 dot 2031 dot 134 dot camel at t-online dot de> <20151001164630 dot GI8645 at brightrain dot aerifal dot cx> <1443804962 dot 2031 dot 290 dot camel at t-online dot de>

On Sat, Oct 03, 2015 at 01:56:02AM +0900, Oleg Endo wrote:
> On Thu, 2015-10-01 at 12:46 -0400, Rich Felker wrote:
> 
> > > Sorry, I don't understand the relationship of function call overhead and
> > > gUSA vs. memory synchronizing instructions ... The "problem" with real
> > > function calls (vs specialized libcalls) is that more registers need to
> > > be saved/restored than the function actually clobbers.
> > 
> > Yes, I was just abstracting the time cost of those clobbers. Of course
> > there is some code-size issue too that I failed to address.
> 
> One of the "properties" (at least according to my understanding) of
> atomics is that on average contention probability is low and thus the
> code will be a straight read-modify-write sequence.  For a gUSA sequence
> like
> 	mova	1f,r0
> 	mov	r15,r1
> 	.align 2
> 	mov	#(0f-1f),r15
> 0:	mov.l	@r4,r2
> 	add	#1,r2
> 	mov.l	r2,@r4
> 1:	mov	r1,r15
> 
> We get around 5 cycles on SH4 (the SH4A LLCS version is 4 cycles).  So
> it's not much slower than a non-atomic read-modify-write sequence.
> 
> If you pipe it through function calls:
> 	mov.l	.L3,r0
> 	mov	#5,r6
> 	jmp	@r0
> 	mov	#1,r5
> .L4:
> 	.align 2
> .L3:
> 	.long	___atomic_fetch_add_4
> 
> That's about 4 cycles just for the function call alone.  Without any reg
> saves/restores.  Plus increased probability of a icache miss yada yada.
> At best, this is twice as slow as inlined.

That's not PIC-compatible, and it also requires additional branch
logic in the called function. So I think it's a lot worse in practice
if you do it that way. I would aim to expose a function pointer for
the runtime-selected version and inline loading that function pointer.
I would also aim to make the calling convention avoid needing a GOT
pointer in the callee and avoid clobbering pr; this can be done by
using a sequence like:

	mova .L1,r0
	mov.l @r0+,rn
	...
	braf r0

.L1:	.long whatever
	...
	[next code]

and the callee returns by jumping to the value it received as r0 or
similar.

Unfortunately it seems hard to make this self-initializing without
access to global data. For libc that's not a problem because I can
ensure that libc init happens before any code that could call the
atomics, but for libatomic-type stuff I think it's difficult.
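
A minimal C sketch of the "function pointer for the runtime-selected
version" idea follows. It is not code from musl or libgcc: every name
here (fetch_add_fn, atomic_select_backend, my_fetch_add, and the two
backends) is hypothetical, and the "restartable" backend is a plain C
stand-in for a gUSA sequence, so it is only safe where the kernel
provides that restart guarantee. The explicit init call illustrates
why self-initialization is the hard part: the pointer must be set
before the first caller runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for a runtime-selected fetch-and-add. */
typedef int (*fetch_add_fn)(volatile int *p, int v);

/* Stand-in for a gUSA-style backend. On real SH hardware this would be
 * the restartable asm sequence; in plain C it is NOT atomic on SMP. */
static int fetch_add_restartable(volatile int *p, int v)
{
    int old = *p;
    *p = old + v;
    return old;
}

/* Stand-in for a hardware (LLCS-style) backend, using the compiler
 * builtin so the sketch stays portable. */
static int fetch_add_builtin(volatile int *p, int v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

/* The exposed function pointer. Callers load it inline and call
 * through it instead of branching inside a shared entry point. */
static fetch_add_fn atomic_fetch_add_ptr;

/* Must run before any caller; libc can guarantee this ordering at
 * startup, a standalone library cannot do so as easily. */
void atomic_select_backend(int have_llcs)
{
    atomic_fetch_add_ptr = have_llcs ? fetch_add_builtin
                                     : fetch_add_restartable;
    assert(atomic_fetch_add_ptr != NULL);
}

/* Inline wrapper: one pointer load plus one indirect call. */
static inline int my_fetch_add(volatile int *p, int v)
{
    return atomic_fetch_add_ptr(p, v);
}
```

Usage would be a single atomic_select_backend() call during startup,
after which my_fetch_add(&x, 1) dispatches with no per-call feature
test.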

IFUNC is rather a mess and non-solution (this has been discussed a lot
in the musl community) and it's not clear how to make it work with
static linking at all.

> > I think even something like a loop of (pthread) lock/x++/unlock with
> > no contention would be an interesting case to check to see if it's
> > even plausible that this matters to real-world performance for
> > high-level primitives.
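
The uncontended lock/x++/unlock loop suggested above could be sketched
roughly as below. This is an illustration only: the function names are
made up, clock() is used for coarse CPU-time measurement, and no
numbers are implied. Comparing the two timings on a given target (with
inline vs. out-of-line atomics underneath the mutex) would be the
actual experiment.

```c
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long counter;

/* Time n iterations of lock/x++/unlock with no other thread running,
 * i.e. the uncontended fast path. Returns CPU seconds. */
double time_locked_increments(long n)
{
    clock_t start = clock();
    for (long i = 0; i < n; i++) {
        int rc = pthread_mutex_lock(&lock);
        assert(rc == 0);
        counter++;
        rc = pthread_mutex_unlock(&lock);
        assert(rc == 0);
        (void)rc; /* silence unused warning when NDEBUG is defined */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Baseline: the same increments without any locking. */
double time_plain_increments(long n)
{
    clock_t start = clock();
    for (long i = 0; i < n; i++)
        counter++;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Accessor so callers can check that every increment was applied. */
long get_counter(void)
{
    return counter;
}
```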
> 
> Atomics are also used as stand-alone low-level primitives for building
> various things like std::shared_ptr.  For that the impact is bigger than
> for high-level primitives.

Agreed.

> > Yes, I just raised it again because it's been a dead-end where it was
> > discussed before, and I think it's the elephant in the room when
> > you're considering SH ABI issues. Do you have a proposed place we
> > should discuss this with the intent of actually reaching consensus on
> > a resolution with anyone who may be affected by it?
> 
> Maybe this place here is not that bad actually.  It's archived and easy
> to find on the net.  Moreover, I believe it all starts with the
> features/options provided by the ELF, GAS and LD.  Library/kernel people
> tend just to "work with what's there" ... 

OK. Do you have an opinion on it, whether we should just drop the
legacy variant of the struct missing the space for floating point
registers, or introduce a personality framework to support two
different ABIs for the structure?

I would prefer the simpler approach (dropping the old struct) if
possible and I'm doubtful that any current software depends on it.

(musl does/will depend on it internally for thread cancellation to
work, but we don't even have support for the old struct right now
because the initial sh port was sh4-oriented and I wasn't even aware
of the nofpu struct variant at the time the port was added.)

> > > And the other option is not to use imask but TLS atomics ...
> > 
> > You mean soft-tcb? Those aren't even implemented in the kernel. 
> 
> I've added them mainly with non-linux in mind.

Understood.

> > And the requirement for the range of the offset conflicts with the TLS
> > model where those offsets belong to the application's initial-exec
> > TLS. This could be avoided by putting an object in crt1.o that would
> > 'automatically' get assigned the lowest offset to reserve it, but that
> > sounds like an ugly hack and would add TLS sections/program headers to
> > all programs.
> 
> The offset restriction can be lifted by using other GBR insns.  If
> there's interest, this can be done.

Negative offsets would at least make it compatible with the TLS ABI,
where the "TCB" is below the thread pointer rather than above.

Of course the TLS ABI design was bad to begin with. There's no
advantage to using the "Type I" form where TCB is below TP(GBR) and
application TLS is above. In theory you would have the advantage of
being able to use small immediate GBR offsets to access some
application TLS, but this can't be done because the compiler can't
know the offset the linker will assign to a particular object and
whether it will be "in range". But that ship has already sailed. I
would strongly oppose doing a gratuitously different TLS ABI just to
fix this; IMO it would only be interesting/worthwhile if doing a new
radically different ABI.
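
To make the "in range" constraint concrete, here is a small C sketch of
the check a tool would need. It assumes, from my reading of the SH
instruction set, that the displacement forms such as
mov.l @(disp,GBR),r0 take an 8-bit unsigned displacement scaled by the
operand size (so 0..1020 for 32-bit accesses). The helper name is
hypothetical. Note that negative offsets never fit, which is why the
TCB-below-TP layout cannot use these forms directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if 'offset' from GBR can be encoded in the short
 * displacement form for an access of 'size' bytes (1, 2 or 4).
 * Assumption: 8-bit zero-extended displacement, scaled by size. */
bool gbr_disp_in_range(long offset, long size)
{
    assert(size == 1 || size == 2 || size == 4);
    if (offset < 0)
        return false;            /* displacement is unsigned */
    if (offset % size != 0)
        return false;            /* must be a multiple of the size */
    return offset / size <= 255; /* must fit in 8 bits after scaling */
}
```

The compiler cannot evaluate such a check at code generation time
because the offset is only assigned at link time, which is the problem
described above.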

> > Anyway this response has already gotten really long so I'm dropping
> > off here for a bit (sorry I didn't get to the end). I think it would
> > be helpful to try to focus further discussion and possibly split it up
> > into new topic-specific threads on the appropriate lists. Feel free to
> > Cc me right away on any new SH threads you start.
> 
> In my opinion, the SH ABI compatibility problem should be addressed at a
> global scope, not focusing/limiting a solution to a particular ABI
> subset (like atomics) but it'll probably be a lot of work.
> 
> Currently we have multilibs which actually work OK.  The compiler should
> just list more permutations for targeting different kinds of systems.
> This can be used to support your current atomics implementation in musl
> by providing a multilib for "atomic function calls" (BTW the file
> libgcc/config/sh/linux-atomic.c in GCC might be of interest, too)

Multilibs solve a completely different problem than forward-compatible
binaries; they're just a shortcut to avoid needing multiple toolchain
variants for different targets. But I generally prefer just having
multiple toolchains so all you need to do is set CC or even just
CROSS_COMPILE rather than worrying with CFLAGS and whether a program's
build process will pass them through right or drop them somewhere...

I realize binary deployability may not seem the most interesting or
agreeable goal on the FSF side, but I think it is worthwhile. I would
much rather have works-anywhere busybox, etc. binaries I can drop onto
an embedded device when exploring/extending it than have to first
figure out the exact ISA-level/ABI and build the right binaries for
it.

> One thing that is missing is fine-grained encoding (and checking) of the
> current ISA/ABI variations and features that are used in the ELF.
> Initially these flags/attributes can be some hardcoded sets selected
> with --isa options (or compiler's -m options).  Then GAS directives can
> be added to get more fine grained control via target function
> attributes.  Later the compiler and/or GAS/LD can emit them
> automatically (based on the code they produce).
> 
> The other thing is that the current SH ABIs are incompatible with each
> other and there are actually a couple of deficits.  Some of them are
> mentioned here https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56592

__fpscr_values was already fixed.

PR 52441 (Double sign/zero extensions for function arguments) sounds
like mostly a non-issue to me. Smaller-than-int arguments are not
conventionally used in C, largely as a historic carry-over from K&R.
ABI-related performance issues with using smaller types probably exist
on various targets, but they generally only affect non-idiomatic code.
But if/when introducing a new ABI it wouldn't hurt to fix this.

"Boolean function return values" sounds like more trouble than it's
worth.

Other proposals sound worthwhile (if only incremental) and could
justify an additional ABI as long as the total gains are significant.
But see below.

> The -fpu vs -nofpu problem can be solved as it's done on ARM with a
> soft-float calling convention (passing all args in GP regs).  Maybe it'd
> make sense to define a new ABI for compatibility.

No new ABI is needed for something analogous to ARM's "softfp"; the
whole point is that the ABI is the same, but use of fpu is allowed
"under the hood".

In general "new ABI" is something I don't like. My view is that "ABI
combinatorics" (or config combinatorics in general) is a huge part of
what led to uClibc's demise. When there are too many combinations to
test (and often more combinations than actual regular ongoing users!)
it's impractical to keep code well-maintained.

Rich

Follow-Ups:
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Oleg Endo

References:
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Oleg Endo
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Rich Felker
- Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
  - From: Oleg Endo
