[PATCH] enable fdpic targets/emulations for sh*-*-linux*

Rich Felker <dalias@libc.org>
Sat Oct 3 18:59:00 GMT 2015


On Sat, Oct 03, 2015 at 06:04:19PM +0900, Oleg Endo wrote:
> On Fri, 2015-10-02 at 13:52 -0400, Rich Felker wrote:
> > > 
> > > We get around 5 cycles on SH4 (the SH4A LLCS version is 4 cycles).  So
> > > it's not much slower than a non-atomic read-modify-write sequence.
> > > 
> > > If you pipe it through function calls:
> > > 	mov.l	.L3,r0
> > > 	mov	#5,r6
> > > 	jmp	@r0
> > > 	mov	#1,r5
> > > .L4:
> > > 	.align 2
> > > .L3:
> > > 	.long	___atomic_fetch_add_4
> > > 
> > > That's about 4 cycles just for the function call alone.  Without any reg
> > > saves/restores.  Plus increased probability of an icache miss yada yada.
> > > At best, this is twice as slow as inlined.
> > 
> > That's not PIC-compatible, and it also requires additional branch
> > logic in the called function. So I think it's a lot worse in practice
> > if you do it that way.
> 
> It was just an example of a minimal function call to demonstrate that
> the smallest possible overhead of atomics-via-calls is 2x.
> 
> >  I would aim to expose a function pointer for
> > the runtime-selected version and inline loading that function pointer.
> 
> Sure, that can be done, too.  Actually, you can have the function
> pointer table in the TLS, which makes it reachable via GBR:
> 	mov.l	@(disp, gbr), r0
> 	jsr	@r0
> 	nop

Again, that's unfortunately not possible because positive offsets from
GBR belong to the application's initial-exec TLS. The TLS ABI really
should have defined GBR to point 1024 bytes below the start of TLS
rather than at the start of TLS, so that up to 1k of TCB space could
be accessed via the short/fast GBR-based addressing. This would not
require reserving that much actual space (which would be a horrible
idea -- huge waste of memory per thread) but would just allow it to
be assigned from the end downwards as needed. This is what most other
RISC archs with limited-range immediates did.

In the big scheme of TCB access it probably doesn't matter. You can
just do:

	stc gbr,r0
	add #imm,r0
	mov.l @r0,...

But it probably means storing atomic code pointers in the TCB isn't
worthwhile compared to just storing them in globals accessed via the
GOT.
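
To be concrete, the globals-via-GOT variant I mean is roughly the
following. This is just a C sketch; all the names are made up for
illustration, and the real selected versions would be asm (gUSA,
SH4A ll/sc, J2 cas.l), not a compiler builtin:

	/* stand-in generic version; the real ones would be the gUSA,
	   ll/sc or cas.l asm sequences */
	static int cas_generic(volatile int *p, int old, int newval)
	{
		__atomic_compare_exchange_n(p, &old, newval, 0,
		                            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
		return old;	/* value observed at *p */
	}

	/* one ordinary global, reached through the GOT in PIC code;
	   assigned once at startup before any threads exist */
	int (*sh_cas_runtime)(volatile int *, int, int) = cas_generic;

	static inline int a_cas(volatile int *p, int old, int newval)
	{
		return sh_cas_runtime(p, old, newval);
	}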

> Because the run-time selection has to be done only once during loading,
> it'd always point to the right function / set of functions.
> There are around 15 * 3 = 45 different __atomic / __sync functions.  If
> having a 45*4 = 180 bytes atomics function table in the TLS is not good,
> it could be just one pointer to the set of selected functions.  The
> function itself will then have to be selected by adding a known constant
> before the jump/call.

Not only is it somewhat unreasonable to have that much TLS waste per
thread; you also do not want to be static linking that much unused
code into every static-linked program. Most programs probably use none
of those, and those which do use atomics probably only use one or two
(e.g. atomic inc and atomic cas, both 32-bit).
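
For comparison, the single-pointer variant you describe would have
roughly this shape (hypothetical names, just to show the idea); my
concern is that a full table drags all ~45 entry points into every
static-linked binary even though most programs touch one or two:

	struct sh_atomics {
		int (*cas_4)(volatile int *, int, int);
		int (*fetch_add_4)(volatile int *, int);
		/* ... the rest of the ~45 __atomic/__sync entry points ... */
	};

	/* one pointer, selected once at load time */
	extern const struct sh_atomics *sh_atomics;

	static inline int a_fetch_add(volatile int *p, int v)
	{
		return sh_atomics->fetch_add_4(p, v);
	}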

> > I would also aim to make the calling convention avoid needing a GOT
> > pointer in the callee and avoid clobbering pr; this can be done by
> > using a sequence like:
> > 
> > 	mova .L1,r0
> > 	mov.l @r0+,rn
> > 	...
> > 	braf r0
> > 
> > .L1:	.long whatever
> > 	...
> > 	[next code]
> > 
> > and the callee returns by jumping to the value it received as r0 or
> > similar.
> 
> If there are other function calls around in that calling function it
> won't be a win because PR will be clobbered anyway.

Indeed. But ideally functions which perform locking are either leaf
functions or have a shrink-wrappable code path that should avoid
setting up a call frame and saving the return address. I doubt the
current sh backend makes any such optimizations, so before we even
think about ugly micro-optimization hacks that require complex
cooperation between different parts of the toolchain and runtime code,
I think we should focus on the big performance problem that would make
a much much bigger difference: very bad codegen by gcc. Aside from
lack of shrink-wrapping, poor handling of the PIC register (like the
way x86 used to handle %ebx, as permanently-reserved and unusable)
stands out as something that needs to be fixed.
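
By "shrink-wrappable" I mean functions with roughly the following
shape (a C sketch with hypothetical names): the uncontended path is
effectively a leaf and shouldn't pay for a call frame or a PR save;
only the contended branch should.

	static __attribute__((noinline))
	void lock_slow(volatile int *lk)
	{
		/* stand-in slow path; a real one would futex-wait */
		while (__atomic_exchange_n(lk, 1, __ATOMIC_ACQUIRE))
			;
	}

	void lock(volatile int *lk)
	{
		/* fast path: one atomic op, no frame, no PR save needed */
		if (!__atomic_exchange_n(lk, 1, __ATOMIC_ACQUIRE))
			return;
		/* only this branch actually needs the call frame */
		lock_slow(lk);
	}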

> > IFUNC is rather a mess and non-solution (this has been discussed a lot
> > in the musl community) and it's not clear how to make it work with
> > static linking at all.
> 
> Any refs to those discussions?

I'd have to dig them up, but the TL;DR is lack of any spec for what
the ifunc resolver function can legally call, ordering issues between
ifunc resolvers, etc. and of course static linking.

> > OK. Do you have an opinion on it, whether we should just drop the
> > legacy variant of the struct missing the space for floating point
> > registers, or introduce a personality framework to support two
> > different ABIs for the structure?
> 
> Sorry, no, I don't have any opinions w.r.t. linux at the moment.

OK.

> > Negative offsets would at least make it compatible with the TLS ABI,
> > where the "TCB" is below the thread pointer rather than above.
> 
> The resulting sequence would look something like this:
> 	mova	1f,r0
> 	mov	r0,r1			// exit point during sequence in r1
> 	mov.l	.Loffset,r0		// or something else to get the constant
> 	or.b	#(0f-1f),@(r0,gbr)	// set sequence length and enter sequence
> 0:	mov.l	@r4,r1
> 	add	#1,r1
> 	mov.l	r1,@r4
> 1:	and.b	#0,@(r0,gbr)		// exit and clear sequence length
> 
> This would allow negative offsets.  However, because of the GBR logical
> insns it'll be slower.  We can also lift the offset restriction of the
> current implementation by not using @(disp,GBR) type insns if the
> specified offset is not in the range as required by the insns.  Please
> open a new GCC PR for this, if you're interested in that.

Conceptually I am interested, but I'm not convinced there's any
practical problem we'd solve, at least not on Linux which is my main
current focus.

I do like your above trick for using negative offsets efficiently. BTW
for small negative offsets (which are the only reasonable ones) you
can avoid the .Loffset and just use an immediate.

This would work for stack protector storing the canary at the end of
the TCB too -- and that would be something interesting to do.
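
Concretely, the canary would sit in the last word of the TCB, just
below the thread pointer, so the check is a short GBR-relative access.
Something like this (hypothetical layout, SH-only inline asm purely to
show the addressing):

	#include <stdint.h>

	static inline uintptr_t get_canary(void)
	{
		uintptr_t tp;
		__asm__("stc gbr,%0" : "=r"(tp));	/* thread pointer */
		/* hypothetical slot: last word of the TCB, below the TP */
		return *(volatile uintptr_t *)(tp - sizeof(uintptr_t));
	}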

> > Multilibs solve a completely different problem than forward-compatible
> > binaries;
> 
> I'm still not sure I understand your definition of "forward-compatible"
> binaries here.  According to my understanding a binary can't really be
> "forward-compatible", unless somebody can precisely predict the future.
> A system can be backward-compatible by being able to run older binaries
> in some way.  Can you please clarify the meaning of
> "forward-compatible"?

For most archs it's very simple -- you have a linear progression of
ISA levels/models, and forwards-compatible just means anything built
with the -march for ISA level A runs on a host with ISA level B, for
any A<B.

For SH it's not exactly a linear progression, but as long as you have
a partial order on ISA levels/cpu models, you can define the same
concept; some levels just become non-comparable. Technically J2 and
SH4A are not comparable because SH4A does not have cas.l, but I think
we're treating cas.l as an optional instruction on J2 (note: current
released builds of the bitstream don't have it) which, on Linux, will
be reported to userspace via a bit in AT_HWCAP. So in this sense, J2
is a sub-ISA of SH4A, where both have "may exist at runtime" cas.l
except that on SH4A the predicate is always-false. :-)
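
So code that wants cas.l would test for it at runtime, something like
the following (the HWCAP bit name and number here are made up; no
actual bit has been assigned yet):

	#include <sys/auxv.h>

	#define HWCAP_SH_CAS_L	(1UL << 0)	/* hypothetical bit */

	static int have_cas_l(void)
	{
		return !!(getauxval(AT_HWCAP) & HWCAP_SH_CAS_L);
	}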

Note that the main big real-world obstacle to forward-compatibility
through an ISA progression is lack of proper atomics/barriers on old
versions of the ISA. Whereas most code for an older ISA runs fine on a
superset of that ISA, if the old ISA lacked real atomics/barriers and
the newer model supports SMP, you're pretty much completely out of
luck. The only hope for the code running without knowledge and
conditional use of the newer ISA extensions is that the OS can
reliably notice and trap whatever old simulated atomics were used and
convert them to something that synchronizes memory. I advised the
OpenRISC developers on this issue early in their porting of musl to
or1k and quickly got real atomics added to the ISA so that they
wouldn't run into a nasty issue like this in the future. OTOH
Linux/MIPS handled the issue just by pretending all MIPS ISA levels
have the ll/sc instructions and requiring the kernel to trap and
emulate them on ancient hardware. That would have worked for J2 as
well but would have given really really bad performance.

> > I realize binary deployability may not seem the most interesting or
> > agreeable goal on the FSF side, but I think it is worthwhile. I would
> > much rather have works-anywhere busybox, etc. binaries I can drop onto
> > an embedded device when exploring/extending it than have to first
> > figure out the exact ISA-level/ABI and build the right binaries for
> > it.
> 
> I think what you describe is more the situation/convenience we have with
> desktop systems.  This compatibility has some price/inefficiency tag
> attached to it.  In embedded systems the whole system is often modified,
> tuned and rebuilt from source/scratch (e.g. buildroot).  Of course it's
> possible to define what "compatibility" means for desktop-SH.  But I
> guess for a niche system/market it's easier to say: "user, for your
> system, please use the binaries/toolchain from this subdirectory".  If
> building those different variants is a problem, then this condition can
> be improved in other ways.

Things have always been done that way with uClibc, but that doesn't
mean it's the right way; I'm trying to do something better with musl.
Trying to micro-optimize out every single code path you possibly can
with highly target-specific knowledge is simply not an efficient path
to small size and performance; it takes too much human maintenance
effort and distracts from the real opportunities for big gains from
higher-level optimizations. As an example, the uclinux msh I started
with for SH2 is something like 48k static-linked as bFLT against the
uClibc we had (from an old Codesourcery toolchain, I think) and 19k
static-linked against musl as FDPIC ELF.

> > > The -fpu vs -nofpu problem can be solved as it's done on ARM with a
> > > soft-float calling convention (passing all args in GP regs).  Maybe it'd
> > > make sense to define a new ABI for compatibility.
> > 
> > No new ABI is needed for something analogous to ARM's "softfp"; the
> > whole point is that the ABI is the same, but use of fpu is allowed
> > "under the hood".
> 
> Right.  You can define the SH2-nofpu ISA/ABI as the base level for your
> system.  Then anything higher than that has to be made backwards
> compatible.  This is what is currently not fully supported by the tools.
> An sh4-linux system should already be able to run a fully self-contained,
> statically linked all-in-one sh2-linux program.

Modulo the sigcontext ABI issue and the gratuitously different syscall
trap numbers (the latter of which I have a pending kernel patch to
fix, but it's not getting any attention because there's no maintainer
for SH and without a maintainer nobody can really touch design/policy
type issues like this...).

> Mixing of pre-compiled
> libraries won't work.

Mixing sh2(nofpu) and sh4-nofpu libraries will work fine because they
use the same ABI (in the sense of calling convention).

The current spectrum of ABIs musl supports is (regex form):

sh(eb)?(-nofpu)?(-fdpic)?

Of course if you're running on actual sh2 hardware all the libs need
to refrain from using instructions from sh3/sh4/sh4a. But the same
dynamic binaries (built for sh2 ISA) can run just fine on sh4 (modulo
the sigcontext issue) with sh4-nofpu versions of the libraries
installed for better performance (and even with hard-float used
internally, like on ARM softfp).

> Of course the end result would be an overall less efficient system and would
> be a step backwards in some cases (copying GP regs <-> FP regs,
> resetting  FPSCR.SZ/.PR, etc).  That's the price of backwards
> compatibility.

Yes.

> > In general "new ABI" is something I don't like. My view is that "ABI
> > combinatorics" (or config combinatorics in general) is a huge part of
> > what led to uClibc's demise. When there are too many combinations to
> > test (and often more combinations than actual regular ongoing users!)
> > it's impractical to keep code well-maintained.
> 
> From my point of view, the tools (compiler, assembler, linker, etc.) should
> provide the options, and toolchain providers should configure it with
> reasonable defaults for their systems.  Yes, testing becomes a bit more
> difficult, but is not impossible.  Some combinations don't get tested
> and occasionally break.  That's life, I guess.
> 
> Going back to your original idea w.r.t. libatomic... a "clean" way of
> achieving what you want might be:
> 
> - add explicit -matomic-model=call
>   (which would also define the corresponding __SH_ATOMIC_MODEL_CALL__
>    and maybe implement some special ABI as above)
> 
> - add support to (somehow) allow different ABIs to be mixed within one
> ELF, e.g. --isa=sh2,sh3,sh4,sh4a,...
> 
> - maybe put the function table etc into libgcc
> 
> With that, there's no need for the libatomic dependency and the
> __atomic* primitives would "just work" (which in turn can be used by
> libatomic).  Then you can configure the toolchain for your system to use
> -matomic-model=call by default.

Something like that sounds okay.
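
I.e. for something like the function below, a toolchain configured
with the proposed -matomic-model=call (the option name is from your
message; it doesn't exist yet) would emit an out-of-line call into
libgcc or wherever the table lives, instead of an inline gUSA/ll-sc
sequence, with the right implementation bound at load time:

	#include <stdint.h>

	int32_t bump(int32_t *p)
	{
		/* lowered to a call under the proposed -matomic-model=call,
		   to an inline sequence under the existing models */
		return __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
	}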

> I wouldn't make it the default for
> sh4-linux or sh4a-linux though.  Those are not fully backwards
> compatible software systems. 

That's actually one of the biggest areas it's needed -- right now,
binaries built for sh4 are not safe to run on sh4a. Their atomics are
non-atomic on sh4a if it's SMP or if they're sharing memory with
programs using the real atomic instructions. This is the original
reason musl implemented the runtime selection of atomics, way before I
even thought about sh2 and nommu support.

Rich


