This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.
Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
- From: Rich Felker <dalias at libc dot org>
- To: Oleg Endo <oleg dot endo at t-online dot de>
- Cc: binutils at sourceware dot org
- Date: Thu, 1 Oct 2015 12:46:30 -0400
- Subject: Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
- Authentication-results: sourceware.org; auth=none
- References: <20150929235801 dot GA8408 at brightrain dot aerifal dot cx> <1443612038 dot 2509 dot 140 dot camel at t-online dot de> <20150930142533 dot GC8645 at brightrain dot aerifal dot cx> <20150930143555 dot GD8645 at brightrain dot aerifal dot cx> <1443627005 dot 2509 dot 189 dot camel at t-online dot de> <20150930183810 dot GE8645 at brightrain dot aerifal dot cx> <1443715139 dot 2031 dot 134 dot camel at t-online dot de>
On Fri, Oct 02, 2015 at 12:58:59AM +0900, Oleg Endo wrote:
> On Wed, 2015-09-30 at 14:38 -0400, Rich Felker wrote:
> > > > >
> > > > > On the musl side, we have all atomics go through a function that
> > > > > chooses which atomic to use based on runtime detection. LLSC (sh4a),
> > > > > GUSA (sh3/4), and imask (sh2 single-core) are supported now and I'm
> > > > > going to add j2 cas.l. For sh4a+ targets, this is optimized out and
> > > > > the inline LLSC atomics are used.
> > >
> > > How is this "optimized out" done?
> > #ifdef __SH4A__
> > If __SH4A__ is defined, this implies an ISA that includes the LLSC
> > instructions (movli.l and movco.l), so we can use them unconditionally.
> > We could do the same for the J2 cas.l instruction and a J2-specific
> > macro,
> I'd rather add something like __SH_ISA_foo__ macros (__SH_ISA_SHLD__,
> __SH_ISA_LLSC__, __SH_ISA_CASL__, ...). After all, I thought the idea
> of the J-core is to be flexible ... people can create various
> variations ... with dynamic shifts, or without .. with cas.l or without
> (why add the cas.l instruction to a custom single core chip?)
Multi-core is in the works; it just requires a larger fpga. We don't
yet have kernel support for SMP though. I'm going to start testing it
soon.
> > but I want to be able to consider J4 (future), SH4 and SH4A as
> > ISA-supersets of J2. The J4 will have cas.l but SH4 and SH4A obviously
> > don't.
> I think it's more like J2 and SH4A share some ISA subset, because SH4A
> does not implement the cas.l insn.
Yes, that's why I said "consider as" -- I meant it in the sense of a
compatibility path/hierarchy, not strict ISA.
> > > That'd sort of defeat the purpose of compiler-provided inlined atomic
> > > builtins. Every atomic op would have to go through a function call. A
> > > safe but not a very attractive default, e.g. when one wants to build a
> > > SH4-only system. We can try to lower the function call overhead with
> > > some special atomic library function ABI (as it's done for e.g. shifts
> > > and division libfuncs), if that helps.
> > Perhaps, but GUSA-style atomics are likely so much faster than actual
> > memory-synchronizing instructions that the total cost is still
> > relatively small.
> Sorry, I don't understand the relationship of function call overhead and
> gUSA vs. memory synchronizing instructions ... The "problem" with real
> function calls (vs specialized libcalls) is that more registers need to
> be saved/restored than the function actually clobbers.
Yes, I was just abstracting the time cost of those clobbers. Of course
there is some code-size issue too that I failed to address.
> > I have considered doing a custom calling convention for this in musl's
> > atomics (and "calling" the "functions" from inline asm) but before
> > spending effort on that I'd want to see if there's actually a
> > practical performance issue. I'm not a fan of premature optimization.
> If you happen to find/create some benchmarks, please share. I could not
> find anything but micro benchmarks comparing the performance of an
> atomic add to a non-atomic add in a tight loop. Of course those are not
> real-world scenarios of atomics usage in real programs. I guess a good
> benchmark would be to take a huge piece of software that uses e.g.
> std::shared_ptr a lot. On the other hand, there must be a reason why
> quite a lot of compiler backends implement builtin atomics. Avoiding
> function call overhead around a single instruction (e.g. cas.l) or a few
> instructions makes sense somehow.
I think even something like a loop of (pthread) lock/x++/unlock with
no contention would be an interesting case to check to see if it's
even plausible that this matters to real-world performance for
high-level primitives. For maximizing the relative time spent in lock
functions and their atomics, that's the most extreme practical
real-world use. If the difference is not seen there, it should not
matter in other real-world use. If it is seen there, more study on
real-world cases is needed.
> > Oh, yes. But I don't see a good way to do this automatically without
> > adversely affecting programs that use runtime selection. It would be
> > really unfortunate for programs to get flagged as incompatible just
> > because they contain code that's not compatible, when they're
> > explicitly avoiding using that code.
> If the ELF contains information about:
> - which ISA/ABI/... it requires
> - which ISA/ABI/... it provides
> (- maybe something else ...)
> .... then the loader can rather easily determine the compatibility of an
> ELF and the system.
> For example, if an ELF has been built for J2 and uses some atomics and
> FP code, and somehow contains code for SH4A and J2 atomics (function
> multi versioning maybe) it would say:
> - requires: SH2 CPU, SW FPU, "FP-args in GP-regs", any compatible
> atomics support
> - provides: atomic-model=hard-casl, hard-llcs
> A J2 or SH4A system would be able to run this ELF (by selecting the
> according function versions in the ELF or telling the ELF how to select
> them). An SH4 (uni-core with gUSA) or SH2 (uni-core, TAS atomics) would
> not because it can't do any of the atomic models provided by the ELF.
This all sounds pretty reasonable.
> The ISA/ABI thing can be fine grained, down to individual instructions.
> At least it should be a bit more fine grained than the current SH ELF
> flags, because there clearly are more possible variants out there than
> encoded in the ELF at the moment.
> If we had something like this, the compiler can be taught to emit
> multiple function versions automatically. E.g. if a function contains
> an atomic op and "atomic-model=hard-casl,hard-llcs" it will compile two
> functions. The same can also be applied for FP usage or any other
> ISA/ABI feature like DSP extensions or whatever. So you'd get some sort
> of fat/universal-binary which can run on different architectures. The
> binary itself can do the runtime switching. Or if the loader/linker is
> sophisticated enough, it can just dynamically link/not link the
> functions which are not needed. E.g. if a binary built for SH2,SW-FPU
> and SH4,SH4-FPU is loaded on an SH4 system, all the SH2,SW-FPU functions
> would become unreferenced and hence will not get loaded. The same will
> happen at static link time if that binary/library is linked/stripped for
> an SH4-only system.
> Of course none of that will happen automatically at the moment, because
> nothing implements it. It's just my idea how it could work. I think it
> also could be an alternative or replacement for the current multilibs.
I still really suspect that all of this multiversioning stuff
(e.g. building multiple versions of individual functions rather than
just calling atomic functions or 'function fragments' with multiple
backends) is premature optimization and a lot of
complexity/infrastructure for little or no gain. (And if the concern
is size, the runtime code for switching code in and out with
[dynamic-]linker-like mechanisms is probably much larger than any
savings you could get.) It also seems to require serious forethought
about possible targets the binary could be moved to, rather than just
working by default.
The ideal solution in my mind is that applications that really care
(or think they should care) about size or performance benefits of
inlining model-specific atomics can use the right -matomic-model
option and get marked as only working on certain cpu models, and
otherwise (the majority of) apps where it's not going to matter just
work by default on anything.
> > There's a big problem here right now: the definition of
> > sigcontext/mcontext_t is dependent on whether the _kernel_/_cpu_ has
> > fpu, and thus programs built for the no-fpu layout are incompatible
> > with kernels for fpu-supporting cpus. (and sh4-nofpu ABI binaries
> > cannot run _anywhere_ correctly). I want to fix this in the kernel by
> > just always using the same structure layout with space for fpu
> > registers (and possibly having a personality bit for the old one if
> > people think it's necessary for backwards-compat) but the fact that
> > there's presently no maintainer for SH makes it really hard to advance
> > any changes that would be "policy"-like... :(
> I think this is beyond the scope of this list. But maybe now is a good
> time to straighten some things here and there...
Yes, I just raised it again because it's been a dead-end where it was
discussed before, and I think it's the elephant in the room when
you're considering SH ABI issues. Do you have a proposed place we
should discuss this with the intent of actually reaching consensus on
a resolution with anyone who may be affected by it?
> > > > A program compiled to use imask will
> > > > crash on cpus/kernels with privilege enforcement (although perhaps it
> > > > could trap and emulate).
> > >
> > > Yes, that's one option.
> And the other option is not to use imask but TLS atomics ...
You mean soft-tcb? Those aren't even implemented in the kernel. And
the requirement for the range of the offset conflicts with the TLS
model where those offsets belong to the application's initial-exec
TLS. This could be avoided by putting an object in crt1.o that would
'automatically' get assigned the lowest offset to reserve it, but that
sounds like an ugly hack and would add TLS sections/program headers to
every binary.
> > The most important usage case I can think of right off is being able
> > to build software whose build systems are not cross-friendly by
> > running on the higher-performance MMU-ful hardware at build time.
> Not sure I get it ... so you want to have the benefits of a self-hosted
> system, on a target which actually can't self-host itself? Why not
> simply build/install a cross toolchain configured properly for the
> target?
Restricting to build systems/applications that are cross-compile
friendly is still a fairly big constraint. Ideally everything would
just work but it's not so simple. Buildroot does a fairly good job of
making this work; Aboriginal Linux on the other hand wants to get a
native environment with minimal cross-compiling and then be able to
build further software in a native environment. And there's a good
deal of software whose build systems are broken for cross-compiling;
anything using gnulib still has a number of potential issues for a
musl-based host because it makes pessimistic assumptions about which
standard functions are broken and then tries to replace them with its
own versions.
But anyway this is getting pretty off-topic...
> > I don't know if J2/SH2 -> SH4A is likely to be a real-world transition
> > path anyone would actually care about, mainly because I don't have any
> > experience with SH4A hardware.
> I guess this depends on the objectives of the J-core developers.
There's not a single objective/vision here; I'm trying to reflect a
spectrum of interests. Part of this is my interest from the basis of
musl's philosophy on things like minimizing arch-specific code and
behavior, widely deployable binaries that are not tightly linked to a
particular host machine or fs layout etc. (for example, we put a lot
of research effort into making armv5 binaries, which normally use
kuser_helper, compatible with v6/v7 kernels where kuser_helper has
been removed for hardening). But it's also related to the
0pf.org vision for J-Core (http://0pf.org/j-core.html):
Architecture: Our general philosophy is the following:
- ISA honors instructions sets from old CPUs.
- Preexistent executables from old CPUs runs on J Series.
- J executables runs on future J Series.
Anyway this response has already gotten really long so I'm dropping
off here for a bit (sorry I didn't get to the end). I think it would
be helpful to try to focus further discussion and possibly split it up
into new topic-specific threads on the appropriate lists. Feel free to
Cc me right away on any new SH threads you start.