This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.



Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*


On Wed, 2015-09-30 at 14:38 -0400, Rich Felker wrote:
> > > > 
> > > > On the musl side, we have all atomics go through a function that
> > > > chooses which atomic to use based on runtime detection. LLSC (sh4a),
> > > > GUSA (sh3/4), and imask (sh2 single-core) are supported now and I'm
> > > > going to add j2 cas.l. For sh4a+ targets, this is optimized out and
> > > > the inline LLSC atomics are used.
> > 
> > How is this "optimized out" done?
> 
> #ifdef __SH4A__
> 
> If __SH4A__ is defined, this implies an ISA that includes the LLSC
> instructions (movli.l and movco.l) so we can use them unconditionally.
> 
> We could do the same for the J2 cas.l instruction and a J2-specific
> macro, 

I'd rather add something like __SH_ISA_foo__ macros (__SH_ISA_SHLD__,
__SH_ISA_LLSC__, __SH_ISA_CASL__, ...).  After all, I thought the idea
of the J-core is to be flexible ... people can create various
variations ... with dynamic shifts, or without .. with cas.l or without
(why add the cas.l instruction to a custom single core chip?)
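
To make the per-feature macro idea concrete, here is a minimal sketch.
The __SH_ISA_CASL__ and __SH_ISA_LLSC__ names are only the proposal from
this mail (no released compiler defines them), the enum and function
names are made up for illustration, and the body uses C11 atomics as a
stand-in for the SH inline asm so that it builds on any host:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-feature macros in the style proposed above.
   No released compiler defines __SH_ISA_CASL__ or __SH_ISA_LLSC__. */
enum atomic_backend {
    BACKEND_CASL,     /* single cas.l instruction (J2) */
    BACKEND_LLSC,     /* movli.l / movco.l loop (SH4A) */
    BACKEND_RUNTIME   /* no ISA guarantee: fall back to runtime selection */
};

#if defined(__SH_ISA_CASL__)
#  define ATOMIC_BACKEND BACKEND_CASL
#elif defined(__SH_ISA_LLSC__) || defined(__SH4A__)
#  define ATOMIC_BACKEND BACKEND_LLSC
#else
#  define ATOMIC_BACKEND BACKEND_RUNTIME
#endif

/* Report which backend was chosen at compile time. */
static inline enum atomic_backend atomic_backend(void)
{
    return ATOMIC_BACKEND;
}

/* Stand-in body: C11 atomics instead of SH inline asm, so the sketch
   builds on any host.  A real port would select on ATOMIC_BACKEND here. */
static inline int fetch_add(atomic_int *p, int v)
{
    assert(p);
    return atomic_fetch_add(p, v);
}
```

The point is that the test is on the feature (cas.l, LLSC), not on a CPU
name, so a custom core with or without cas.l picks the right path
without needing its own CPU macro.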

> but I want to be able to consider J4 (future), SH4 and SH4A as
> ISA-supersets of J2. The J4 will have cas.l but SH4 and SH4A obviously
> don't.

I think it's more like J2 and SH4A share some ISA subset, because SH4A
does not implement the cas.l insn.

> > That'd sort of defeat the purpose of compiler-provided inlined atomic
> > builtins.  Every atomic op would have to go through a function call.  A
> > safe but not a very attractive default, e.g. when one wants to build a
> > SH4-only system.  We can try to lower the function call overhead with
> > some special atomic library function ABI (as it's done for e.g. shifts
> > and division libfuncs), if that helps.
> 
> Perhaps, but GUSA-style atomics are likely so much faster than actual
> memory-synchronizing instructions that the total cost is still
> relatively small.

Sorry, I don't understand the relationship of function call overhead and
gUSA vs. memory synchronizing instructions ... The "problem" with real
function calls (vs specialized libcalls) is that more registers need to
be saved/restored than the function actually clobbers.

> I have considered doing a custom calling convention for this in musl's
> atomics (and "calling" the "functions" from inline asm) but before
> spending effort on that I'd want to see if there's actually a
> practical performance issue. I'm not a fan of premature optimization.

If you happen to find/create some benchmarks, please share.  I could not
find anything but micro benchmarks comparing the performance of an
atomic add to a non-atomic add in a tight loop.  Of course those are not
real-world scenarios of atomics usage in real programs.  I guess a good
benchmark would be to take a huge piece of software that uses e.g.
std::shared_ptr a lot.  On the other hand, there must be a reason why
quite a lot of compiler backends implement builtin atomics.  Avoiding
function call overhead around a single instruction (e.g. cas.l) or a few
instructions makes sense somehow.

> Oh, yes. But I don't see a good way to do this automatically without
> adversely affecting programs that use runtime selection. It would be
> really unfortunate for programs to get flagged as incompatible just
> because they contain code that's not compatible, when they're
> explicitly avoiding using that code.

If the ELF contains information about:
- which ISA/ABI/... it requires
- which ISA/ABI/... it provides
(- maybe something else  ...)

... then the loader can rather easily determine the compatibility of an
ELF and the system.

For example, if an ELF has been built for J2 and uses some atomics and
FP code, and somehow contains code for SH4A and J2 atomics (function
multi versioning maybe) it would say:

- requires: SH2 CPU, SW FPU, "FP-args in GP-regs", any compatible
atomics support
- provides: atomic-model=hard-casl, hard-llcs

A J2 or SH4A system would be able to run this ELF (by selecting the
appropriate function versions in the ELF or telling the ELF how to select
them).  An SH4 (uni-core with gUSA) or SH2 (uni-core, TLS atomics) would
not because it can't do any of the atomic models provided by the ELF.
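
The requires/provides check described above could look roughly like
this.  It is only a sketch: the feature bits, struct layouts and function
names are all invented for illustration, and nothing like them exists in
the current SH ELF flags or in any loader:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; not part of any real ELF format. */
enum { ISA_SH2 = 1u << 0, ISA_SH4_FPU = 1u << 1 };
enum {
    ATOMIC_CASL = 1u << 0,   /* hard-casl */
    ATOMIC_LLCS = 1u << 1,   /* hard-llcs */
    ATOMIC_GUSA = 1u << 2    /* gUSA, uni-core only */
};

struct elf_features {
    uint32_t required_isa;      /* every bit must be present on the system */
    uint32_t provided_atomics;  /* atomic models the ELF carries code for;
                                   0 means the ELF uses no atomics */
};

struct system_features {
    uint32_t isa;               /* ISA features the system implements */
    uint32_t atomics;           /* atomic models that work on the system */
};

/* Atomic models that the ELF provides and the system can execute. */
static uint32_t usable_atomics(const struct elf_features *elf,
                               const struct system_features *sys)
{
    assert(elf && sys);
    return elf->provided_atomics & sys->atomics;
}

/* True if all required features are present and, when the ELF uses
   atomics at all, at least one provided model works on the system. */
static bool can_run(const struct elf_features *elf,
                    const struct system_features *sys)
{
    assert(elf && sys);
    if ((elf->required_isa & sys->isa) != elf->required_isa)
        return false;
    return elf->provided_atomics == 0 || usable_atomics(elf, sys) != 0;
}
```

With the example ELF above (requires SH2, provides hard-casl and
hard-llcs), can_run() accepts a system offering cas.l or LLSC and rejects
one that only offers gUSA.  usable_atomics() is what the loader would use
to pick the function versions.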

The ISA/ABI thing can be fine grained, down to individual instructions.
At least it should be a bit more fine grained than the current SH ELF
flags, because there clearly are more possible variants out there than
encoded in the ELF at the moment.

If we had something like this, the compiler could be taught to emit
multiple function versions automatically.  E.g. if a function contains
an atomic op and "atomic-model=hard-casl,hard-llcs" is specified, it will
compile two versions of it.  The same can also be applied to FP usage or
any other
ISA/ABI feature like DSP extensions or whatever.  So you'd get some sort
of fat/universal-binary which can run on different architectures.  The
binary itself can do the runtime switching.  Or if the loader/linker is
sophisticated enough, it can just dynamically link/not link the
functions which are not needed.  E.g. if a binary built for SH2,SW-FPU
and SH4,SH4-FPU is loaded on an SH4 system, all the SH2,SW-FPU functions
would become unreferenced and hence will not get loaded.  The same will
happen at static link time if that binary/library is linked/stripped for
an SH4-only system.
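
The "binary itself does the runtime switching" variant can be sketched
with a plain function-pointer table.  Again this is only an illustration:
the names are made up, both versions use C11 atomics as stand-ins for
code that would really be compiled for cas.l and LLSC, and how the
system's capabilities are detected is left out entirely:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model numbers; MODEL_GUSA has no version in this binary. */
enum atomic_model { MODEL_CASL, MODEL_LLCS, MODEL_GUSA };

typedef int (*fetch_add_fn)(atomic_int *, int);

/* Stand-ins for the two versions the compiler would emit. */
static int fetch_add_casl(atomic_int *p, int v) { return atomic_fetch_add(p, v); }
static int fetch_add_llcs(atomic_int *p, int v) { return atomic_fetch_add(p, v); }

static const struct {
    enum atomic_model model;
    fetch_add_fn fn;
} versions[] = {
    { MODEL_CASL, fetch_add_casl },
    { MODEL_LLCS, fetch_add_llcs },
};

static fetch_add_fn active_fetch_add;

/* Bind the first version whose model bit is set in system_models.
   Returns the chosen model, or -1 if the binary has no usable version. */
static int select_version(unsigned system_models)
{
    for (size_t i = 0; i < sizeof versions / sizeof versions[0]; i++) {
        if (system_models & (1u << versions[i].model)) {
            active_fetch_add = versions[i].fn;
            return (int)versions[i].model;
        }
    }
    return -1;
}

/* Callers go through the pointer bound by select_version(). */
static int fetch_add(atomic_int *p, int v)
{
    assert(active_fetch_add != NULL);
    return active_fetch_add(p, v);
}
```

select_version() would run once at startup.  A more sophisticated
loader could do the same binding at load time and never map the unused
versions.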

Of course none of that will happen automatically at the moment, because
nothing implements it.  It's just my idea of how it could work.  I think
it could also be an alternative to or replacement for the current multilibs.

> There's a big problem here right now: the definition of
> sigcontext/mcontext_t is dependent on whether the _kernel_/_cpu_ has
> fpu, and thus programs built for the no-fpu layout are incompatible
> with kernels for fpu-supporting cpus. (and sh4-nofpu ABI binaries
> cannot run _anywhere_ correctly). I want to fix this in the kernel by
> just always using the same structure layout with space for fpu
> registers (and possibly having a personality bit for the old one if
> people think it's necessary for backwards-compat) but the fact that
> there's presently no maintainer for SH makes it really hard to advance
> any changes that would be "policy"-like... :(

I think this is beyond the scope of this list.  But maybe now is a good
time to straighten some things here and there...

> > > A program compiled to use imask will
> > > crash on cpus/kernels with privilege enforcement (although perhaps it
> > > could trap and emulate).
> > 
> > Yes, that's one option.

And the other option is not to use imask but TLS atomics ...
 
> The most important usage case I can think of right off is being able
> to build software whose build systems are not cross-friendly by
> running on the higher-performance MMU-ful hardware at build time.

Not sure I get it ... so you want to have the benefits of a self-hosted
system, on a target which actually can't self-host?  Why not simply
build/install a cross toolchain configured properly for the target?

> I don't know if J2/SH2 -> SH4A is likely to be a real-world transition
> path anyone would actually care about, mainly because I don't have any
> experience with SH4A hardware.

I guess this depends on the objectives of the J-core developers.

> But I also think it's a much bigger
> mess to be trying to figure out which transition paths make sense and
> building a forward-compatibility model around the resulting
> assumptions than to just use simple runtime detection.

Sure, runtime switching is one option.  But I don't understand what you
mean by "forward compatible" here...
The transition path wouldn't be much of a problem if there was something
like my idea above (requires, provides feature flags etc) ... it would
either start/link and run or not.


> Programs that actually need high-performance atomics on multiple
> objects with low latency between them will likely ignore
> forward-compatibility concerns and simply hard-code whatever option
> makes sense for the hardware they're targeted to. But this is likely
> to be a very small minority of programs.

Well, actually I've seen quite a lot of software out there implementing
its own inline-asm atomics for various architectures ... just the way
you do it in musl.  The alternative is to just use compiler built-ins
and specify the target CPU/options in the build system ...

> > Anyway, I still think there should be more flags or SH attributes in ELF
> > for encoding all the various ABIs and options.
> 
> I'm not opposed to this as long as they don't break usage cases that
> otherwise would/should work. One approach I'd be happy to see is first
> adding support for runtime switching in gcc and making that the
> default,

I wouldn't make it the default (yet).  But there could be a configure
option to create a toolchain where it's the default.

>  then setting the ABI flags via new gas directives if the
> default is overridden and the choice requires a specific cpu model
> that's not forwards-compatible.

The --isa options can also be extended to accept multiple values, but it
might be difficult to distinguish which functions are for which ISA/ABI.
So for GAS, directives make sense.  I guess target function attributes
would complement it ( https://gcc.gnu.org/wiki/FunctionMultiVersioning )

The --isa option would make sense for LD though, when one wants to link
multi-versioned libraries into an SH2-only binary image (stripping out the
unwanted code at link time).

Cheers,
Oleg

