This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB

From: Ondřej Bílka <neleai at seznam dot cz>
To: David Edelsohn <dje dot gcc at gmail dot com>
Cc: Torvald Riegel <triegel at redhat dot com>, Richard Henderson <rth at twiddle dot net>, munroesj at us dot ibm dot com, szabolcs dot nagy at arm dot com, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
Date: Tue, 30 Jun 2015 22:56:39 +0200
Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
References: <CAGWvnynvibieMA_7D3-DnNG-BFRLQsn4OeOv_=r1gKyDpMgRXw at mail dot gmail dot com>

On Tue, Jun 30, 2015 at 01:49:26PM -0400, David Edelsohn wrote:
> On Tue, 2015-06-30 at 18:01 +0200, Torvald Riegel wrote:
> 
> > I don't think we promise to do everything for everyone.  That does not
> > conflict with free software.
> 
> The request is not "do everything for everyone".  That is a strawman argument.
> 
> The issue is: should the GLIBC community and leaders use their
> position to pick favorites for architectures and ABIs.  If the GLIBC
> community wants to say that GLIBC features and architecture and
> framework will be designed for the most prevalent, commodity
> architecture and ABI (Intel x86-64) and all other architectures can
> live on the table scraps when the benevolent dictators magnanimously
> choose to throw them a bone, that's fine as long as it is explicit.
> 
> The proposal is not a request to restructure core, common parts of
> GLIBC in badly designed ways.  The proposal is not advocating use of
> this feature in this programming style as exemplary.  It's a
> least-bad, target-specific patch to deal with a performance problem of
> real-world, customer code seen in the field.
> 
> Admonishing IBM that it should lecture customers on how to write their
> code or essentially declaring that the idioms adopted by most
> programmers because of Intel x86 prevalence are somehow preferred as
> if they were handed down on silver tablets is arrogant and naive.
> 
> If you make your garden too exclusive, don't be surprised when people
> plant alternative gardens.
>
The main problem here is that Steve doesn't answer the questions that would
convince us, and ignores our suggestions.

First, I objected that if only one application in 1000 uses hwcap and
saves 100 cycles, it could be an overall loss when the setup wastes a
cycle in every other application. Could you clarify why it wouldn't be a
problem to set up hwcap for each thread even when it is not used?
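To make the objection concrete, here is a toy model using the numbers
above: one hwcap user in 1000 applications, 100 cycles saved, and about
one cycle of setup wasted in every other application. These are
illustrative assumptions, not measurements, and `net_cycles` is a
hypothetical helper.

```c
/*
 * Toy model of the objection (illustrative numbers, not measurements):
 * `users` applications each save `saved_per_user` cycles, while every
 * other application pays `cost_per_app` cycles of hwcap setup it never
 * uses. A negative result means an overall loss.
 */
static long net_cycles(long total_apps, long users,
                       long saved_per_user, long cost_per_app)
{
    long benefit = users * saved_per_user;
    long waste = (total_apps - users) * cost_per_app;
    return benefit - waste;
}
```

With these inputs the net is 100 - 999 = -899 cycles; the unused setup
would have to cost less than about 0.1 cycles per application (100/999)
just to break even.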

Then I asked why you use the TCB when you could just define

extern int __hwcap, __hwcap2;

Steve answered that it saves the cost of the PLT. That led to a
discussion, as it is common practice in instruction selection to keep the
selection overhead below 1% by selecting at a coarser granularity.
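A minimal sketch of a global variable combined with coarse-grained
selection, assuming Linux/glibc with `getauxval` (glibc 2.16 or later).
The hwcap word is read once at startup into a private static, since an
exported `__hwcap` was only a proposal. `HYPOTHETICAL_FEATURE_BIT` and
`sum_array` are illustrative names, and `sum_tuned` merely stands in for
a CPU-specific loop.

```c
#include <stddef.h>
#include <sys/auxv.h>

/* Hypothetical bit standing in for whatever PPC_FEATURE2_* (or other
   target) flag the tuned code actually needs. */
#define HYPOTHETICAL_FEATURE_BIT (1UL << 0)

/* Read once per process and shared by all threads: no per-thread setup. */
static unsigned long cached_hwcap;

__attribute__((constructor))
static void init_hwcap(void)
{
    cached_hwcap = getauxval(AT_HWCAP);
}

/* Baseline loop. */
static double sum_generic(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a CPU-specific loop (two accumulators); for general
   data it may round differently from sum_generic. */
static double sum_tuned(const double *v, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)
        s0 += v[i];
    return s0 + s1;
}

/* One hwcap test per call, amortized over all n elements. */
double sum_array(const double *v, size_t n)
{
    if (cached_hwcap & HYPOTHETICAL_FEATURE_BIT)
        return sum_tuned(v, n);
    return sum_generic(v, n);
}
```

The point is where the test sits: one check per call on a whole array
keeps the selection overhead negligible, however the hwcap word itself
is reached.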

Then Florian found an excellent solution: encode hwcap into an address.
The loader would set up the entries so that you could get hwcap by using

(int) &__hwcap_hack

That is generic, as it is cross-platform and allows quick access to any
run-time constant. It also avoids the problem of having to do the setup
on a per-thread basis.

Steve just ignored that.

These better alternatives have nothing to do with x86-64, as they are
relatively generic. I repeatedly asked whether just using the appropriate
-mcpu would lead to a bigger performance gain than trying to use hwcap.
I didn't get an answer; could you clarify?
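The -mcpu question can be illustrated with compile-time selection. GCC on
POWER predefines `_ARCH_PWR7` when the target ISA includes power7
instructions (for example with -mcpu=power7 or -mcpu=power8), so the
preprocessor makes the choice and no hwcap test remains in the binary.
`count_bits` is a hypothetical example; `__builtin_popcountll` alone would
already use the hardware instruction when -mcpu allows it, so the #if only
makes the choice explicit.

```c
/* With GCC on POWER, _ARCH_PWR7 is predefined when the target ISA
   includes power7 instructions, so this choice is made at compile
   time and no run-time hwcap test is emitted. */
static unsigned count_bits(unsigned long long x)
{
#if defined(_ARCH_PWR7)
    /* power7 and later: the builtin maps to the popcntd instruction. */
    return (unsigned)__builtin_popcountll(x);
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
#endif
}
```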

Also, could you provide an example where the PLT indirection in accessing
hwcap would make a difference? Steve's example was to build a binary
optimized for power8 that could also be run on power7 for verification
purposes.
Do you think it is a good idea to use hwcap for that, instead of making
separate testing and production versions, where testing would use power7
and production would be compiled with power8 support, which is faster
because the hwcap checks are omitted?

The main problem here is that distro maintainers don't like the increased size.

0. Adopting the Gentoo model. When a user knows that he won't change CPU,
he could compile all open-source packages from source with the
appropriate flags. That could be done by a background process that looks
for source packages in any distribution. That requires nobody's
permission, and all hwcap checks are optimized out.

1. A fat binary/library approach. Steve said that the average programmer
won't use AT_LIBRARY. I said that he would, given appropriate packaging.

For shared libraries, would it be too hard to ask the programmer or
package maintainer to do the following two things?

1. Add the -mmulticpu flag when compiling.
2. When copying the .so, replace cp with multicpu_cp.

That could be done by a relatively simple gcc wrapper. When compiling, it
would create a subdirectory for each AT_LIBRARY choice and compile the
object files into those directories with the appropriate flags; when
linking, it would again iterate over the AT_LIBRARY choices and replace
foo.o arguments with at_library_subdir/foo.o.

That eliminates the need to optimize hwcap access at all, as the checks
could be optimized out as constants.
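A sketch of the argument rewriting such a wrapper would do at link time,
assuming per-platform subdirectories named after the target (here
`power8`). `rewrite_arg` is hypothetical and only handles the foo.o case
described above; placing the output library in the matching subdirectory
is left out.

```c
#include <stdio.h>
#include <string.h>

/* Rewrite an object-file argument "foo.o" to "<dir>/foo.o"; any other
   argument (flags, libraries) is copied unchanged.
   Returns 0 on success, -1 if the output buffer is too small. */
static int rewrite_arg(const char *arg, const char *dir,
                       char *out, size_t len)
{
    size_t n = strlen(arg);
    int r;

    if (n > 2 && strcmp(arg + n - 2, ".o") == 0)
        r = snprintf(out, len, "%s/%s", dir, arg);
    else
        r = snprintf(out, len, "%s", arg);
    return (r < 0 || (size_t)r >= len) ? -1 : 0;
}
```

The wrapper would apply it to every argument, once per subdirectory, and
run one link per platform.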

For binaries you would use the same approach, but add a constructor that
selects the real binary with
if (hwcap & HAVE_FOO)
  execv(strcat(cwd, "/path/app"), argv);

or falls back to the default implementation if the variants are not accessible.

My opinion is that this is the best alternative.
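A somewhat safer version of the re-exec idea for binaries, assuming
Linux/glibc `getauxval` and a hypothetical `HAVE_FOO` bit and install
directory. It builds the path with `snprintf` instead of `strcat` into
`cwd`, checks that the variant is executable, and otherwise returns so the
default build keeps running. It is written as a function to call at the
top of `main`; wiring it into a constructor is left out.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/auxv.h>

/* Hypothetical feature bit standing in for HAVE_FOO above. */
#define HAVE_FOO (1UL << 0)

/* Build "<dir>/<name>" with bounds checking instead of strcat() into cwd.
   Returns 0 on success, -1 if name is not a bare file name or the
   result does not fit. */
static int variant_path(const char *dir, const char *name,
                        char *out, size_t len)
{
    if (strchr(name, '/') != NULL)
        return -1;
    int r = snprintf(out, len, "%s/%s", dir, name);
    return (r < 0 || (size_t)r >= len) ? -1 : 0;
}

/* If the CPU has the feature and the tuned binary is executable, replace
   this process with it; otherwise return so the caller continues with
   the default build. The tuned binary must not contain this check, or
   it would re-exec itself forever. */
static void maybe_exec_variant(char **argv, const char *dir,
                               const char *name)
{
    char path[4096];

    if (!(getauxval(AT_HWCAP) & HAVE_FOO))
        return;
    if (variant_path(dir, name, path, sizeof path) != 0)
        return;
    if (access(path, X_OK) != 0)
        return;
    execv(path, argv); /* returns only if the exec failed */
}
```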

He also said that ifuncs are too complicated.

Again, what is so complicated about a neatly packaged ifunc?

All an application programmer needs to do is to write

__attribute__((multicpu)) vec foo (vec y, vec z)
{
  return intrinsic (y, z);
}

And gcc would generate the ifunc and the variants for all CPUs for him.
Again, any hwcap tests could be optimized out.
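The `multicpu` attribute above is hypothetical. Later GCC releases added
an attribute of this shape, `target_clones`, which compiles one clone per
listed target plus a resolver and dispatches through an ifunc. A minimal
sketch, assuming GCC with glibc; the clone lists are illustrative, and
other compilers or C libraries simply get a plain function.

```c
#include <stddef.h>
#include <stdio.h> /* any libc header, so that __GLIBC__ is defined */

#if defined(__GNUC__) && !defined(__clang__) && defined(__GLIBC__) \
    && defined(__x86_64__) && __GNUC__ >= 6
#  define MULTICPU __attribute__((target_clones("avx2,default")))
#elif defined(__GNUC__) && !defined(__clang__) && defined(__GLIBC__) \
    && defined(__powerpc64__) && __GNUC__ >= 8
#  define MULTICPU __attribute__((target_clones("cpu=power9,default")))
#else
#  define MULTICPU /* no cloning: a single generic function */
#endif

/* With cloning enabled, the compiler emits one copy per listed target
   plus an ifunc resolver that picks a copy when the symbol is resolved. */
MULTICPU
double dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```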

This is again simple to implement, for example as a macro with slightly
clumsier syntax:

multicpu (int, foo, (x, y), (double x, double y))
{
  return x * y;
}

with

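#define multicpu(tp, name, arg, tparg) \
/* always_inline, so the body is recompiled inside each variant */ \
static inline __attribute__((__always_inline__)) tp __##name tparg; \
__attribute__((__target__("cpu=power5"))) tp __##name##_power5 tparg \
{ \
  return (tp) __##name arg; \
} \
__attribute__((__target__("cpu=power6"))) tp __##name##_power6 tparg \
{ \
  return (tp) __##name arg; \
} \
tp name tparg \
{ \
 /* placeholder: select a variant, e.g. through an ifunc resolver */ \
} \
static inline tp __##name tparg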

So where are the x86-64-specific bits?

References:
- Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
  - From: David Edelsohn
