This is the mail archive of the mailing list for the glibc project.

Re: Why do you want libc to be 5 times slower than other libraries?

On Wed, 2008-08-06 at 10:04 +0200, Agner Fog wrote:
> Thank you for a thorough and detailed answer.

> This means a lot of extra work reinventing the wheel and solving 
> problems that others have already solved. Not exactly what the idea of 
> the open source movement is....

The Free Software Foundation is a big juicy target for litigation.
They're very careful about what code they accept and they require that
the pedigree of the code be rock solid (i.e. you can't copy someone's
code, modify it, and then submit it for inclusion).

> Maybe other open source projects are
> willing to make a license-sharing agreement so the different projects 
> can benefit from each other rather than working independently on the 
> same problems. I explicitly stated in my mail that I was willing to 
> assign the necessary rights of my code to the Gnu project.

<quote>"At least I am willing to grant the necessary licenses to the
Gnu/libc project if you want to use my code."</quote>

In our world license and copyright ownership are two entirely different
things.  Excuse me for being pedantic but it is necessary.  We'll
eventually need you to explicitly assign copyright for any code you
contribute directly to the FSF.

> I am fully aware of that, and I would do it all if I had the time. 
> Unfortunately, I haven't. It would be a lot of work for me just to get 
> into the proper procedures, and I would still get complaints about using 
> the wrong type of tabs and spaces or whatever. I am testing different 
> libraries and different algorithms and telling you which one is fastest 
> and which ones can be improved. I am offering you optimized code, but I 
> am not offering the tedious work of fitting it into the form required 
> for libc.

Unfortunately, "the tedious work of fitting it into the form required
for libc" is where most of the work is and no-one likes to do that
footwork.  I'm sitting on some patches myself for memcpy on the PowerPC CELL
PPE processor because the integration work is time consuming.

I recommend contacting the aforementioned developers at Intel and AMD to
see where they're at, what they've got planned for the future (as far as
patches), what's missing, and how your contribution can fit in.

> Maybe the list descriptions need updating:
> "The libc-alpha list is for the discussion of glibc development"
> "The libc-help list is intended for all glibc questions including build 
> problems, C library usage, and more"

Fair enough.  Which webpage are you referring to?  Carlos O'Donell can
take care of it.

>  >d.) Your email didn't indicate how you gathered your data or
>  >whether you verified that what you were testing is an optimized version
>  >of the code for the processor in question. It is up to the Linux OS
>  >distributor to decide whether to compile and ship a CPU optimized
>  >library for a particular CPU or CPU subtype with their distribution.
>  >Did you compile your own versions of GLIBC for your tests? Are you sure
>  >your distribution isn't using the default (non-cpu specific) string 
> routines?
> It is not optimized for a specific CPU, that's indeed the problem. I 
> couldn't find any implementation of libc that has different branches for 
> different CPU's, e.g. SSE2, SSE3, Intel SSE4, AMD SSE4, etc.

For the PowerPC architecture we've integrated the powerpc-cpu framework
into GLIBC mainline.  The same framework could be used for the Intel/AMD
family of processors.  No-one has done the footwork for this thus far.
I think they wanted to take a different approach.  More on that later.

> Does such a CPU dispatching exist in libc? How does it work? It should 
> be possible to compile a static binary on a system with SSE-whatever, 
> and run it on a system with SSE-something-else. Therefore, I want the 
> CPU-dispatching to be inside libc.

We (IBM) had discussions with AMD and Intel at the 2007 GCC Summit where
they indicated that they were interested in dynamic runtime checks for
hardware capability which would route the application to the correct CPU
optimized function implementation while the application was running by
using a first-time-called hwcap check.

The 'first-time-called' hwcap check would work by having a wrapper
function check to see if it had an internal function pointer set for an
optimized version of the function.  If not, then it'd check the hwcap
for the specific platform information, find the correct function pointer
and set it.  Subsequent calls wouldn't pay this resolution
penalty.  I'm not sure if they made any progress on this.  H.J. Lu at
Intel would probably be able to tell you.
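
A minimal sketch of such a wrapper in C may make the idea concrete.  This
is not the Intel/AMD design, only an illustration of the description
above.  The names my_copy, copy_generic, copy_optimized, cpu_has_fast_copy
and resolve_count are all made up for this example, and the probe is a
placeholder where a real implementation would inspect the hwcap.  The
sketch also ignores thread safety.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Counts how often resolution runs; here only to show it happens once. */
static int resolve_count = 0;

/* Baseline byte-by-byte copy that works on any CPU. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a CPU-specific routine (hypothetical). */
static void *copy_optimized(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Hypothetical probe: a real one would test hwcap bits for the platform. */
static int cpu_has_fast_copy(void)
{
    return 1;
}

/* Function pointer cached after the first call. */
static copy_fn copy_impl = NULL;

/* Wrapper: resolve on first call, then dispatch through the pointer. */
void *my_copy(void *dst, const void *src, size_t n)
{
    if (copy_impl == NULL) {
        resolve_count++;
        copy_impl = cpu_has_fast_copy() ? copy_optimized : copy_generic;
    }
    assert(copy_impl != NULL);
    return copy_impl(dst, src, n);
}
```

Only the first call to my_copy pays for the probe; later calls cost one
pointer test plus an indirect call.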

On PowerPC we've taken a predetermined approach.

What we have for the PowerPC architecture is the powerpc-cpu framework.

gcc -mcpu=power6 app.c -o app

This compiles an application with the POWER6 instruction set.  When it
is executed on a system the dynamic link loader ( will look at the
aux vector for the AT_PLATFORM.  If AT_PLATFORM is 'power6' then the
dynamic link loader will load /lib/power6/ which includes the
POWER6 optimized string functions, e.g. memcpy, memset, et al.
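
For illustration, an application can look at the same aux vector entry
itself.  The sketch below assumes a Linux system whose C library provides
getauxval() in <sys/auxv.h>.  The helper names platform_name and
platform_libdir are hypothetical, and the directory string only mirrors
the /lib/power6/ example above; it is not a description of how the
loader builds its search path internally.

```c
#include <assert.h>
#include <elf.h>        /* AT_PLATFORM */
#include <stddef.h>
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() */

/* Return the AT_PLATFORM string from the aux vector, or "unknown"
   if the kernel did not supply one. */
const char *platform_name(void)
{
    unsigned long p = getauxval(AT_PLATFORM);
    return p ? (const char *)p : "unknown";
}

/* Format "/lib/<platform>/" into buf.  Returns 0 on success and -1 if
   the result does not fit or formatting fails. */
int platform_libdir(char *buf, size_t len, const char *platform)
{
    assert(buf != NULL && platform != NULL);
    int n = snprintf(buf, len, "/lib/%s/", platform);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

On a POWER6 system platform_name() would be expected to return "power6",
which is the value the loader compares against when it picks the
platform-specific directory.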

In a static linking scenario, if you compiled against POWER6 and linked
against a POWER6 and then tried to run 'app' on a POWER5 you'd
run into problems if the POWER6 compiled application attempted to use an
instruction from the POWER6 instruction set.

> I agree. The performance difference is highest when data are in the 
> level-1 cache and aligned by less than 16. I just didn't want to bother 
> you with excessive data when the main conclusion is so clear. The bottom 
> line is that memory and string functions in libc have poor performance 
> because you are not using XMM registers and you have no efficient way of 
> dealing with unaligned data. The most efficient way of copying data when 
> source and destination have different alignments is to read aligned into 
> XMM registers; shift and combine consecutive reads so that they fit the 
> alignment of the destination; then write aligned.

Sounds cool.
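
The read-aligned, shift-and-combine, write-aligned scheme described in
the quoted text can be sketched in portable C.  This sketch makes several
assumptions: 64-bit words stand in for the 128-bit XMM registers, the
host is little-endian, and the caller passes the aligned source words
together with the byte offset at which the data starts.  The name
copy_shifted is made up for this example.  It is not the code from the
library under discussion, only an illustration of the technique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy nwords 8-byte words into dst, taking the data that starts
   `shift` bytes (0..7) into the aligned array src_words.  For a nonzero
   shift this reads src_words[0..nwords], i.e. one extra source word.
   Every load and store is a whole aligned word.  Little-endian only. */
void copy_shifted(uint64_t *dst, const uint64_t *src_words,
                  size_t nwords, unsigned shift)
{
    assert(shift < 8);
    if (shift == 0) {                 /* same alignment: plain copy */
        memcpy(dst, src_words, nwords * sizeof *dst);
        return;
    }
    unsigned lo = shift * 8;          /* bits to drop from the low end */
    unsigned hi = 64 - lo;            /* bits taken from the next word */
    uint64_t prev = src_words[0];
    for (size_t i = 0; i < nwords; i++) {
        uint64_t next = src_words[i + 1];
        /* High bytes of prev become the low bytes of the output word;
           low bytes of next fill the remaining high bytes. */
        dst[i] = (prev >> lo) | (next << hi);
        prev = next;
    }
}
```

Each source word is loaded once and reused for two consecutive output
words, so the loop never issues an unaligned memory access.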

>  >f.) You'll have to get consensus amongst the concerned parties (and
>  >with the maintainer) that the trade-offs you're suggesting are
>  >appropriate.
> That's why I am discussing it here.

You should contact H.J. Lu (via email and CC this mailing list) and ask
him if they made any progress with their 'first-time-called'
optimization checks idea.

Ryan S. Arnold
IBM Linux Technology Center
Linux Toolchain Development
