Bug 17941 - Propose C Metaprogramming-based qsort as GNU extension
Summary: Propose C Metaprogramming-based qsort as GNU extension
Status: WAITING
Alias: None
Product: glibc
Classification: Unclassified
Component: libc (show other bugs)
Version: unspecified
: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-06 23:18 UTC by Daniel Santos
Modified: 2015-02-16 16:44 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Santos 2015-02-06 23:18:26 UTC
As part of my work on a paper on C Metaprogramming with gcc, I have implemented a sqort/insertion sort algorithm based off of the legacy glibc implementation (when msort is not used) who's performance far exceeds the current implementation. For example, I did a series of performance tests using these parameters:

Phenom CPU
gcc 4.9.2
16k bytes of data
element sizes between 1 and 1024 bytes
alignments between 1 and 32 bytes (where applicable)
key of unsigned integer of either 1, 2, 4 or 8 bytes (depending upon element size, largest possible used)

The greatest performance improvement was for 1 byte elements with 219% increase in speed and the worse was 1024 byte element size (both 2 and 8 bytes alignments) of only 26.78% increase. The average of this set was an increase of 82.94%. I've made a chart and put it here for now: http://loudmouth.javamonger.org/images/chart-16k-vs-msort.png.

The code can be found here: https://github.com/daniel-santos/cmeta. Here is a break-down of the pros and cons if these techniques as I see it.

The basic mechanism of C Metaprogramming is that many data values are known at the time a program is compiled. These techniques simply provide a way for gcc to use such values as "compile-time constants" and create a sort function specifically for sorting data of a specific element size, a specific element alignment, and a specific compare function. Since -findirect-inline, pointers to inline functions that are known to be constant at compile-time are able to compile-out the pointer and the function call and just inline the code. This inlining results in numerous additional optimization opportunities. Thus, at compile-time qsort_template() causes the generation of a sort function that is specific data type.

Pros:
Faster. Sometimes, much faster.
Adds no extra code to glibc library share librarys.

Cons:
1. Currently only works on recent GCC compilers (4.4+). It may be the case that other compilers (icc, cl.exe) supports extensions that can enable metaprogramming in C, but I have not ventured far enough into that. My hope is that the compiler capabilities required will eventually become part of the standard

2. Typically results in larger code size for the calling program. The function qsort_template is just a psudo-template function and isn't built until a piece of code that uses it is compiled.

3. Requires a basic C Metaprogramming library somewhere to avoid duplication of support functions (in the above project on github, this is currently contained in compiler.h). This library is mostly macros to verify that values are constant at compile-time, and add gcc attributes (always_inline, etc.)

4. I have not yet discovered a mechanism to force GCC to instantiate a "pseudo-template function" (as I call them) reliably and not inline everything – this results in bloat. This is most pronounced when __builtin_memcopy is expanded in multiple places.

The code its self needs more cleanup to be ready to integrate into glibc, although the algorithm appears to now be quite efficient and error-free.
Comment 1 Carlos O'Donell 2015-02-07 02:54:28 UTC
(In reply to Daniel Santos from comment #0)
> As part of my work on a paper on C Metaprogramming with gcc, I have
> implemented a sqort/insertion sort algorithm based off of the legacy glibc
> implementation (when msort is not used) who's performance far exceeds the
> current implementation. For example, I did a series of performance tests
> using these parameters:
> 
> Phenom CPU
> gcc 4.9.2
> 16k bytes of data
> element sizes between 1 and 1024 bytes
> alignments between 1 and 32 bytes (where applicable)
> key of unsigned integer of either 1, 2, 4 or 8 bytes (depending upon element
> size, largest possible used)
> 
> The greatest performance improvement was for 1 byte elements with 219%
> increase in speed and the worse was 1024 byte element size (both 2 and 8
> bytes alignments) of only 26.78% increase. The average of this set was an
> increase of 82.94%. I've made a chart and put it here for now:
> http://loudmouth.javamonger.org/images/chart-16k-vs-msort.png.
> 
> The code can be found here: https://github.com/daniel-santos/cmeta. Here is
> a break-down of the pros and cons if these techniques as I see it.
> 
> The basic mechanism of C Metaprogramming is that many data values are known
> at the time a program is compiled. These techniques simply provide a way for
> gcc to use such values as "compile-time constants" and create a sort
> function specifically for sorting data of a specific element size, a
> specific element alignment, and a specific compare function. Since
> -findirect-inline, pointers to inline functions that are known to be
> constant at compile-time are able to compile-out the pointer and the
> function call and just inline the code. This inlining results in numerous
> additional optimization opportunities. Thus, at compile-time
> qsort_template() causes the generation of a sort function that is specific
> data type.
> 
> Pros:
> Faster. Sometimes, much faster.
> Adds no extra code to glibc library share librarys.
> 
> Cons:
> 1. Currently only works on recent GCC compilers (4.4+). It may be the case
> that other compilers (icc, cl.exe) supports extensions that can enable
> metaprogramming in C, but I have not ventured far enough into that. My hope
> is that the compiler capabilities required will eventually become part of
> the standard

Building glibc requires gcc 4.6, so you should be fine?

Unless you're saying that the compiled program requires gcc 4.6 or newer to take advantage of this?

You could check for the compiler at build time in the headers and select a fallback.

> 2. Typically results in larger code size for the calling program. The
> function qsort_template is just a psudo-template function and isn't built
> until a piece of code that uses it is compiled.

How much larger?

> 3. Requires a basic C Metaprogramming library somewhere to avoid duplication
> of support functions (in the above project on github, this is currently
> contained in compiler.h). This library is mostly macros to verify that
> values are constant at compile-time, and add gcc attributes (always_inline,
> etc.)

License? Size?

> 4. I have not yet discovered a mechanism to force GCC to instantiate a
> "pseudo-template function" (as I call them) reliably and not inline
> everything – this results in bloat. This is most pronounced when
> __builtin_memcopy is expanded in multiple places.

OK.

> The code its self needs more cleanup to be ready to integrate into glibc,
> although the algorithm appears to now be quite efficient and error-free.

At a high level I like the novel approach.

At a practical level the license of C metaprogramming library is a key point in the discussion. As is making glibc depend upon it. As are a lot of other questions like: runtime dependencies, build time dependencies, dependencies of the program being built with glibc headers that rely on this feature.

I would strongly suggest posting to libc-alpha@sourceware.org before moving forward with any broader work to integrate with glibc. I would not want to see such good work go to waste.
Comment 2 Daniel Santos 2015-02-09 22:45:16 UTC
(In reply to Carlos O'Donell from comment #1)
> Building glibc requires gcc 4.6, so you should be fine?

At this point, it doesn't matter what glibc is built with.

> Unless you're saying that the compiled program requires gcc 4.6 or newer to
> take advantage of this?

Yes. In its current form, the program being compiled actually needs gcc 4.8+ or some such, but this isn't necessary. With the proper compiler-sniffing macros (which belong mostly in the "metaprogramming library" header(s)), it should build with any half-descent C compiler, both in optimized and unoptimized builds. The difference is that performance will be very poor unless it's a 4.4+ compiler. I haven't actually run the tests using the full gambit of compilers on this particular algorithm yet, but I have done it on a generic red-black tree that uses these same techniques and it essentially is poor prior to 4.4 and then gets good around 4.6 and great at 4.7. The generic red-black tree even builds and works on 3.6.2, but is huge and horrible.

> You could check for the compiler at build time in the headers and select a
> fallback.
> 
> > 2. Typically results in larger code size for the calling program. The
> > function qsort_template is just a psudo-template function and isn't built
> > until a piece of code that uses it is compiled.
> 
> How much larger?

Well I should start with some some definitions for clarity.

qsort_r        - the current glibc qsort_r implementation in stdlib/msort.c
_quicksort     - the legacy / fallback qsort_r implementation stdlib/qsort.c
qsort_template - my proposed pseudo-template functions

So for comparison, qsort_r in my system glibc is 824 bytes. I'm building _quicksort in my benchmark program and that is 1214 bytes. There are some slight differences in the CFLAGS used for my system glibc and my benchmarks however:

system glibc:
    CFLAGS="-march=native -g3 -O2 -fno-strict-aliasing -fno-stack-protector"

my benchmarks:
    CFLAGS="-march=native -mtune=native -g3 -O2 -DNDEBUG -std=gnu11"

So that being said, here are a few values for the generated code of qsort_template instantiations. I don't have this scripted (just picking this out of an objdump):

Elem. Size  Align     Fn size in bytes
         1      1     721 (0x2d1)
         2      1     797 (0x31d)
         2      2     754 (0x2f2)
         4      1     753 (0x2f1)
         4      2     732 (0x2dc)
         4      4     732 (0x2dc)
         8      1     785 (0x311)
         8      8     731 (0x2db)
        16     16     881 (0x371)
        24      8    1140 (0x474)
        32     16    1178 (0x49a)
        32     32    1161 (0x489)
        40      8    1361 (0x551)
        64     32    1788 (0x6fc)
       128     32     977 (0x3d1)
       256      1    1361 (0x551)
       256      2    1245 (0x4dd)
       256      4    1105 (0x451)
       256      8     977 (0x3d1)
       256     16     977 (0x3d1)
       256     32     977 (0x3d1)
       512     32     977 (0x3d1)
      1024      1    1435 (0x59b)
      1024      2    1401 (0x579)
      1024      5    1141 (0x475)
      1024      8     981 (0x3d5)
      1024     16     981 (0x3d5)
      1024     32     981 (0x3d5)

Some of this is bloat caused by attributes always_inline and flatten I'm using to force the use of values known at compile-time. Some of this can be mitigated with a if statement that directly calls memcpy when the size is smaller and calls a no_inline function (or simply one defined somewhere else without a compiler built-in) when the size is too large. The if statement would be compiled-out after optimizations. But all of that is sub-optimal - always better when we can let the compiler decide when to inline and when not to. gcc uses the optimizations flags as well as some tuning parameters (like max-inline-insns-single, etc.) which are fed via CFLAGS to make these decisions normally and we're only overriding them here as a mechanism to cause a sort of template instantiation.

> 
> > 3. Requires a basic C Metaprogramming library somewhere to avoid duplication
> > of support functions (in the above project on github, this is currently
> > contained in compiler.h). This library is mostly macros to verify that
> > values are constant at compile-time, and add gcc attributes (always_inline,
> > etc.)
> 
> License? Size?

I do GPL3 whenever at all possible. Some of what I have in this project is GPL2 having been borrowed from the Linux kernel. However, much of it is my own property as they were my commits, so I can reuse them anywhere I want. :) (not that I intend to stop working on the kernel)

As to the size, I presume that you mean the compiled size. Being a metaprogramming library, there is not yet any of it that will add to the size of libc.so, although I can easily foresee the addition of utility functions that would be built with glibc (usually tiny). This "metaprogramming library" can be thought of as somewhat analogous to the boost library for C++, most of which is headers containing C++ templates.


> At a high level I like the novel approach.
> 
> At a practical level the license of C metaprogramming library is a key point
> in the discussion. As is making glibc depend upon it. As are a lot of other
> questions like: runtime dependencies, build time dependencies, dependencies
> of the program being built with glibc headers that rely on this feature.

No runtime dependencies.
No built-time dependencies aside from the compiler.
Programs being built with glibc headers that use this (or similar) extensions will only need to be using a recent gcc compiler *for good results.* If the metaprogramming library (headers) are written correctly and tested with a good enough array of compilers, the programs built using this should still compile and run correctly on other compilers, but will perform poorly.  In fact, I guess there's no reason why the qsort_template function can't simply be replaced when a good gcc compiler isn't being used. Basically, a qsort_r compatible compar function pointer would need to be added to struct qsort_def and then just do a #ifdef

#ifndef _GLIBC_SOME_VALUE

static inline int
qsort_template (const struct qsort_def *def, void *const pbase,
                size_t n, void *arg)
{
  qsort_r(pbase, n, def->size, def->compar);
  return 0;
}

#else
/ * the qsort_template definition... */
#endif

This way, if the programmer wants their program to have good performance when not built with a recent gcc compiler, they need only populate the struct's compar member. The reason I'm using less instead of compar basically comes down to

1.) no existing standard mandating a compar function,
2.) we don't care about equality
3.) less() is the standard in C++ libraries where equality is irrelevant, and
4.) the fact that current gcc fails to properly optimize the results of a compare function (that returns -1, 0 or 1) resulting in one or two more instructions than necessary (I have a gcc bug open for this somewhere). It isn't a huge difference, but I got into the habit of using less whenever equality doesn't matter.


> 
> I would strongly suggest posting to libc-alpha@sourceware.org before moving
> forward with any broader work to integrate with glibc. I would not want to
> see such good work go to waste.

I most certainly will.
Comment 3 Daniel Santos 2015-02-10 02:16:39 UTC
The gcc bug I was referring to (the less() vs compar() issue) is here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829.
Comment 4 Carlos O'Donell 2015-02-10 15:46:51 UTC
(In reply to Daniel Santos from comment #2)
> > License? Size?
> 
> I do GPL3 whenever at all possible. Some of what I have in this project is
> GPL2 having been borrowed from the Linux kernel. However, much of it is my
> own property as they were my commits, so I can reuse them anywhere I want.
> :) (not that I intend to stop working on the kernel)

This is a serious problem for glibc. We use LGPLv2, and unfortunately users in the community expect that license, and we will likely never move from that position. Are you interested and able to relicense as LGPLv2?
Comment 5 Daniel Santos 2015-02-13 20:25:02 UTC
(In reply to Carlos O'Donell from comment #4)
> (In reply to Daniel Santos from comment #2)
> > > License? Size?
> > 
> > I do GPL3 whenever at all possible. Some of what I have in this project is
> > GPL2 having been borrowed from the Linux kernel. However, much of it is my
> > own property as they were my commits, so I can reuse them anywhere I want.
> > :) (not that I intend to stop working on the kernel)
> 
> This is a serious problem for glibc. We use LGPLv2, and unfortunately users
> in the community expect that license, and we will likely never move from
> that position. Are you interested and able to relicense as LGPLv2?

Sorry for my slow response, I've been a bit busy. So LGPLv2 makes sense for such a library, I do recall this now. It would mean that a few items in the metaprogramming library be re-written, but there isn't really that much anyway. I would rather that then the effort be split in between LGPL-like licenses and GPL.

Also, I finally scripted the function size analysis and I'm quite surprised! The function never seems to grow larger than 1805 bytes (in the case of unaligned 64 byte elements) and once the element size reaches 1k, gcc is indeed making a very pretty rep movqs loop as I had hoped it would on x86. If these are aligned to at least 8 bytes, the function size is always 981 when element size is a power of two (otherwise it's just a few bytes more).

Elem size   Align       Function Size
1           1           710
2           1           797
2           2           754
4           1           744
4           2           732
4           4           732
8           1           780
8           2           736
8           4           736
8           8           736
16          1           897
16          2           874
16          4           874
16          8           874
16          16          874
24          1           1104
24          2           1140
24          4           1140
24          8           1140
32          1           1193
32          2           1178
32          4           1178
32          8           1178
32          16          1178
32          32          1178
36          1           1360
36          2           1348
36          4           1348
40          1           1367
40          2           1361
40          4           1361
40          8           1361
48          1           1578
48          2           1560
48          4           1560
48          8           1560
48          16          1560
64          1           1805
64          2           1788
64          4           1788
64          8           1788
64          16          1788
64          32          1788
128         1           1316
128         2           1237
128         4           1090
128         8           976
128         16          976
128         32          976
256         1           1351
256         2           1245
256         4           1090
256         8           976
256         16          976
256         32          976
512         1           1351
512         2           1245
512         4           1090
512         8           976
512         16          976
512         32          976
1024        1           1435
1024        2           1401
1024        4           1141
1024        8           981
1024        16          981
1024        32          981
5120        1           1435
5120        2           1409
5120        4           1173
5120        8           995
5120        16          995
5120        32          995