As part of my work on a paper on C Metaprogramming with gcc, I have implemented a sqort/insertion sort algorithm based off of the legacy glibc implementation (when msort is not used) who's performance far exceeds the current implementation. For example, I did a series of performance tests using these parameters: Phenom CPU gcc 4.9.2 16k bytes of data element sizes between 1 and 1024 bytes alignments between 1 and 32 bytes (where applicable) key of unsigned integer of either 1, 2, 4 or 8 bytes (depending upon element size, largest possible used) The greatest performance improvement was for 1 byte elements with 219% increase in speed and the worse was 1024 byte element size (both 2 and 8 bytes alignments) of only 26.78% increase. The average of this set was an increase of 82.94%. I've made a chart and put it here for now: http://loudmouth.javamonger.org/images/chart-16k-vs-msort.png. The code can be found here: https://github.com/daniel-santos/cmeta. Here is a break-down of the pros and cons if these techniques as I see it. The basic mechanism of C Metaprogramming is that many data values are known at the time a program is compiled. These techniques simply provide a way for gcc to use such values as "compile-time constants" and create a sort function specifically for sorting data of a specific element size, a specific element alignment, and a specific compare function. Since -findirect-inline, pointers to inline functions that are known to be constant at compile-time are able to compile-out the pointer and the function call and just inline the code. This inlining results in numerous additional optimization opportunities. Thus, at compile-time qsort_template() causes the generation of a sort function that is specific data type. Pros: Faster. Sometimes, much faster. Adds no extra code to glibc library share librarys. Cons: 1. Currently only works on recent GCC compilers (4.4+). It may be the case that other compilers (icc, cl.exe) supports extensions that can enable metaprogramming in C, but I have not ventured far enough into that. My hope is that the compiler capabilities required will eventually become part of the standard 2. Typically results in larger code size for the calling program. The function qsort_template is just a psudo-template function and isn't built until a piece of code that uses it is compiled. 3. Requires a basic C Metaprogramming library somewhere to avoid duplication of support functions (in the above project on github, this is currently contained in compiler.h). This library is mostly macros to verify that values are constant at compile-time, and add gcc attributes (always_inline, etc.) 4. I have not yet discovered a mechanism to force GCC to instantiate a "pseudo-template function" (as I call them) reliably and not inline everything – this results in bloat. This is most pronounced when __builtin_memcopy is expanded in multiple places. The code its self needs more cleanup to be ready to integrate into glibc, although the algorithm appears to now be quite efficient and error-free.
(In reply to Daniel Santos from comment #0) > As part of my work on a paper on C Metaprogramming with gcc, I have > implemented a sqort/insertion sort algorithm based off of the legacy glibc > implementation (when msort is not used) who's performance far exceeds the > current implementation. For example, I did a series of performance tests > using these parameters: > > Phenom CPU > gcc 4.9.2 > 16k bytes of data > element sizes between 1 and 1024 bytes > alignments between 1 and 32 bytes (where applicable) > key of unsigned integer of either 1, 2, 4 or 8 bytes (depending upon element > size, largest possible used) > > The greatest performance improvement was for 1 byte elements with 219% > increase in speed and the worse was 1024 byte element size (both 2 and 8 > bytes alignments) of only 26.78% increase. The average of this set was an > increase of 82.94%. I've made a chart and put it here for now: > http://loudmouth.javamonger.org/images/chart-16k-vs-msort.png. > > The code can be found here: https://github.com/daniel-santos/cmeta. Here is > a break-down of the pros and cons if these techniques as I see it. > > The basic mechanism of C Metaprogramming is that many data values are known > at the time a program is compiled. These techniques simply provide a way for > gcc to use such values as "compile-time constants" and create a sort > function specifically for sorting data of a specific element size, a > specific element alignment, and a specific compare function. Since > -findirect-inline, pointers to inline functions that are known to be > constant at compile-time are able to compile-out the pointer and the > function call and just inline the code. This inlining results in numerous > additional optimization opportunities. Thus, at compile-time > qsort_template() causes the generation of a sort function that is specific > data type. > > Pros: > Faster. Sometimes, much faster. > Adds no extra code to glibc library share librarys. > > Cons: > 1. Currently only works on recent GCC compilers (4.4+). It may be the case > that other compilers (icc, cl.exe) supports extensions that can enable > metaprogramming in C, but I have not ventured far enough into that. My hope > is that the compiler capabilities required will eventually become part of > the standard Building glibc requires gcc 4.6, so you should be fine? Unless you're saying that the compiled program requires gcc 4.6 or newer to take advantage of this? You could check for the compiler at build time in the headers and select a fallback. > 2. Typically results in larger code size for the calling program. The > function qsort_template is just a psudo-template function and isn't built > until a piece of code that uses it is compiled. How much larger? > 3. Requires a basic C Metaprogramming library somewhere to avoid duplication > of support functions (in the above project on github, this is currently > contained in compiler.h). This library is mostly macros to verify that > values are constant at compile-time, and add gcc attributes (always_inline, > etc.) License? Size? > 4. I have not yet discovered a mechanism to force GCC to instantiate a > "pseudo-template function" (as I call them) reliably and not inline > everything – this results in bloat. This is most pronounced when > __builtin_memcopy is expanded in multiple places. OK. > The code its self needs more cleanup to be ready to integrate into glibc, > although the algorithm appears to now be quite efficient and error-free. At a high level I like the novel approach. At a practical level the license of C metaprogramming library is a key point in the discussion. As is making glibc depend upon it. As are a lot of other questions like: runtime dependencies, build time dependencies, dependencies of the program being built with glibc headers that rely on this feature. I would strongly suggest posting to libc-alpha@sourceware.org before moving forward with any broader work to integrate with glibc. I would not want to see such good work go to waste.
(In reply to Carlos O'Donell from comment #1) > Building glibc requires gcc 4.6, so you should be fine? At this point, it doesn't matter what glibc is built with. > Unless you're saying that the compiled program requires gcc 4.6 or newer to > take advantage of this? Yes. In its current form, the program being compiled actually needs gcc 4.8+ or some such, but this isn't necessary. With the proper compiler-sniffing macros (which belong mostly in the "metaprogramming library" header(s)), it should build with any half-descent C compiler, both in optimized and unoptimized builds. The difference is that performance will be very poor unless it's a 4.4+ compiler. I haven't actually run the tests using the full gambit of compilers on this particular algorithm yet, but I have done it on a generic red-black tree that uses these same techniques and it essentially is poor prior to 4.4 and then gets good around 4.6 and great at 4.7. The generic red-black tree even builds and works on 3.6.2, but is huge and horrible. > You could check for the compiler at build time in the headers and select a > fallback. > > > 2. Typically results in larger code size for the calling program. The > > function qsort_template is just a psudo-template function and isn't built > > until a piece of code that uses it is compiled. > > How much larger? Well I should start with some some definitions for clarity. qsort_r - the current glibc qsort_r implementation in stdlib/msort.c _quicksort - the legacy / fallback qsort_r implementation stdlib/qsort.c qsort_template - my proposed pseudo-template functions So for comparison, qsort_r in my system glibc is 824 bytes. I'm building _quicksort in my benchmark program and that is 1214 bytes. There are some slight differences in the CFLAGS used for my system glibc and my benchmarks however: system glibc: CFLAGS="-march=native -g3 -O2 -fno-strict-aliasing -fno-stack-protector" my benchmarks: CFLAGS="-march=native -mtune=native -g3 -O2 -DNDEBUG -std=gnu11" So that being said, here are a few values for the generated code of qsort_template instantiations. I don't have this scripted (just picking this out of an objdump): Elem. Size Align Fn size in bytes 1 1 721 (0x2d1) 2 1 797 (0x31d) 2 2 754 (0x2f2) 4 1 753 (0x2f1) 4 2 732 (0x2dc) 4 4 732 (0x2dc) 8 1 785 (0x311) 8 8 731 (0x2db) 16 16 881 (0x371) 24 8 1140 (0x474) 32 16 1178 (0x49a) 32 32 1161 (0x489) 40 8 1361 (0x551) 64 32 1788 (0x6fc) 128 32 977 (0x3d1) 256 1 1361 (0x551) 256 2 1245 (0x4dd) 256 4 1105 (0x451) 256 8 977 (0x3d1) 256 16 977 (0x3d1) 256 32 977 (0x3d1) 512 32 977 (0x3d1) 1024 1 1435 (0x59b) 1024 2 1401 (0x579) 1024 5 1141 (0x475) 1024 8 981 (0x3d5) 1024 16 981 (0x3d5) 1024 32 981 (0x3d5) Some of this is bloat caused by attributes always_inline and flatten I'm using to force the use of values known at compile-time. Some of this can be mitigated with a if statement that directly calls memcpy when the size is smaller and calls a no_inline function (or simply one defined somewhere else without a compiler built-in) when the size is too large. The if statement would be compiled-out after optimizations. But all of that is sub-optimal - always better when we can let the compiler decide when to inline and when not to. gcc uses the optimizations flags as well as some tuning parameters (like max-inline-insns-single, etc.) which are fed via CFLAGS to make these decisions normally and we're only overriding them here as a mechanism to cause a sort of template instantiation. > > > 3. Requires a basic C Metaprogramming library somewhere to avoid duplication > > of support functions (in the above project on github, this is currently > > contained in compiler.h). This library is mostly macros to verify that > > values are constant at compile-time, and add gcc attributes (always_inline, > > etc.) > > License? Size? I do GPL3 whenever at all possible. Some of what I have in this project is GPL2 having been borrowed from the Linux kernel. However, much of it is my own property as they were my commits, so I can reuse them anywhere I want. :) (not that I intend to stop working on the kernel) As to the size, I presume that you mean the compiled size. Being a metaprogramming library, there is not yet any of it that will add to the size of libc.so, although I can easily foresee the addition of utility functions that would be built with glibc (usually tiny). This "metaprogramming library" can be thought of as somewhat analogous to the boost library for C++, most of which is headers containing C++ templates. > At a high level I like the novel approach. > > At a practical level the license of C metaprogramming library is a key point > in the discussion. As is making glibc depend upon it. As are a lot of other > questions like: runtime dependencies, build time dependencies, dependencies > of the program being built with glibc headers that rely on this feature. No runtime dependencies. No built-time dependencies aside from the compiler. Programs being built with glibc headers that use this (or similar) extensions will only need to be using a recent gcc compiler *for good results.* If the metaprogramming library (headers) are written correctly and tested with a good enough array of compilers, the programs built using this should still compile and run correctly on other compilers, but will perform poorly. In fact, I guess there's no reason why the qsort_template function can't simply be replaced when a good gcc compiler isn't being used. Basically, a qsort_r compatible compar function pointer would need to be added to struct qsort_def and then just do a #ifdef #ifndef _GLIBC_SOME_VALUE static inline int qsort_template (const struct qsort_def *def, void *const pbase, size_t n, void *arg) { qsort_r(pbase, n, def->size, def->compar); return 0; } #else / * the qsort_template definition... */ #endif This way, if the programmer wants their program to have good performance when not built with a recent gcc compiler, they need only populate the struct's compar member. The reason I'm using less instead of compar basically comes down to 1.) no existing standard mandating a compar function, 2.) we don't care about equality 3.) less() is the standard in C++ libraries where equality is irrelevant, and 4.) the fact that current gcc fails to properly optimize the results of a compare function (that returns -1, 0 or 1) resulting in one or two more instructions than necessary (I have a gcc bug open for this somewhere). It isn't a huge difference, but I got into the habit of using less whenever equality doesn't matter. > > I would strongly suggest posting to libc-alpha@sourceware.org before moving > forward with any broader work to integrate with glibc. I would not want to > see such good work go to waste. I most certainly will.
The gcc bug I was referring to (the less() vs compar() issue) is here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829.
(In reply to Daniel Santos from comment #2) > > License? Size? > > I do GPL3 whenever at all possible. Some of what I have in this project is > GPL2 having been borrowed from the Linux kernel. However, much of it is my > own property as they were my commits, so I can reuse them anywhere I want. > :) (not that I intend to stop working on the kernel) This is a serious problem for glibc. We use LGPLv2, and unfortunately users in the community expect that license, and we will likely never move from that position. Are you interested and able to relicense as LGPLv2?
(In reply to Carlos O'Donell from comment #4) > (In reply to Daniel Santos from comment #2) > > > License? Size? > > > > I do GPL3 whenever at all possible. Some of what I have in this project is > > GPL2 having been borrowed from the Linux kernel. However, much of it is my > > own property as they were my commits, so I can reuse them anywhere I want. > > :) (not that I intend to stop working on the kernel) > > This is a serious problem for glibc. We use LGPLv2, and unfortunately users > in the community expect that license, and we will likely never move from > that position. Are you interested and able to relicense as LGPLv2? Sorry for my slow response, I've been a bit busy. So LGPLv2 makes sense for such a library, I do recall this now. It would mean that a few items in the metaprogramming library be re-written, but there isn't really that much anyway. I would rather that then the effort be split in between LGPL-like licenses and GPL. Also, I finally scripted the function size analysis and I'm quite surprised! The function never seems to grow larger than 1805 bytes (in the case of unaligned 64 byte elements) and once the element size reaches 1k, gcc is indeed making a very pretty rep movqs loop as I had hoped it would on x86. If these are aligned to at least 8 bytes, the function size is always 981 when element size is a power of two (otherwise it's just a few bytes more). Elem size Align Function Size 1 1 710 2 1 797 2 2 754 4 1 744 4 2 732 4 4 732 8 1 780 8 2 736 8 4 736 8 8 736 16 1 897 16 2 874 16 4 874 16 8 874 16 16 874 24 1 1104 24 2 1140 24 4 1140 24 8 1140 32 1 1193 32 2 1178 32 4 1178 32 8 1178 32 16 1178 32 32 1178 36 1 1360 36 2 1348 36 4 1348 40 1 1367 40 2 1361 40 4 1361 40 8 1361 48 1 1578 48 2 1560 48 4 1560 48 8 1560 48 16 1560 64 1 1805 64 2 1788 64 4 1788 64 8 1788 64 16 1788 64 32 1788 128 1 1316 128 2 1237 128 4 1090 128 8 976 128 16 976 128 32 976 256 1 1351 256 2 1245 256 4 1090 256 8 976 256 16 976 256 32 976 512 1 1351 512 2 1245 512 4 1090 512 8 976 512 16 976 512 32 976 1024 1 1435 1024 2 1401 1024 4 1141 1024 8 981 1024 16 981 1024 32 981 5120 1 1435 5120 2 1409 5120 4 1173 5120 8 995 5120 16 995 5120 32 995