Enhancing libm
1. Alternate Runtimes
One of the ways in which we can enhance libm is by providing alternate runtimes with guarantees that more closely match user requirements.
A user would select an alternate runtime at link time via a compiler flag.
The proposed alternate runtimes are:
- Fast implementation
- Default implementation
- Precise implementation
See the previous discussion, section 8, for more details of possible variants (note that the details of the default accuracy, exception and errno goals are now in the glibc manual, which supersedes the message linked from section 2). That discussion proposes compile-time rather than link-time selection, with correctly rounded functions chosen by the user under names such as crsin rather than by a compile-time option at all.
1.1. Fast Implementation
The fast implementation would allow results to be off by several ULPs.
The results are not required to be correctly rounded.
Selected by compiling with gcc using either the existing -ffast-math flag (which also enables other math-related optimizations) or the not-yet-implemented dedicated flag -ffast-libm.
1.2. Default Implementation
The default implementation is what we currently have with the IBM multi-precision code in the library.
Selected by compiling without any special options i.e. the default.
1.3. Precise Implementation
The precise implementation would strive for strict IEEE correct rounding, despite the speed cost.
Selected by compiling with gcc and the not-yet-implemented flag -fprecise-libm.
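At the command line, the selection model for the three runtimes could look like this (-ffast-libm and -fprecise-libm are the proposed, not-yet-implemented flags from this document):

```
gcc app.c -lm                  # default implementation
gcc app.c -ffast-math -lm      # fast: existing flag, enables other math optimizations too
gcc app.c -ffast-libm -lm      # fast: proposed dedicated flag (not yet implemented)
gcc app.c -fprecise-libm -lm   # precise, correctly rounded (not yet implemented)
```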
1.4. Design
The alternate runtimes will be implemented as alternate entry points in the same library. The fast versions of the functions will be called __fast_* and the precise versions of the functions will be called __cr_* where cr stands for "correctly rounded." The existing implementation will continue to use the existing symbols.
The compiler will automatically rewrite a call to a function foo with a call to __fast_foo if the -ffast-libm option was given; likewise for -fprecise-libm.
By using alternate entry points we ensure that the compilation unit always runs with the code it was intended to run with and that the properties of libm can't easily be changed at runtime (ignoring interposition of a libm with __fast_* symbols which aren't fast).
Initially libm can alias all __fast_* and __cr_* symbols to the existing symbols, but we can migrate symbols to new versions when we provide fast or correctly rounded variants of the functions to the library.
2. Vector Entry Points
The math library may provide small-vector entry points to support reducing the cost of calling libm functions over multiple input arguments.
The compiler may collect several calls to a function and then call an alternate vector entry point, e.g. __vec4_fast_*.
2.1. Design
The math library must provide small-vector entry points for certain functions.
The compiler must be able to auto-vectorize several scalar calls into a single vector call of the same function.
Analysis needs to be done to decide exactly what size is beneficial.
The alignment of the types must be precisely documented.
Large vectors are excluded because they can more efficiently be handled by accelerators or other APIs.
3. Addition of x86_64 vector math functions to Glibc
3.1. Goal
The main goal is to make use of the SIMD constructs in OpenMP 4.0 (#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) and Cilk Plus (#6-8 in http://www.cilkplus.org/sites/default/files/open_specifications/Intel_Cilk_plus_lang_spec_1.2.htm) on x86_64 by adding vector implementations of several math functions (float and double versions).
3.2. What are vector math functions?
Vector math functions are vector variants of the corresponding scalar math operations, implemented using SIMD ISA extensions (e.g. SSE or AVX). They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. Using vector math functions is faster than repeatedly calling the scalar math routines. However, these vector versions differ from their scalar analogues in accuracy and in behavior on special values. The functions are optimized for performance on their respective domains when processing does not encounter special values such as denormals, overflow, underflow, or out-of-range arguments. Special-value processing is done in a scalar fashion via calls to the respective scalar routines. Additionally, functions such as the trigonometric ones may resort to scalar processing of huge (or otherwise awkward) arguments that do not necessarily produce special values but require different, less SIMD-friendly handling.
These functions were tested (via reasonable random sampling) to pass 4-ulp maximum relative error criterion on their domains in round-to-nearest computation mode. Known limitations:
C99 compliance in terms of special values, exceptions and errno:
a. Functions may not raise exceptions as required by the C language standard, and may raise spurious exceptions. This is considered an artifact of SIMD processing and may be fixed in the future on a case-by-case basis.
b. Functions may not change errno in some of the required cases, e.g. if the SIMD friendly algorithm is done branch-free without a libm call for that value. This is done for performance reasons.
c. As the implementation is dependent on libm, some accuracy and special case problems may be inherent to this fact.
d. Functions do not guarantee fully correct results in computation modes other than round-to-nearest.
3.3. Integration and usage model
Currently a call to a vector math function can be generated by GCC (from version 4.9.0) if the developer uses OpenMP SIMD constructs and passes -fopenmp.
These functions don't set errno and have lower accuracy, so we need to require -ffast-math (which sets -fno-math-errno) to use them (or perhaps enable -ffast-math under -fopenmp). The name of the vector function created by GCC is based on the Vector Function ABI (http://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf), with slight differences at the moment (mainly b, c, d instead of x, y, Y), but we will fix that mangling.
For example, the following code in the file cos.c:
#pragma omp declare simd
extern double cos (double);

int N = 300;
double b[300];
double a[300];

int
main (void)
{
  int i;

#pragma omp simd
  for (i = 0; i < N; i += 1)
    {
      b[i] = cos (a[i]);
    }

  return 0;
}
when built by gcc 4.9.0 with the following command:

gcc ./cos.c -I/PATH_TO_GLIBC_INSTALL/include -L/PATH_TO_GLIBC_INSTALL/lib/ -O1 -fopenmp -lm -mavx2

produces a binary in which nm a.out | grep ZGV shows:

U _ZGVdN4v_cos@@GLIBC_2.21

(here PATH_TO_GLIBC_INSTALL is the path to a Glibc already built with _ZGVdN4v_cos).
3.4. Testsuite
We plan to test the vector functions based on the testsuite for the scalar ones. This will be done with wrappers that combine the argument values into a vector register, pass it to the vector function, check that the elements of the result vector are equal, and return one extracted value. Only the round-to-nearest rounding mode will be checked.
3.5. Consensus
a. Put new functions to glibc or to gcc?
- Consensus is that we will put them into glibc.
- We wouldn't want to put them in gcc because that would restrict users of other compilers, e.g. llvm.
b. Add new functions to libm or libmvec?
- Consensus is that we will put the new functions in libmvec.
- We would like to integrate the new functions with libm because that makes it easier for developers to employ vectorization and should improve acceptance. The libmvec case affects compiler options, and it seems a new header would need to be included instead of math.h; or is it OK to include it from math.h?
- The use of libm.so as a linker script with AS_NEEDED for libmvec.so largely removes the need to link explicitly against libmvec. A new libmvec with an API/ABI distinct from libm allows the project to deploy an experimental libmvec for use by application developers and compiler writers without needing the same stability guarantees as libm. The libmvec symbols could eventually be moved to libm if the project so chooses, so a distinct library is the conservative approach.
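For reference, the linker-script approach would make the installed libm.so a small text file along these lines (paths illustrative; glibc already installs libc.so as a linker script in the same style):

```
/* GNU ld script */
GROUP ( /lib64/libm.so.6  AS_NEEDED ( /lib64/libmvec_nonshared.a /lib64/libmvec.so.1 ) )
```

AS_NEEDED means libmvec.so.1 is recorded as a dependency only if the link actually resolves a symbol from it, so programs that never call a vector function pick up no new runtime dependency.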
3.6. Open questions
c. Glibc build requirements?
- For the SSE4, AVX and AVX2 implementations it has been claimed that we don't need to change the Glibc build requirements. However, AVX2 support was added in binutils 2.22, so we need either to decide whether it's OK to increase the binutils requirement, or to have a reason to expect the code to work with older versions.
d. How to handle other architectures having different sets of vectorized functions and possibly not having the same set of vector ISAs for each function?
- Perhaps with the help of some wrappers?
e. How to handle the case where Glibc provides no vector function but the application calls one (because an older Glibc version is installed)?
- This should never happen. An application must always run with the glibc it was compiled against or newer.