Enhancing libm

1. Alternate Runtimes

One of the ways in which we can enhance libm is by providing alternate runtimes with guarantees that more closely match user requirements.

A user would select an alternate runtime at link time via a compiler flag.

The proposed alternate runtimes are:

See the previous discussion, section 8, for more details on possible variants (the details of the default accuracy, exception, and errno goals are now in the glibc manual, which supersedes the message linked from section 2). That discussion proposes compile-time rather than link-time selection, with correctly rounded functions chosen by the user via names such as crsin rather than by a compile-time option at all.

1.1. Fast Implementation

The fast implementation would allow results to be off by several ULPs.

The results are not required to be correctly rounded.

Selected by compiling with gcc using the existing -ffast-math flag (which also enables other math-related optimizations), or using the not-yet-implemented -ffast-libm flag.

1.2. Default Implementation

The default implementation is what we currently have with the IBM multi-precision code in the library.

Selected by compiling without any special options i.e. the default.

1.3. Precise Implementation

The precise implementation would strive for strict IEEE correctly rounded results, even at a cost in speed.

Selected by compiling with gcc and the not-yet-implemented -fprecise-libm flag.

1.4. Design

The alternate runtimes will be implemented as alternate entry points in the same library. The fast versions of the functions will be called __fast_* and the precise versions of the functions will be called __cr_* where cr stands for "correctly rounded." The existing implementation will continue to use the existing symbols.

The compiler will automatically rewrite a call to * as a call to __fast_* if the -ffast-libm option was given, and likewise for -fprecise-libm.

By using alternate entry points we ensure that the compilation unit always runs with the code it was intended to run with and that the properties of libm can't easily be changed at runtime (ignoring interposition of a libm with __fast_* symbols which aren't fast).

Initially libm can alias all __fast_* and __cr_* symbols to the existing symbols, but we can migrate symbols to new versions when we provide fast or correctly rounded variants of the functions to the library.

2. Vector Entry Points

The math library may provide small-vector entry points to reduce the cost of calling libm functions on multiple input arguments.

The compiler may collect several calls to a function and then call an alternate vector entry point, e.g. __vec4_fast_*.

2.1. Design

The math library must provide small-vector entry points for certain functions.

The compiler must be able to auto-vectorize scalar calls into vector calls of the same functions.

Analysis needs to be done to decide exactly what size is beneficial.

The alignment of the types must be precisely documented.

Large vectors are excluded because they can more efficiently be handled by accelerators or other APIs.

3. Addition of x86_64 vector math functions to Glibc

3.1. Goal

The main goal is to improve GCC vectorization with OpenMP 4.0 SIMD constructs (#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) and Cilk Plus constructs (#6-8 in http://www.cilkplus.org/sites/default/files/open_specifications/Intel_Cilk_plus_lang_spec_1.2.htm) on x86_64 by adding SSE4, AVX, and AVX2 vector implementations of several math functions (float and double versions). AVX-512 versions are planned for later. Developers can also use these functions manually (with intrinsics) to obtain a speedup.

3.2. What are vector math functions?

Vector math functions are vector variants of the corresponding scalar math operations, implemented with the SSE4, AVX, and AVX2 instruction sets. They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. Using vector math functions is faster than repeatedly calling the scalar math routines. However, these vector versions differ from the scalar analogues in accuracy and in behavior on special values. The functions are optimized for performance on inputs whose processing does not involve special values such as denormals, overflow, underflow, or out-of-range arguments. Special values are processed in a scalar fashion via calls to the respective scalar routines. Additionally, functions such as the trigonometric ones may resort to scalar processing of huge (or other) arguments that do not necessarily produce special values but require different, less SIMD-friendly handling.

These functions were tested (via reasonable random sampling) to pass a 4-ulp maximum relative error criterion on their domains in round-to-nearest computation mode. Known limitations:

C99 compliance in terms of special values and errno:

a. Functions may not raise exceptions as required by the C language standard, and may raise spurious exceptions. This is considered an artifact of SIMD processing and may be fixed in the future on a case-by-case basis.

b. Functions may not change errno in some of the required cases, e.g. if the SIMD-friendly algorithm is branch-free and does not call libm for that value. This is done for performance reasons.

c. Because the implementation depends on libm, it may inherit some of libm's accuracy and special-case problems.

d. Functions do not guarantee fully correct results in computation modes other than round-to-nearest.

3.3. Integration and usage model

Currently, a call to a vector math function can be generated by GCC (from version 4.9.0) if the developer uses OpenMP SIMD constructs and passes -fopenmp.

These functions don't set errno and are less accurate, so we need to require -ffast-math (which sets -fno-math-errno) to use them (or perhaps enable -ffast-math under -fopenmp). The name of a vector function generated by GCC is based on the Vector Function ABI (http://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf), with small differences at the moment (mainly b, c, d instead of x, y, Y), but we will fix that mangling.

For example, the following code in file cos.c:

#pragma omp declare simd
extern double cos(double);

int N = 300;
double b[300];
double a[300];

int main(void) {
  int i;

#pragma omp simd
  for (i = 0; i < N; i++)
    b[i] = cos(a[i]);

  return 0;
}

when built by gcc 4.9.0 with the following command: gcc ./cos.c -I/PATH_TO_GLIBC_INSTALL/include -L/PATH_TO_GLIBC_INSTALL/lib/ -O1 -fopenmp -lm -mavx2, produces a binary in which nm a.out | grep ZGV shows the vector call

(here PATH_TO_GLIBC_INSTALL is the path to a Glibc already built with _ZGVdN4v_cos).

3.4. Testsuite

We plan to test the vector functions based on the testsuite for the scalar ones. This will be done with wrappers that pack the argument values into vector registers, pass them to the vector function, check that the elements of the result vector are equal, and return the extracted value. Only the round-to-nearest rounding mode will be checked.

3.5. Open questions

a. Should the new functions go into Glibc or into GCC?

b. Should they live in libm or in a separate libmvec?

c. Glibc build requirements?

d. How to handle other architectures that have different sets of vectorized functions, and that possibly do not have the same set of vector ISAs for each function?

e. How to handle the case where Glibc has no vector function but the application references one (due to an older Glibc version being installed)?