Created attachment 14722 [details]
A testcase

The performance of the generic fmod and remainder depends heavily on the
input values. The x87 (fprem/fprem1) implementations can be much faster.
On Intel Coffee Lake, the testcase takes 3.15s of user time with the generic
(sse) build and 0.25s with the x87 build, roughly 12 times faster:

[hjl@gnu-cfl-3 fmod-2]$ make
gcc -O2 -c -o test.o test.c
gcc -c -o x87.o x87.S
gcc -static -o x87 test.o x87.o
gcc -static -o sse test.o -lm
time ./sse
3.15user 0.00system 0:03.15elapsed 99%CPU (0avgtext+0avgdata 684maxresident)k
0inputs+0outputs (0major+39minor)pagefaults 0swaps
time ./x87
0.25user 0.00system 0:00.25elapsed 99%CPU (0avgtext+0avgdata 680maxresident)k
0inputs+0outputs (0major+37minor)pagefaults 0swaps
[hjl@gnu-cfl-3 fmod-2]$