Twiddling with 64-bit values as 2 ints;

Mon Aug 23 14:11:28 GMT 2021

On 23/08/2021 10:18, Stefan Kanthak wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
>> On 21/08/2021 10:34, Stefan Kanthak wrote:
>>>
>>> (Heretic.-) questions:
>>> - why does glibc still employ such ugly code?
>>> - Why doesn't glibc take advantage of 64-bit integers in such code?
>>
>> Because no one cared to adjust the implementation.  Recently Wilco
>> has removed a lot of old code that still used 32-bit instead of 64-bit
>> values to do bit twiddling in floating-point implementations (check
>> caa884dda7 and 9e97f239eae1f2).
> 
> That's good to hear.
> 
>> I think we should move to use simpler code assuming a 64-bit CPU
> 
> D'accord.
> And there's a second direction where you might move: almost all CPUs
> have separate general purpose registers and floating-point registers.
> Bit-twiddling generally needs extra (and sometimes slow) transfers
> between them.
> In 32-bit environment, where arguments are typically passed on the
> stack, at least loading an argument from the stack into a GPR or FPR
> makes no difference.
> In 64-bit environment, where arguments are passed in registers, they
> should be operated on in these registers.
> 
> So: why not implement routines like nextafter() without bit-twiddling,
> using floating-point as far as possible for architectures where this
> gives better results?

Mainly because some math routines are not performance critical, in the
sense that they are usually not hotspots.  For these I would prefer the
simplest code that works with reasonable performance independently of
the underlying ABI or architecture (using integer operations might be
better for a soft-fp ABI, for instance).

For symbols that might be performance critical, we do have more optimized
versions.  Szabolcs and Wilco spent considerable time tuning a lot of
math functions and removing the slow code paths; also, for some routines
we have internal defines that map them to compiler builtins when we know
that the compiler and architecture allow us to do so (check the rounding
routines or sqrt for instance).

Recently we have been aiming to avoid arch-specific code for complex
routines and to prefer C implementations that leverage compiler support.
This makes the code *much* more maintainable and removes the need to keep
re-evaluating the routines on each new iteration of an architecture (some
arch-specific routines have proven to be slower than well-coded generic
implementations).

> 
> The simple implementation I showed in my initial post improved the
> throughput in my benchmark (on AMD64) by an order of magnitude.
> In Szabolcs Nagy's benchmark measuring latency it took 0.04ns/call
> longer (5.72ns vs. 5.68ns) -- despite the POOR job GCC does on FP.

Your implementation triggered a lot of regressions; you will need to sort
these out before we consider performance numbers.  Also, we will need
a proper benchmark to evaluate it, as Szabolcs and Wilco have done for
their math work.
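
Since throughput and latency numbers are both quoted above, here is a
rough sketch of how the two differ in a micro-benchmark.  This is not the
glibc benchtests framework: bench_latency_ns, bench_throughput_ns and
sample_fn are hypothetical names, clock() is used only because it is
standard C, and a real evaluation would need representative inputs,
warm-up, repeated runs, and care that the compiler does not optimize the
loops away.

```c
#include <stddef.h>
#include <time.h>

typedef double (*unary_fn)(double);

/* Stand-in for the routine under test (hypothetical).  */
double sample_fn(double x)
{
    return 0.5 * x + 1.0;
}

/* Latency: every call consumes the previous result, so consecutive
   calls cannot overlap.  Returns CPU nanoseconds per call.  */
double bench_latency_ns(unary_fn f, double start, size_t iters)
{
    double x = start;
    clock_t t0 = clock();
    for (size_t i = 0; i < iters; i++)
        x = f(x);
    clock_t t1 = clock();
    volatile double sink = x;           /* keep the result live */
    (void) sink;
    return 1e9 * (double) (t1 - t0) / CLOCKS_PER_SEC / (double) iters;
}

/* Throughput: the inputs are independent of each other, so the CPU may
   overlap consecutive calls.  Returns CPU nanoseconds per call.  */
double bench_throughput_ns(unary_fn f, const double *in, size_t n,
                           size_t reps)
{
    double acc = 0.0;
    clock_t t0 = clock();
    for (size_t r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            acc += f(in[i]);
    clock_t t1 = clock();
    volatile double sink = acc;         /* keep the results live */
    (void) sink;
    return 1e9 * (double) (t1 - t0) / CLOCKS_PER_SEC
           / ((double) n * (double) reps);
}
```

A routine can look much better in the second loop than in the first,
so a comparison should state which of the two was measured.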

> 
> Does GLIBC offer a macro like "PREFER_FP_IMPLEMENTATION" that can be
> used to select between the integer bit-twiddling code and FP-preferring
> code during compilation?

No, and I don't think this would be a good addition.  As before, I would
prefer to have a simple generic implementation that gives us good
performance on modern hardware instead of a configurable one with many
tunables.  The latter increases the maintenance cost (with testing and
performance evaluation).

> 
>> and let the compiler optimize it (although unfortunately gcc is not
>> that smart in all cases).
> 
> I know, and I just learned that GCC does NOT perform quite some
> optimisations I expect from a mature compiler.
> Quoting Jakub Jelinek on gcc@gcc.gnu.org:
> 
> | GCC doesn't do value range propagation of floating point values, not
> | even the special ones like NaNs, infinities, +/- zeros etc., and without
> | that the earlier ifs aren't taken into account for the earlier code.
> 
> The code I used to demonstrate this deficiency is TOMS 722...
> 
> Stefan
>