This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] Optimized generic expf and exp2f

From: Arjan van de Ven <arjan at linux dot intel dot com>
To: Szabolcs Nagy <szabolcs dot nagy at arm dot com>, Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, Joseph Myers <joseph at codesourcery dot com>
Cc: nd at arm dot com
Date: Wed, 6 Sep 2017 07:22:09 -0700
Subject: Re: [PATCH] Optimized generic expf and exp2f

On 9/6/2017 7:14 AM, Szabolcs Nagy wrote:

>> interesting; it takes 2 independent FP adds and a compare (in C) to detect nearest rounding
>> being in effect (which in time can overlap with the float->double conversion)
>> so if there's an option to reduce the algorithm by more than that for a fast
>> path...
>>
>> (also, some CPUs (like newer Intel) support an instruction prefix encoding to force
>> rounding modes on a FP instruction independent of the global rounding mode,
>> which at some point maybe should be a gcc pragma or attribute or something,
>> and then used in such C code)
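
One way the "2 independent FP adds and a compare" check could look in C is sketched below. This is only an illustration, not code from the patch: the function name rounding_is_nearest and the constant 1e-30 are made up for the example. The idea is that 1 + tiny and 1 - tiny both round back to 1.0 only under round-to-nearest; every directed mode moves exactly one of the two results by an ulp.

```c
/* Hypothetical helper, not from the patch under discussion.
   Returns 1 if round-to-nearest is in effect, 0 otherwise. */
static int rounding_is_nearest(void)
{
    /* volatile keeps the compiler from folding the adds at compile time,
       where round-to-nearest would be assumed. */
    volatile double one = 1.0;
    volatile double tiny = 1e-30;      /* far below half an ulp of 1.0 */

    /* Two independent adds:
       nearest:     up == 1.0,        down == 1.0
       upward:      up == 1.0 + ulp,  down == 1.0
       downward:    up == 1.0,        down == 1.0 - ulp/2
       toward zero: up == 1.0,        down == 1.0 - ulp/2 */
    volatile double up = one + tiny;
    volatile double down = one - tiny;

    return up == down;                 /* the single compare */
}
```

Switching modes with fesetround() from <fenv.h> before the call should flip the result to 0 for FE_UPWARD, FE_DOWNWARD and FE_TOWARDZERO.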


> i don't think reducing the polynomial (from order 3 to order 2)
> is possible without bigger lookup table, if less accuracy is
> enough then reducing the table size is possible though:
>
> poly order / table len / ulp error / non-nearest ulp error (rounded)
> 2          / 64        / 0.61      /
> 2          / 128       / 0.51      /
> 2          / 256       / 0.502     /
> 3          / 8         / 0.91      / > 10
> 3          / 16        / 0.526     / 2
> 3          / 32        / 0.502     / 1
> 3          / 64        / 0.5001    / 1
> 4          / 8         / 0.54      /
> 4          / 16        / 0.501     /
> 4          / 32        / 0.50004   /
> 4          / 64        / 0.5       /
>
> the c code uses order=3/table=32, the x86_64 asm uses order=4/table=64
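
For readers unfamiliar with the table/polynomial trade-off in the list above, here is a rough sketch of an exp2f built from a 32-entry table and an order-3 polynomial, evaluated in double. It is not the code from the patch: the names (exp2f_sketch, exp_taylor, pow2_int) are invented for the example, the polynomial uses plain Taylor coefficients rather than whatever fitted coefficients a real implementation would choose, the table is filled at run time, and there is no handling of overflow, underflow, NaN or non-default rounding modes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 32                        /* table length, as in "table=32" */
#define LN2 0.6931471805599453

static double table[N];             /* table[i] ~= 2^(i/N) */
static int table_ready;

/* exp(t) by Taylor series; only used here to fill the table (0 <= t < ln 2). */
static double exp_taylor(double t, int terms)
{
    double sum = 1.0, term = 1.0;
    for (int n = 1; n <= terms; n++) {
        term *= t / n;
        sum += term;
    }
    return sum;
}

/* 2^q for integer q in the normal double range, built from exponent bits. */
static double pow2_int(long q)
{
    uint64_t bits = (uint64_t)(q + 1023) << 52;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

float exp2f_sketch(float x)
{
    assert(x > -126.0f && x < 128.0f);  /* sketch: no special-case handling */
    if (!table_ready) {
        for (int i = 0; i < N; i++)
            table[i] = exp_taylor(i * LN2 / N, 24);
        table_ready = 1;
    }

    double xd = x;                               /* float -> double */
    double z = xd * N;
    long k = (long)(z < 0 ? z - 0.5 : z + 0.5);  /* nearest integer to x*N */
    double r = xd - (double)k / N;               /* |r| <= 1/(2N) */

    long i = ((k % N) + N) % N;                  /* table index, 0..N-1 */
    long q = (k - i) / N;                        /* integer power of two */

    /* order-3 polynomial for 2^r = e^(r ln 2), Taylor coefficients */
    double t = r * LN2;
    double p = 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6.0)));

    return (float)(pow2_int(q) * table[i] * p);
}
```

A bigger table shrinks |r| and so lets a lower-order polynomial reach the same error, which is the trade-off the list above quantifies.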


yeah I don't think it'll work out in terms of saving cycles; on Intel at least
FMA is 4 cycles, but an ADD is 4 cycles as well, so there's no net savings
by doing the 2xADD+compare to save an FMA.
(since the ADDs execute in parallel it's also not likely to be more expensive)

being able to force rounding might still be interesting, since it avoids the whole
right column of your table
