This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v2] x86-64: Optimize strcmp/wcscmp with AVX2
On Sat, 2018-06-02 at 10:44 +0300, Alexander Monakov wrote:
> On Fri, 1 Jun 2018, Leonardo Sandoval wrote:
> > this is partially true for AVX2 FMA and AVX512. What I am proposing
> > contains none of the latter instructions, just AVX2 without FMA
> > instructions.
>
> This would address my concern (if true for all CPUs), but ...
>
> > On the other hand, some microbenchmarks were done to see the
> > benefit of this effort, which is summarized in the commit
> > description, but the complete picture is here
>
> this does not. The whole point was that frequency behavior means the
> slowdown on programs making *occasional* calls to strcmp will not be
> captured by microbenchmarks. What good is saving dozens of cycles on
> strcmp calls if the remaining program is slowed down by 5%?
>
Right, perhaps microbenchmarks do not tell us much in this case
because AVX and non-AVX code are not mixed. Also, if you look at the
patch, the upper ymm bits are cleared (vzeroupper) before returning from
strcmp, so there is no perf penalty from storing these and then
restoring them when other AVX code is called again.
As I said before, using strcmp won't hurt performance at all (our
internal HW perf team confirmed this) because we are not using any
opcode that may drop the frequency.
If you have a test scenario that demonstrates the 5% drop, I would like
to test it and discuss it further.
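
A minimal sketch of what such a test scenario could look like, assuming
a purely scalar workload that makes an occasional strcmp call. The names
now_ns, scalar_work and run_mixed are hypothetical and not part of the
patch. The strings are equal on purpose, so strcmp scans the whole
buffer and the checksum does not depend on how often it is called.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical scalar workload: a serial xorshift64 chain.  Each step
   depends on the previous one, so it stays scalar. */
static uint64_t scalar_work(uint64_t x, unsigned long iters)
{
    for (unsigned long i = 0; i < iters; i++) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
    }
    return x;
}

/* Run `rounds` chunks of scalar work and call strcmp once every
   `period` rounds (period == 0 means never).  Returns the elapsed
   time in nanoseconds and stores a checksum so the work is not
   optimized away. */
static uint64_t run_mixed(unsigned long rounds, unsigned long period,
                          uint64_t *checksum)
{
    char a[256], b[256];
    /* volatile pointers keep the compiler from folding the strcmp call */
    const char *volatile pa = a;
    const char *volatile pb = b;
    uint64_t x = 88172645463325252ull;
    uint64_t start;

    assert(checksum != NULL);
    memset(a, 'a', sizeof a - 1);
    a[sizeof a - 1] = '\0';
    memset(b, 'a', sizeof b - 1);
    b[sizeof b - 1] = '\0';

    start = now_ns();
    for (unsigned long r = 0; r < rounds; r++) {
        x = scalar_work(x, 10000);
        if (period != 0 && r % period == 0)
            x += (uint64_t)(strcmp(pa, pb) != 0); /* equal strings: adds 0 */
    }
    *checksum = x;
    return now_ns() - start;
}
```

Comparing run_mixed(rounds, 0, &sum) against run_mixed(rounds, 1000, &sum)
for a large rounds value gives a first rough signal. Wall-clock time
alone does not isolate frequency effects, so it would have to be paired
with a tool that reports the actual core clock during the run.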
> I was missing that AVX frequency limits kick in only if "heavy"
> operations are used -- on recent generations. I'm not sure that's
> true for older, e.g. Haswell, generations. Intel's white paper
> explaining Haswell AVX clocks makes no distinction of "light" vs.
> "heavy" operations:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf
>
> Can you please clarify further?
>
> Alexander