[PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c

Adhemerval Zanella Netto adhemerval.zanella@linaro.org
Tue Mar 5 15:25:36 GMT 2024



On 29/02/24 17:19, Alexander Monakov wrote:
> 
> On Fri, 23 Feb 2024, Adhemerval Zanella Netto wrote:
> 
>> Alexandre, are you reservation about this optimization related to extra code
>> and data required to optimize for a limited input range?
> 
> No, my concern is more general. As I see it, Noah is offering target-specific
> feedback without making it clear whether he is deferring high-level decisions
> to someone else, or taking the responsibility for them himself (and giving
> an implicit ack by jumping straight to technical review). But as Rich said,
> high-level review really need to be done before the patch is rerolled to v8
> on coding style and other miscellanea. That includes:
> 
> 1. "Is this desirable on the high level?" The people who initially bear
> the cost of mistakes are the users (who did not ask for an AVX2 memmem
> in the first place) and distribution maintainers who triage the issues. Adding
> a new SIMD variant to Glibc is not without cost. Why is it important that
> Glibc carries an AVX2 memmem which achieves only a 2x speedup according to
> microbenchmark provided by the submitter, despite using 32-byte vectors?
> Shouldn't it aim for a 32x speedup over the generic implementation?
> Would you entertain AVX-512 strfry and memfrob?

I tend to agree and I was outvoted when Intel proposed a SSE/AVX2 optimized
strcat implementation (specially because we already have optimized strlen
and strcpy, and strcat is also a bad interface).

But for memmem/strstr SIMD version I don't have strong opinion, nor which
speedup threshold we should aim for inclusion. I tend to agree with you that 
a 2x speedup with a limited haystack size for such code complexity
is not really ideal.  

> 
> 2. "Is the algorithm correct?"
> 
> 3. "Is the algorithm efficient?" (big-O time and space complexity)

Also agree, and I think we already have previous discussion before that
inefficient implementations should be not accepted, specially when the
generic implementation does not show the deficiency. 

> 
> 4. "Are the risks of bugs and regressions acceptable?"
> 
> 5. "Are there any potential security issues?"
> 
> 6. "Are the size and energy trade-offs acceptable?" In this particular case,
> the look-up table probably incurs a page fault on first use, and might even
> cause an extra page fault for programs that don't use memmem, by virtue of
> pushing apart other read-only data that is more frequently used. A micro-
> benchmark wouldn't capture this.

This would be quite hard to evaluate, but I agree that we should be parsimonious
about data segment increase. 

> 
> 7. "Is test coverage adequate?" If I understand correctly, the difficult
> cases from the strstr testsuite were not used for memmem, and there was
> no discussion of cases that hit the worst case for the proposed algorithm.

Yes, we are lacking some testing coverage for cases that might trigger
quadratic behavior on some case. I added some extra tests on my recent
wcsstr patch [1] but I do agree that we should improve it further.

> 
> I see AVX-512 strstr was accepted without mentioning it's O(n*m).

Yes, and I think it was a mistake (I was not aware of this until now).
So now we some arch optimizations for strstr/memmem/strcasestr:

  1. sysdeps/x86_64/multiarch/strstr-sse2-unaligned.S

  2. sysdeps/x86_64/multiarch/strstr-avx512.c

  3. sysdeps/powerpc/powerpc64/power8/strcasestr.S

  4. sysdeps/s390/strstr-arch13.S

  5. sysdeps/s390/memmem-arch13.S

The x86_64 sse2 one (1.) seems to be optimizing the linear search for
short needles similar to generic implementation (strstr2/strstr3).

I have not dig into the x86_64 avx one (2.), but if this really O(n*m) I
think we should remove it.

For powerpc my wild guess this is similar to the old ststr optimization 
where it was not really an improvement (1e9a550ba41a5453c6578bb748fe2223a87e3024).

The s390 ones (4., 5.) seems similar to x86_64 sse2 one where it optimizes
the linear search for short needles (but I not fully sure if it is not
O(n*m)).

So I think it would be worth to discuss if we should to remove the x86_64
avx512 one and set the bar to avoid adding new strstr/memmem/strcasestr
with O(n*m) behavior.

Thoughts?

> 
> Alexander

[1] https://patchwork.sourceware.org/project/glibc/patch/20240301171524.3706554-3-adhemerval.zanella@linaro.org/


More information about the Libc-alpha mailing list