Use of ymm16-ymm31 in the evex string/memory functions in sysdeps/x86_64/multiarch sets HI16_ZMM_state to true. See:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf#page=321

This defeats the init optimization of the various xsave* instructions used for context switching. Overall it adds at least 1024 bytes to each context switch.

Simple reproduction:

```
	.global	_start
	.text
_start:
	vpxorq	%ymm16, %ymm16, %ymm16	# dirty a register in the ymm16-ymm31 range
	vzeroupper			# does not clear ymm16-ymm31
loop:
	jmp	loop			# spin so the process keeps being context switched
	movl	$60, %eax		# exit(0); unreachable after the infinite loop
	xorl	%edi, %edi
	syscall
```

Then check:

cat /proc/${pid}/arch_status

which will show the state being continuously updated. The state is updated during context switch here:
https://elixir.bootlin.com/linux/v5.15-rc1/source/arch/x86/kernel/fpu/core.c#L108
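To watch the value over time rather than with a single `cat`, a small shell sketch like the one below may help. The file name `repro.s` and the helper names `parse_avx512_elapsed` and `watch_avx512` are made up for illustration. It assumes the kernel exposes an `AVX512_elapsed_ms` field in arch_status and that the reproduction is run on an AVX512-capable CPU.

```shell
# Sketch only: repro.s is an assumed file name holding the assembly above.
# Build and start it by hand (needs an AVX512-capable CPU):
#   as repro.s -o repro.o && ld repro.o -o repro && ./repro &

# Extract the AVX512_elapsed_ms value from arch_status-formatted text on stdin.
parse_avx512_elapsed() {
    awk '$1 == "AVX512_elapsed_ms:" { print $2 }'
}

# Print the value for a given pid once per second, `count` times (default 5).
watch_avx512() {
    pid=$1
    count=${2:-5}
    # Bail out if the kernel does not expose arch_status for this pid.
    [ -r "/proc/$pid/arch_status" ] || return 1
    while [ "$count" -gt 0 ]; do
        parse_avx512_elapsed < "/proc/$pid/arch_status"
        count=$((count - 1))
        sleep 1
    done
}
```

Usage would be something like `watch_avx512 "$pid" 10`. If the state really is being saved on every context switch, the printed value should stay small instead of growing between samples.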
My general opinion is that we should move the current evex functions to evex-rtm and add a new class of evex functions which may use avx512 instructions but stay in the ymm0-ymm15 register range.

Benefits:
1) evex instructions cost more code size (+2 bytes at least).
2) It's impossible to encode certain useful instructions with the evex prefix (e.g. `vpcmpeq`).
3) We may be adding 1024 bytes to users' context switches.

Costs:
1) vzeroupper is not free (in terms of code size or execution).
2) More total code size consumed by the library (this is limited by the fact that the functions will be in their own section and users will generally only stay in one section for all string/memory functions).
Also worth noting that if we stick to the vec0-vec15 range we may be able to get away with using `zmm` registers, as `vzeroupper` does appear to effectively clear `ZMM_HI256_state`, so any context switch/frequency burdens would be contained to the function.