Bug 28354 - For x86_64 string/memory functions use of EVEX registers sets HI16_ZMM_state adding context switch overhead
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: string
Version: 2.34
Importance: P2 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-20 02:23 UTC by Noah Goldstein
Modified: 2021-09-22 21:10 UTC
CC List: 2 users

See Also:
Host:
Target: x86_64-linux
Build:
Last reconfirmed:


Description Noah Goldstein 2021-09-20 02:23:27 UTC
Use of ymm16-ymm31 in the evex string/memory functions in sysdeps/x86_64/multiarch sets HI16_ZMM_state to true.
See: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf#page=321

This defeats the init optimization of various xsave* context switching instructions. Overall it adds at the very least 1024 bytes to context switches.
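For reference, the 1024-byte figure follows from the register sizes of the AVX-512 XSAVE state components as described in the Intel SDM. The snippet below is only a back-of-the-envelope sketch: the enum and function names are made up for illustration and are not glibc or kernel identifiers, and it ignores any alignment or header effects of the actual save area layout.

```c
#include <stddef.h>

/* Sizes of the AVX-512 XSAVE state components (per the Intel SDM).
   All names here are hypothetical and used for illustration only. */
enum {
    OPMASK_BYTES    = 8 * 8,   /* k0-k7, 64 bits each                */
    ZMM_HI256_BYTES = 16 * 32, /* upper 256 bits of zmm0-zmm15       */
    HI16_ZMM_BYTES  = 16 * 64  /* full 512 bits of zmm16-zmm31       */
};

/* Bytes of register state that can no longer be skipped by the init
   optimization once the given components are marked in use. */
size_t extra_xsave_bytes(int uses_hi16_zmm, int uses_zmm_hi256)
{
    size_t bytes = 0;
    if (uses_hi16_zmm)
        bytes += HI16_ZMM_BYTES;
    if (uses_zmm_hi256)
        bytes += ZMM_HI256_BYTES;
    return bytes;
}
```

Touching only ymm16 is enough to mark the whole HI16_ZMM component as in use, so `extra_xsave_bytes(1, 0)` gives the 1024 bytes quoted above.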

Simple reproduction:

```
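	.global	_start
	.text
_start:
	# Touch ymm16 so that HI16_ZMM_state is marked in use.
	vpxorq	%ymm16, %ymm16, %ymm16
	# vzeroupper only affects registers 0-15; ymm16 is left as is.
	vzeroupper

	# Spin forever so the process can be inspected from another shell.
loop:
	jmp	loop

	# Not reached: exit(0).
	movl	$60, %eax
	xorl	%edi, %edi
	syscall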
```

Then check:

cat /proc/${pid}/arch_status


This will show the state being continuously updated.
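To watch this programmatically rather than with `cat`, something like the following could be used. It is a minimal sketch, assuming the kernel exposes the `AVX512_elapsed_ms` field in `/proc/<pid>/arch_status` as described in the kernel's proc documentation; the helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Extract the AVX512_elapsed_ms value from the text of an arch_status
   file. Returns 0 and stores the value in *out on success, or -1 if the
   field is missing or malformed. Helper names are illustrative only. */
int parse_avx512_elapsed_ms(const char *text, long *out)
{
    static const char key[] = "AVX512_elapsed_ms:";
    const char *p = strstr(text, key);
    if (p == NULL)
        return -1;
    /* %ld skips the whitespace between the colon and the number. */
    return sscanf(p + sizeof key - 1, "%ld", out) == 1 ? 0 : -1;
}

/* Read and parse /proc/<pid>/arch_status for the given pid. */
int read_avx512_elapsed_ms(int pid, long *out)
{
    char path[64];
    char buf[256];
    snprintf(path, sizeof path, "/proc/%d/arch_status", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_avx512_elapsed_ms(buf, out);
}
```

Calling `read_avx512_elapsed_ms(pid, &ms)` in a loop against the reproducer should keep returning a small, repeatedly refreshed value; as far as I know a value of -1 means no AVX-512 usage has been recorded for the task.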

State is updated during context switch here: https://elixir.bootlin.com/linux/v5.15-rc1/source/arch/x86/kernel/fpu/core.c#L108
Comment 1 Noah Goldstein 2021-09-20 02:28:55 UTC
My general opinion is that we should move the current evex functions to evex-rtm and add a new class of evex functions which may use avx512 instructions but stay in the ymm0-ymm15 register range.

Benefits:

1) evex instructions cost more code size (+2 bytes at least)
2) It's impossible to encode certain useful instructions with the evex prefix (e.g. `vpcmpeq`)
3) We may be adding 1024 bytes to users' context switches.


Costs:

1) vzeroupper is not free (in terms of code size or execution).
2) more total code size consumed by the library (this is limited by the fact that they will be in their own section and users will generally only stay in one section for all string/memory functions)
Comment 2 Noah Goldstein 2021-09-22 21:10:43 UTC
Also worth noting that if we stick to the vec0-vec15 range, we may be able to get away with using `zmm` registers, as `vzeroupper` does appear to effectively clear `ZMM_HI256_state`, so any context switch/frequency burdens would be contained to the function.