CPU dispatching in libc

Agner Fog agner@agner.org
Thu Aug 7 10:15:00 GMT 2008


Ryan S. Arnold wrote:
 >Agner wrote:
 >> Does such a CPU dispatching exist in libc? How does it work? It should
 >>be possible to compile a static binary on a system with SSE-whatever,
 >>and run it on a system with SSE-something-else. Therefore, I want the
 >>CPU-dispatching to be inside libc.

 >We (IBM) had discussions with AMD and Intel at the 2007 GCC Summit where
 >they indicated that they were interested in dynamic runtime checks for
 >hardware capability which would route the application to the correct CPU
 >optimized function implementation while the application was running by
 >using a first-time-called hwcap check.
 >The 'first-time-called' hwcap check would work by having a wrapper
 >function check to see if it had an internal function pointer set for an
 >optimized version of the function. If not, then it'd check the hwcap
 >for the specific platform information, find the correct function pointer
 >and set it. Subsequent calls wouldn't pay this resolution
 >penalty. I'm not sure if they made any progress on this. H.J. Lu at
 >Intel would probably be able to tell you.

A framework for CPU dispatching must be in place before any progress 
can be made on optimized versions. So that explains why the memory and 
string functions in libc are so slow. What are you doing with the math 
functions? Most other libraries use SSE2 for math functions when it is 
available. I can't find the math functions in libc, so I don't know 
what you are doing there.

 >You should contact H.J Lu (via email and CC this mailing list) and ask
 >him if they made any progress with their 'first-time-called'
 >optimization checks idea.

I have CC'ed this mail to him.

If CPU dispatching is not implemented yet, here is my proposal for an 
efficient mechanism:
The function entry consists of JMP POINTER, where POINTER is a code 
pointer stored in the data segment.
POINTER initially points to a dispatcher. The dispatcher calls a 
function WhichInstructionSetDoIHave. According to the value returned, it 
changes POINTER to point to the optimal version of the code. It then 
jumps to [POINTER]. The next time the function is called, it goes 
through POINTER directly to the optimal version. The cost of dispatching 
is then just one single instruction, except for the first call. (A 
32-bit position-independent version needs to load a reference address 
into ecx via a thunk first.)
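
A minimal C sketch of this scheme, to make the control flow concrete. 
All names except POINTER and WhichInstructionSetDoIHave are hypothetical, 
the two "versions" are plain byte loops standing in for tuned code, and 
WhichInstructionSetDoIHave is a stub with an assumed numbering (2 or 
higher meaning SSE2). In C the entry becomes an indirect call rather 
than a single JMP, but the logic is the same.

```c
#include <stddef.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

static int dispatch_runs = 0;   /* counts trips through the dispatcher */

/* Stand-in for the baseline version: a plain byte loop. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Placeholder for an SSE2 version; a real one would use XMM registers. */
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    return copy_generic(dst, src, n);
}

/* Hypothetical stub: a real implementation would use CPUID. */
static int WhichInstructionSetDoIHave(void)
{
    return 1;   /* assumed numbering: 1 = baseline, 2 or more = SSE2 */
}

static void *copy_dispatch(void *dst, const void *src, size_t n);

/* POINTER lives in the data segment and starts at the dispatcher. */
static copy_fn POINTER = copy_dispatch;

/* Runs on the first call only: pick a version, patch POINTER, continue. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    dispatch_runs++;
    POINTER = (WhichInstructionSetDoIHave() >= 2) ? copy_sse2 : copy_generic;
    return POINTER(dst, src, n);
}

/* Public entry: nothing but the indirect transfer through POINTER. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    return POINTER(dst, src, n);
}
```

Calling my_memcpy twice goes through copy_dispatch only once; the 
second call reaches the selected version directly.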

The code for the most probable path should be placed immediately after 
JMP POINTER.

The WhichInstructionSetDoIHave function reads its value from a variable 
CurrentInstructionSet in the data segment. This variable is initially 
zero, indicating that it must use CPUID etc. to determine the 
instruction set. It is possible to detect whether XMM registers are 
enabled by using the FXSAVE/FXRSTOR instructions rather than asking the 
operating system or catching an exception. This will make it easier to 
port libc to different operating systems.
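
A rough sketch of the cached lookup, assuming GCC or Clang on x86 (the 
<cpuid.h> helper __get_cpuid). The level numbering and the helper 
detect_instruction_set are assumptions made for illustration. The check 
that the operating system has enabled the XMM registers (the 
FXSAVE/FXRSTOR test) is not shown, so this reports what the CPU 
advertises only.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>              /* GCC/Clang helper for the CPUID instruction */
#endif

/* Assumed numbering; zero means "not determined yet", as in the text. */
enum {
    ISET_UNKNOWN = 0,
    ISET_BASE    = 1,
    ISET_SSE     = 2,
    ISET_SSE2    = 3,
    ISET_SSE3    = 4
};

int CurrentInstructionSet = ISET_UNKNOWN;

/* Hypothetical helper: query CPUID leaf 1 and map feature bits to a level. */
static int detect_instruction_set(void)
{
    int level = ISET_BASE;
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        if (edx & (1u << 25))                         /* EDX bit 25: SSE  */
            level = ISET_SSE;
        if (level == ISET_SSE && (edx & (1u << 26)))  /* EDX bit 26: SSE2 */
            level = ISET_SSE2;
        if (level == ISET_SSE2 && (ecx & 1u))         /* ECX bit 0: SSE3  */
            level = ISET_SSE3;
    }
#endif
    return level;
}

/* Detect once, then answer from the variable on every later call. */
int WhichInstructionSetDoIHave(void)
{
    if (CurrentInstructionSet == ISET_UNKNOWN)
        CurrentInstructionSet = detect_instruction_set();
    return CurrentInstructionSet;
}
```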

For testing purposes, it should be possible to change the value of 
CurrentInstructionSet. Set it to a lower value to test the versions for 
older instruction sets, or to a higher value to test new versions if you 
have an emulator for that instruction set.
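
One way such a test hook might look, as a sketch. The names sum_v1, 
sum_v2, sum and force_instruction_set are hypothetical, and detection is 
reduced to a fixed default. Note one consequence of the pointer scheme: 
once POINTER has been resolved, changing CurrentInstructionSet alone has 
no effect, so the hook also points the function back at its dispatcher.

```c
#include <stddef.h>

int CurrentInstructionSet = 0;  /* 0 = not yet detected */

/* Two hypothetical versions of one routine; results must agree. */
static long sum_v1(const int *p, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Unrolled by two, standing in for a newer instruction set. */
static long sum_v2(const int *p, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < n)
        s0 += p[i];
    return s0 + s1;
}

static long sum_dispatch(const int *p, size_t n);
static long (*sum_pointer)(const int *, size_t) = sum_dispatch;

static long sum_dispatch(const int *p, size_t n)
{
    if (CurrentInstructionSet == 0)
        CurrentInstructionSet = 1;      /* real detection omitted here */
    sum_pointer = (CurrentInstructionSet >= 2) ? sum_v2 : sum_v1;
    return sum_pointer(p, n);
}

long sum(const int *p, size_t n)
{
    return sum_pointer(p, n);
}

/* Test hook: force a level and route the next call via the dispatcher. */
void force_instruction_set(int level)
{
    CurrentInstructionSet = level;
    sum_pointer = sum_dispatch;
}
```

A test can then call force_instruction_set for each level in turn and 
compare the results of sum across all versions.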


