Contents
- Fixup dl-trampoline.S
- Use Cache-Line Size Querying String Routines in ld.so
- Make Malloc Return Type Alignment Honor ISO C-spec
- Improve Time Performance
- Improve rand/random Performance
- Optimize Wrappers For Libm
- Optimize itoa_word using DFP Instructions for P6 and P7
- Optimize math functions for P7: __ieee754_exp, __ieee754_pow, __dubsin
- Thread Priority
- Build Wide Character Strings Functions With -O3 Optimization
- Memcpy Optimizations for 32-bit and 64-bit Cell
- Fix Excessive Implies Files
- Remove no-fpu Context Routine Save and Restore of fprs
Fixup dl-trampoline.S
Save VRS in fpu dl-trampoline.S
- Need to save non-volatile VRS in:
- sysdeps/powerpc/powerpc32/dl-trampoline.S
- sysdeps/powerpc/powerpc64/dl-trampoline.S
Move dl-trampoline.S
- Currently in:
- sysdeps/powerpc/powerpc32/dl-trampoline.S
- sysdeps/powerpc/powerpc64/dl-trampoline.S
- Todo: Move to:
- sysdeps/powerpc/powerpc32/fpu/dl-trampoline.S
- sysdeps/powerpc/powerpc64/fpu/dl-trampoline.S
Add nofpu version of dl-trampoline.S
- Current versions of powerpc dl-trampoline.S use FP regs and are picked up by soft-float build.
- Currently only a problem when building a profile build.
- Todo: Clone and modify for nofpu which doesn't use FP regs in:
- sysdeps/powerpc/powerpc32/nofpu/dl-trampoline.S
- sysdeps/powerpc/powerpc64/nofpu/dl-trampoline.S
Use Cache-Line Size Querying String Routines in ld.so
When building a toolchain for the Power Architecture it is standard to build with -mcpu=power4 (or later) for the base. This populates lib[64]/ with power4 tuned versions of ld[64].so and libc.so, et al as the 'base' libraries.
Following this, optimized versions of libc.so, libm.so, et al are put into lib[64]/power-foo/ but not an optimized version of ld[64].so.
The 'base' ld[64].so is always used. It loads the optimized lib[64]/power-foo/ libraries at application runtime.
This can be problematic when 'base' is built with -mcpu=power4 which has a 128-byte cache-line size but ld[64].so is running on power-foo which has a 64-byte cache-line size.
In this case the base ld[64].so using the power4 optimized string routines makes an incorrect assumption of the cache-line size of power-foo.
To solve this, the dynamic linker should make no assumption about cache-line size in the string routines it uses. It should always be built using the cache-line-size querying string routines. The optimized versions of libc.so can then use the hard-coded cache-line size.
- Currently building with a base of -mcpu=power4 will select:
- sysdeps/powerpc/powerpc32/power4/memcpy.S
- sysdeps/powerpc/powerpc64/power4/memcpy.S
- This is okay for libc.so but ld[64].so needs to have the file selection hard coded to use:
- string/memcpy.c
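As a rough illustration of what "querying" the cache-line size means, the sketch below reads the data cache block size that the Linux kernel passes in the auxiliary vector. The helper name query_cache_line_size is hypothetical, and using getauxval is an assumption made for the example; inside ld.so the value would be taken from the auxiliary vector during startup rather than through this call.

```c
#include <stddef.h>
#if defined(__linux__)
#include <elf.h>       /* AT_DCACHEBSIZE */
#include <sys/auxv.h>  /* getauxval */
#endif

/* Hypothetical helper: return the data cache block size reported by the
   kernel, or 0 when the platform does not report one.  A caller that gets
   0 should fall back to code that makes no cache-line assumption.  */
size_t query_cache_line_size(void)
{
#if defined(__linux__) && defined(AT_DCACHEBSIZE)
    return (size_t) getauxval(AT_DCACHEBSIZE);
#else
    return 0;
#endif
}
```

A string routine built this way would branch on the returned value at run time instead of hard-coding 128 or 64 bytes.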
Make Malloc Return Type Alignment Honor ISO C-spec
- This is required by the PowerPC ABIs as well for the Long Double 128, _Decimal128, and Vector Scalar data-types. Unfortunately the change impacts common code.
Most recent attempt: http://sourceware.org/ml/libc-alpha/2007-11/msg00062.html
- Rejection Rationale: Ulrich Drepper says this breaks EMACS.
- Suggested approach:
- Add version wrappers around the old versions of the function.
- Create new versions of the function which follow the ISO C Spec and psABIs, where the pointer returned by malloc is aligned to the alignment of the largest supported data-type.
- Make sure EMACS explicitly uses the old versions.
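The alignment requirement can be checked with a few lines of C11. This is only a sketch: the helper names is_aligned and malloc_honors_max_align are hypothetical, and using max_align_t as a stand-in for "the largest supported data-type" is an assumption (a given psABI may demand more for vector types).

```c
#include <assert.h>
#include <stddef.h>   /* max_align_t (C11) */
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: non-zero when p is aligned to `alignment` bytes.
   `alignment` must be a power of two.  */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t) p & (alignment - 1)) == 0;
}

/* Hypothetical check: does malloc return storage suitably aligned for
   max_align_t for a request of `size` bytes?  */
int malloc_honors_max_align(size_t size)
{
    void *p = malloc(size);
    int ok = (p != NULL) && is_aligned(p, _Alignof(max_align_t));
    free(p);
    return ok;
}
```

Running malloc_honors_max_align over a range of small sizes on a target is one way to see whether the installed malloc already meets the requirement.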
Improve Time Performance
- unix/time.c calls gettimeofday; linux/time.c uses the time syscall
- gettimeofday will use the vdso
- for vdso platforms unix/time.c is better than linux/time.c
- copy sysdeps/unix/sysv/linux/sparc/sparc64/time.c to sysdeps/unix/sysv/linux/powerpc/powerpc64/time.c
- for extra credit create a linux/powerpc/time.c that calls vdso directly.
- vdso availability is based on a kernel feature. The gettimeofday implementations call INLINE_VSYSCALL(gettimeofday). The INLINE_VSYSCALL macro checks for vdso availability. Lacking that, it calls the slower syscall.
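The idea behind the unix/time.c approach can be sketched as follows. This is an illustration written for this page, not the glibc source; the function name time_via_gettimeofday is hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical stand-in for time(): obtain the seconds from gettimeofday,
   which can be serviced by the vdso where the kernel provides one.  */
time_t time_via_gettimeofday(time_t *t)
{
    struct timeval tv;
    time_t result;

    if (gettimeofday(&tv, NULL) != 0)
        result = (time_t) -1;   /* same error convention as time() */
    else
        result = tv.tv_sec;

    if (t != NULL)
        *t = result;
    return result;
}
```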
Improve rand/random Performance
- For 'random' calls of TYPE_0 the sequence:
    if (buf->rand_type == TYPE_0)
      {
        int32_t val = state[0];
        val = ((state[0] * 1103515245) + 12345) & 0x7fffffff;
        state[0] = val;
        *result = val;
      }
  doesn't need to be protected by a heavy weight lock and unlock when a compare and swap will suffice.
- Todo: In stdlib/random.c:
    __libc_lock_lock (lock);
    (void) __srandom_r (x, &unsafe_state);
    __libc_lock_unlock (lock);
- Add a condition check for TYPE_0:
    if (unsafe_state.rand_type == TYPE_0)
      {
        /* Compare and Exchange and set '&unsafe_state' */
      }
    else
      {
        __libc_lock_lock (lock);
        (void) __srandom_r (x, &unsafe_state);
        __libc_lock_unlock (lock);
      }
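A lock-free TYPE_0 step might look like the sketch below. It uses C11 atomics rather than glibc's internal atomic macros, and the names lcg_state, lcg_seed, and lcg_next are hypothetical; only the multiplier, increment, and mask come from the sequence quoted above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical single-word generator state (TYPE_0 keeps one word).  */
static _Atomic int32_t lcg_state = 1;

void lcg_seed(int32_t seed)
{
    atomic_store_explicit(&lcg_state, seed, memory_order_relaxed);
}

/* Advance the state with a compare-and-swap loop instead of a lock.
   If another thread updated the state first, the CAS fails, `old` is
   reloaded with the current value, and the step is recomputed.  */
int32_t lcg_next(void)
{
    int32_t old = atomic_load_explicit(&lcg_state, memory_order_relaxed);
    int32_t next;

    do {
        /* Unsigned arithmetic avoids signed-overflow undefined behavior.  */
        next = (int32_t) ((((uint32_t) old * 1103515245u) + 12345u)
                          & 0x7fffffffu);
    } while (!atomic_compare_exchange_weak_explicit(&lcg_state, &old, next,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return next;
}
```

Each successful exchange consumes exactly one state transition, so concurrent callers never receive the same step twice, which is the property the lock currently provides.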
Optimize Wrappers For Libm
For P5 and P6, inline the isnan, isinf, etc. tests into the wrapper functions (w_pow.c, w_exp.c, w_log.c, w_sin.c, ...). For P7, inline the ftdiv instruction, especially for w_pow, as it needs to test two operands for isnan(), isfinite(), sign, etc.
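To show what "inlining the tests" could amount to, here is a sketch of bit-pattern classification that a wrapper could perform without a function call. It assumes IEEE 754 binary64 doubles; all names are hypothetical, and it does not use ftdiv, which would replace these checks on P7.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bits of a double into an integer without aliasing problems.  */
static inline uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* With the sign bit cleared, values above the infinity pattern are NaNs,
   values equal to it are infinities, and values below it are finite.  */
int inline_isnan(double x)
{
    return (double_bits(x) & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}

int inline_isinf(double x)
{
    return (double_bits(x) & 0x7fffffffffffffffULL) == 0x7ff0000000000000ULL;
}

int inline_isfinite(double x)
{
    return (double_bits(x) & 0x7fffffffffffffffULL) < 0x7ff0000000000000ULL;
}

/* Hypothetical wrapper test: both pow operands in one place.  */
int pow_operands_need_slow_path(double x, double y)
{
    return !inline_isfinite(x) || !inline_isfinite(y);
}
```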
Optimize itoa_word using DFP Instructions for P6 and P7
- Converting the integer to DPD, then DPD to packed decimal, then converting that to a string will be faster than the multiply/divide by 10.
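For reference, the divide-by-10 digit loop that the DFP path would replace looks roughly like this. It is a simplified sketch, not the glibc _itoa_word source, and the name itoa_word_base10 is hypothetical.

```c
/* Write the decimal digits of `value` backwards, ending just before
   `buflim`, and return a pointer to the first digit.  Each digit costs
   one division (or multiply-by-reciprocal) by 10.  */
char *itoa_word_base10(unsigned long value, char *buflim)
{
    char *p = buflim;

    do {
        *--p = (char) ('0' + value % 10);
        value /= 10;
    } while (value != 0);

    return p;
}
```

The caller supplies a buffer large enough for the widest value and passes a pointer to its end.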
Optimize math functions for P7: __ieee754_exp, __ieee754_pow, __dubsin
- Verify where we can gain some performance in these functions using P7 instructions.
Thread Priority
- Per ISA 2.06, Section 3.1 (page 671), titled "Program Priority Registers", the 'normal' process priority level has changed from "001 medium" to "100 medium low".
Reference "Program Priority Registers per ISA 2.06".
- Leveraging this change allows two levels of performance improvement.
- Kernel Change - Make priority changes permanent. Without the kernel patch the priority will return to the old normal on a context switch.
- GLIBC change - Optimize locking loops to increase priority when lock is taken, decrease when lock is tried but missed, and return priority when lock is released.
- If there is no accompanying kernel change, the GLIBC patch is still beneficial. If a thread is in a critical section that is long enough for a context switch to take place, the critical section is probably too long anyway, so the return to the old-normal priority is OK.
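The locking-loop idea can be sketched with a small spinlock. Everything here is an assumption made for illustration: the type and function names are hypothetical, the lock is a C11 atomic_flag rather than a glibc lowlevellock, and the priority hints use the "or Rx,Rx,Rx" no-op forms (or 1,1,1 low, or 6,6,6 medium low, or 2,2,2 medium). Which levels user code may set, and whether medium is the right raised level, should be checked against the ISA. On other architectures the hints compile to nothing.

```c
#include <stdatomic.h>

#if defined(__powerpc__) || defined(__powerpc64__)
# define PRIO_LOW()        __asm__ volatile ("or 1,1,1")
# define PRIO_MEDIUM_LOW() __asm__ volatile ("or 6,6,6")
# define PRIO_MEDIUM()     __asm__ volatile ("or 2,2,2")
#else
# define PRIO_LOW()        ((void) 0)
# define PRIO_MEDIUM_LOW() ((void) 0)
# define PRIO_MEDIUM()     ((void) 0)
#endif

/* Hypothetical lock type for the sketch.  */
typedef struct { atomic_flag held; } prio_spinlock;

/* Returns 1 when the lock was taken, 0 when it was missed.  */
int prio_trylock(prio_spinlock *l)
{
    if (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        PRIO_LOW();          /* missed: give resources to the holder */
        return 0;
    }
    PRIO_MEDIUM();           /* taken: run the critical section faster */
    return 1;
}

void prio_lock(prio_spinlock *l)
{
    while (!prio_trylock(l))
        ;                    /* spin at lowered priority */
}

void prio_unlock(prio_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
    PRIO_MEDIUM_LOW();       /* released: return to the normal level */
}
```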
Build Wide Character Strings Functions With -O3 Optimization
Memcpy Optimizations for 32-bit and 64-bit Cell
Fix Excessive Implies Files
- Removing excessive Implies files from the powerpc sysdep directories should return the search order to some semblance of hierarchical sanity.
Remove no-fpu Context Routine Save and Restore of fprs
Currently the context routines reside in powerpc/powerpc[32|64]/ and they contain saving and restoring of fprs and the fpscr.
This breaks soft-float. This should be fixed so that the fpr and fpscr save and restore are moved into the sysdeps powerpc/powerpc[32|64]/fpu directory.