This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



[RFC x86_64] patch for improving selection order for string and memory routines.


Hi,

This is an RFC patch to improve the selection logic for string and memory routines on the x86_64 architecture.

Why is this required? (Here, "selection logic" means the selection order defined in the IFUNC functions of the string and memory routines in GLIBC.)
1) For a few routines, the current selection logic does not select the faster version on AMD CPUs even though the CPU supports it.
2) It does not allow fast implementations to be chosen for both old and new generation CPUs.
3) Updating the selection order for newer CPUs may require a lot of testing to verify the new order on older CPUs. Otherwise,
   the new selection order may cause performance degradation, or slower versions may be selected even though a faster one is supported.

I propose the following approach to address these issues. I have tested the attached RFC patch for the memset and memcmp
routines on Excavator and Haswell machines.

1) Define a structure array holding the details of all implementations, and map each implementation to its respective
   CPU or ARCH flags found in GLIBC. Each row in the array contains the details of the multi-versioned functions and their
   CPU or ARCH flags for one string or memory routine. Each string and memory routine is given an ID, and
   these IDs are used to refer back to its row.

------------- EXAMPLE --------------

const cpu_selection string_implementations[TOTAL_STRING_ROUTINES] =  {
  /* Memset */
  [index_MEMSET]=
    {
      .n_implementation= 2,
      [index_MEMSET_SSE2] = /* SSE2 specific details*/
      [index_MEMSET_AVX2_Unaligned] = /* AVX2 specific details*/
    },
  /* MEMCMP */
  [index_MEMCMP]=
    {
      .n_implementation= 3,
      [index_MEMCMP_SSE2] = /* SSE2 specific details*/
      [index_MEMCMP_SSSE3] = /* SSSE3 specific details*/
      [index_MEMCMP_SSE4_1] = /* SSE4_1 specific details*/
    },
  ...,
  ...
};

-------------------------------------
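The example above is schematic. As a minimal compilable sketch of the same layout, the table could look like the block below. The `implementation` struct, the `impl` member, the `FEATURE_*` bits, the variant name strings and the `count_usable` helper are all assumptions made for illustration; they are not taken from the attached patch or from GLIBC's real CPU/ARCH flag macros.

```c
/* Routine IDs, mirroring the index_* names used in this RFC.  */
enum { index_MEMSET, index_MEMCMP, TOTAL_STRING_ROUTINES };

/* Assumed upper bound on variants per routine (hypothetical).  */
#define MAX_IMPLEMENTATIONS 3

/* Hypothetical feature bits standing in for GLIBC's CPU/ARCH flags.  */
#define FEATURE_SSE2   0x1u
#define FEATURE_SSSE3  0x2u
#define FEATURE_SSE4_1 0x4u
#define FEATURE_AVX2   0x8u

typedef struct
{
  const char *name;       /* Label only; a real table would presumably
                             hold the function address.  */
  unsigned int required;  /* Feature bits this variant needs.  */
} implementation;

typedef struct
{
  unsigned int n_implementation;
  implementation impl[MAX_IMPLEMENTATIONS];
} cpu_selection;

static const cpu_selection string_implementations[TOTAL_STRING_ROUTINES] =
{
  [index_MEMSET] =
    {
      .n_implementation = 2,
      .impl =
        {
          [0] = { "memset_sse2", FEATURE_SSE2 },
          [1] = { "memset_avx2_unaligned", FEATURE_AVX2 },
        },
    },
  [index_MEMCMP] =
    {
      .n_implementation = 3,
      .impl =
        {
          [0] = { "memcmp_sse2", FEATURE_SSE2 },
          [1] = { "memcmp_ssse3", FEATURE_SSSE3 },
          [2] = { "memcmp_sse4_1", FEATURE_SSE4_1 },
        },
    },
};

/* Count how many variants of ROUTINE are usable with FEATURES.  */
static unsigned int
count_usable (int routine, unsigned int features)
{
  const cpu_selection *sel = &string_implementations[routine];
  unsigned int usable = 0;
  for (unsigned int i = 0; i < sel->n_implementation; i++)
    if ((sel->impl[i].required & features) == sel->impl[i].required)
      usable++;
  return usable;
}
```

Keeping the per-variant details in a nested member array is one way to make the designated initializers valid C; the patch may organize the rows differently.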

2) Define two functions, one for Intel and one for AMD. Each function defines the required order for its vendor's CPUs,
   based on family and model values, in a local array. This local array is two-dimensional: it is indexed by the string routine ID
   and the choice rank, and each entry holds the ISA-specific index of the chosen implementation (two types of indexes are defined:
   one for string and memory functions, and a second one for ISA-specific versions). If an unknown CPU is detected, fall back to the current order defined in GLIBC.

------------------------- CODE snippet -----------------------
#define index_MEMSET 0
#define index_MEMSET_SSE2 0
#define index_MEMSET_AVX2_Unaligned 1

#define index_MEMCMP 1
#define index_MEMCMP_SSE2 0
#define index_MEMCMP_SSSE3 1
#define index_MEMCMP_SSE4_1 2
...
...
...
   
--------- Local array definition -----------
unsigned int s_order[Total_string_functions][Max_implementations];

s_order[index_MEMSET][FIRST_CHOICE] = 1;   /* AVX2 unaligned */
s_order[index_MEMSET][SECOND_CHOICE] = 0;  /* SSE2 */

s_order[index_MEMCMP][FIRST_CHOICE] = 2;   /* SSE4_1 */
s_order[index_MEMCMP][SECOND_CHOICE] = 1;  /* SSSE3 */
s_order[index_MEMCMP][THIRD_CHOICE] = 0;   /* SSE2 */
---------------------------------------------
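To make step 2 more concrete, here is a rough sketch of a vendor-specific function that fills the local order array from the family value and falls back to a default order for unknown CPUs. The names `order_table`, `amd_order`, `default_order`, `NO_CHOICE` and `EXAMPLE_FAMILY` are hypothetical, the family value is an arbitrary example, and the fallback order is only a placeholder for the order GLIBC currently uses. The sketch uses `int` entries so that unused slots can carry a sentinel.

```c
enum { index_MEMSET, index_MEMCMP, TOTAL_STRING_ROUTINES };
enum { FIRST_CHOICE, SECOND_CHOICE, THIRD_CHOICE, MAX_IMPLEMENTATIONS };

#define NO_CHOICE (-1)        /* Marks an unused slot (hypothetical).  */
#define EXAMPLE_FAMILY 0x15u  /* Arbitrary example value, not from the patch.  */

typedef int order_table[TOTAL_STRING_ROUTINES][MAX_IMPLEMENTATIONS];

/* Mark every slot as unused.  */
static void
clear_order (order_table order)
{
  for (int r = 0; r < TOTAL_STRING_ROUTINES; r++)
    for (int c = 0; c < MAX_IMPLEMENTATIONS; c++)
      order[r][c] = NO_CHOICE;
}

/* Placeholder for "the current order defined in GLIBC": here it
   simply asks for variant 0 of every routine.  */
static void
default_order (order_table order)
{
  clear_order (order);
  for (int r = 0; r < TOTAL_STRING_ROUTINES; r++)
    order[r][FIRST_CHOICE] = 0;
}

/* Fill ORDER for an AMD CPU.  Returns 1 if the CPU was recognized,
   0 if the fallback order was used.  MODEL is ignored in this sketch.  */
static int
amd_order (unsigned int family, unsigned int model, order_table order)
{
  (void) model;
  if (family != EXAMPLE_FAMILY)
    {
      default_order (order);
      return 0;
    }
  clear_order (order);
  order[index_MEMSET][FIRST_CHOICE] = 1;   /* AVX2 unaligned */
  order[index_MEMSET][SECOND_CHOICE] = 0;  /* SSE2 */
  order[index_MEMCMP][FIRST_CHOICE] = 2;   /* SSE4_1 */
  order[index_MEMCMP][SECOND_CHOICE] = 1;  /* SSSE3 */
  order[index_MEMCMP][THIRD_CHOICE] = 0;   /* SSE2 */
  return 1;
}
```

An Intel counterpart would have the same shape with its own family and model checks.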


3) These two functions call a common function, passing it this local array. The common function verifies the given order and
   tests for CPU or ARCH flag support. An implementation is selected if its support is found, and its index is stored in another global array
   at the string index ID of each function.
4) Define another function that is called by each string and memory IFUNC function with its respective index ID. This function reads
   the global array at that index; the value stored there is used to retrieve the implementation details and return the function address.
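Steps 3 and 4 can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the code in the attached patch: `apply_order`, `lookup_implementation`, `selected`, the `FEATURE_*` bits and the name strings are hypothetical, the first supported entry in the order is taken as the selection, variant 0 (SSE2) is assumed to be always usable on x86_64, and the lookup returns a label where the real function would return the implementation's address.

```c
enum { index_MEMSET, index_MEMCMP, TOTAL_STRING_ROUTINES };

#define MAX_IMPLEMENTATIONS 3
#define NO_CHOICE (-1)

/* Hypothetical feature bits standing in for GLIBC's CPU/ARCH flags.  */
#define FEATURE_SSE2   0x1u
#define FEATURE_SSSE3  0x2u
#define FEATURE_SSE4_1 0x4u
#define FEATURE_AVX2   0x8u

typedef struct
{
  const char *name;       /* Stand-in for the function address.  */
  unsigned int required;  /* Feature bits this variant needs.  */
} implementation;

typedef struct
{
  int n_implementation;
  implementation impl[MAX_IMPLEMENTATIONS];
} cpu_selection;

static const cpu_selection string_implementations[TOTAL_STRING_ROUTINES] =
{
  [index_MEMSET] = { 2, { { "memset_sse2", FEATURE_SSE2 },
                          { "memset_avx2_unaligned", FEATURE_AVX2 } } },
  [index_MEMCMP] = { 3, { { "memcmp_sse2", FEATURE_SSE2 },
                          { "memcmp_ssse3", FEATURE_SSSE3 },
                          { "memcmp_sse4_1", FEATURE_SSE4_1 } } },
};

/* Global result: the chosen variant index for each routine.  */
static int selected[TOTAL_STRING_ROUTINES];

/* Step 3: walk the requested order, skip invalid or unsupported
   entries, and record the first usable variant per routine.  */
static void
apply_order (int order[TOTAL_STRING_ROUTINES][MAX_IMPLEMENTATIONS],
             unsigned int features)
{
  for (int r = 0; r < TOTAL_STRING_ROUTINES; r++)
    {
      const cpu_selection *sel = &string_implementations[r];
      selected[r] = 0;  /* Baseline variant, assumed always usable.  */
      for (int c = 0; c < MAX_IMPLEMENTATIONS; c++)
        {
          int idx = order[r][c];
          if (idx < 0 || idx >= sel->n_implementation)
            continue;   /* Unused slot or out-of-range index.  */
          if ((sel->impl[idx].required & features) == sel->impl[idx].required)
            {
              selected[r] = idx;
              break;
            }
        }
    }
}

/* Step 4: what an IFUNC function would call with its routine ID.  */
static const char *
lookup_implementation (int routine)
{
  return string_implementations[routine].impl[selected[routine]].name;
}
```

Because the selection is computed once and stored, each IFUNC function only needs a single array read at resolution time.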

Advantages of this method:
1) Allows a separate selection order for each CPU.
2) Allows the faster version to be selected for that CPU.
3) I also observed that no single implementation is fastest across all string sizes (small vs. larger string sizes).
   Based on an application's requirements, a particular implementation could be selected through an environment variable (tunable).
   This is not implemented yet but can be made possible.
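Since the override in point 3 is not implemented, the following is only a sketch of how such an override might be parsed. The function names and the idea of passing a plain numeric index in an environment variable are assumptions for illustration; this is not GLIBC's tunables interface. Invalid or out-of-range values leave the current selection untouched.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse VALUE as a variant index.  Return it if it is a valid index
   below N_IMPLEMENTATION, otherwise return CURRENT unchanged.  */
static int
parse_override (const char *value, int n_implementation, int current)
{
  if (value == NULL || *value == '\0')
    return current;

  char *end;
  errno = 0;
  long idx = strtol (value, &end, 10);

  /* Reject overflow, trailing characters and out-of-range indexes.  */
  if (errno != 0 || *end != '\0' || idx < 0 || idx >= n_implementation)
    return current;
  return (int) idx;
}

/* Read the override from the environment variable VARIABLE
   (the variable name is chosen by the caller in this sketch).  */
static int
choice_from_environment (const char *variable, int n_implementation,
                         int current)
{
  return parse_override (getenv (variable), n_implementation, current);
}
```

A real version would still need to confirm that the CPU supports the requested implementation before using it.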

Please give your feedback on this.

Thanks,
Amit

Attachment: 0001-RFC-x86_64-patch-to-improve-selection-order-for-stri.patch

