[PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables

liqingqing liqingqing3@huawei.com
Thu May 28 13:47:03 GMT 2020


Hi Lu, thank you for your comment.
The REP_STOSB_THRESHOLD value of 2M suits the hardware platform I used.
Because I do not have any other x86 environments, I can't be sure this change is good for all of them, so you are right.


On 2020/5/28 19:56, H.J. Lu wrote:
> On Fri, May 22, 2020 at 9:37 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
>>>
>>> Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimizes the implementation of memset
>>> and sets the default value of the macro REP_STOSB_THRESHOLD to 2KB. When the input length is less than 2KB, the data flow is unchanged; when the input length is larger than 2KB,
>>> memset uses STOSB instead of MOVQ.
>>>
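
The size-based dispatch described above can be modelled in a few lines. This is only a simplified sketch, not the glibc assembly: the function name `memset_path` and the path labels are made up for illustration, and treating a length exactly equal to the threshold as "not above" is an assumption.

```python
# Simplified model of the threshold dispatch described above.
# Hypothetical names; the real logic lives in glibc's x86 assembly.
KIB = 1024


def memset_path(size: int, rep_stosb_threshold: int = 2 * KIB) -> str:
    """Return which store strategy this model picks for a given length."""
    # Assumption: only lengths strictly above the threshold use rep stosb.
    if size > rep_stosb_threshold:
        return "rep stosb"
    return "regular stores"


if __name__ == "__main__":
    sizes = [("1k", 1 * KIB), ("4k", 4 * KIB), ("36k", 36 * KIB), ("10m", 10 * KIB * KIB)]
    for label, size in sizes:
        with_2k = memset_path(size)
        with_1m = memset_path(size, KIB * KIB)
        print(f"{label:>4}: threshold 2KB -> {with_2k}, threshold 1MB -> {with_1m}")
```

With the 2KB default, every benchmark length from 4k upward lands on the rep stosb path in this model; with a 1MB threshold only the 10m case does.
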
>>> However, when I tested this API on an x86_64 platform,
>>> I found that this default value is not appropriate for some input lengths. Here are the environment and results:
>>>
>>> test suite: libMicro-0.4.0
>>>         ./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
>>>         ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
>>>         ./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
>>>         ./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000
>>>
>>> hardware platform:
>>>         Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
>>>         L1d cache: 32KB
>>>         L1i cache: 32KB
>>>         L2 cache: 1MB
>>>         L3 cache: 60MB
>>>
>>> The result is that when the input length is between the sizes of the processor's L1 data cache and L2 cache, REP_STOSB_THRESHOLD=2KB reduces performance.
>>>
>>>                 before this commit      after this commit
>>>                 (cycles)                (cycles)
>>> memset_4k       249                     96
>>> memset_10k      657                     185
>>> memset_36k      2773                    3767
>>> memset_100k     7594                    10002
>>> memset_500k     37678                   52149
>>> memset_1m       86780                   108044
>>> memset_10m      1307238                 1148994
>>>
>>>                 before this commit      after this commit
>>>                 MLC cache miss (10sec)  MLC cache miss (10sec)
>>> memset_4k       1,09,33,823             1,01,79,270
>>> memset_10k      1,23,78,958             1,05,41,087
>>> memset_36k      3,61,64,244             4,07,22,429
>>> memset_100k     8,25,33,052             9,31,81,253
>>> memset_500k     37,32,55,449            43,56,70,395
>>> memset_1m       75,16,28,239            88,29,90,237
>>> memset_10m      9,36,61,67,397          8,96,69,49,522
>>>
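
To make the cycle table above easier to read, the relative change per size can be computed directly from the quoted numbers. This is a small sketch of that arithmetic; the helper names `percent_change` and `regressions` are illustrative only, and the data is copied from the table without re-measuring anything.

```python
# Cycle counts copied from the table above: (before commit, after commit).
CYCLES = {
    "memset_4k": (249, 96),
    "memset_10k": (657, 185),
    "memset_36k": (2773, 3767),
    "memset_100k": (7594, 10002),
    "memset_500k": (37678, 52149),
    "memset_1m": (86780, 108044),
    "memset_10m": (1307238, 1148994),
}


def percent_change(before: int, after: int) -> float:
    """Relative change in cycles; a positive value means the run got slower."""
    return (after - before) / before * 100.0


def regressions(table: dict) -> list:
    """Names of the cases whose cycle count increased after the commit."""
    return [name for name, (before, after) in table.items() if after > before]


if __name__ == "__main__":
    for name, (before, after) in CYCLES.items():
        print(f"{name:<12} {percent_change(before, after):+7.1f}%")
    print("slower after the commit:", ", ".join(regressions(CYCLES)))
```
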
>>>
>>> REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
>>> but I think the default value may not be the best one, because most processors' L2 cache is larger than 2KB. So I submit the patch below:
>>>
>>>
>>>
>>> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
>>> From: liqingqing <liqingqing3@huawei.com>
>>> Date: Thu, 21 May 2020 11:23:06 +0800
>>> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
>>> The default value of the macro REP_STOSB_THRESHOLD reduces memset performance when the input length is between the sizes of the processor's L1 data cache and L2 cache,
>>> so update the default value to eliminate the regression.
>>>
>>
>> There is no single threshold value which is good for all workloads.
>> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
>> On the other hand, the fixed threshold isn't flexible.  Please try this
>> patch to see if you can set the threshold for your specific workload.
>>
> 
> Any comments, objections?
> 
> https://sourceware.org/pipermail/libc-alpha/2020-May/114281.html
> 
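
For trying a per-workload threshold without rebuilding, a benchmark can be launched with the GLIBC_TUNABLES environment variable set. The sketch below assumes the tunable is exposed as `glibc.cpu.x86_rep_stosb_threshold`, which is the name documented in later glibc manuals; the name used by the patch under discussion may differ. The helpers `tunables_env` and `run_with_threshold` are hypothetical.

```python
import os
import subprocess

# Assumption: tunable name as documented in later glibc manuals;
# the patch discussed in this thread may use a different name.
TUNABLE = "glibc.cpu.x86_rep_stosb_threshold"


def tunables_env(threshold_bytes: int, base_env=None) -> dict:
    """Return a copy of the environment with the threshold tunable added."""
    env = dict(os.environ if base_env is None else base_env)
    setting = f"{TUNABLE}={threshold_bytes}"
    existing = env.get("GLIBC_TUNABLES")
    # GLIBC_TUNABLES is a colon-separated list of name=value pairs.
    env["GLIBC_TUNABLES"] = f"{existing}:{setting}" if existing else setting
    return env


def run_with_threshold(cmd: list, threshold_bytes: int):
    """Run a benchmark command with the given threshold (hypothetical helper)."""
    return subprocess.run(cmd, env=tunables_env(threshold_bytes), check=True)


if __name__ == "__main__":
    # Only print the setting here; pass a real command to run_with_threshold
    # to launch a benchmark with it.
    print(tunables_env(1024 * 1024, base_env={})["GLIBC_TUNABLES"])
```

A tunable set this way only takes effect with a glibc that actually implements it.
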


