[PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
liqingqing
liqingqing3@huawei.com
Thu May 28 13:47:03 GMT 2020
Hi Lu, thank you for your comment.
The REP_STOSB_THRESHOLD value of 2M suits the hardware platform that I used.
Because I do not have any other x86 environments, I can't be sure this change is good for all of them, so you are right.
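For readers following the thread, the size-based dispatch under discussion can be sketched roughly as below. This is only an illustration under stated assumptions, not glibc's implementation: the names rep_stosb_threshold, fill_rep_stosb, fill_store_loop and sketch_memset are hypothetical, the byte loop merely stands in for the vector-store path, and the inline assembly assumes a GCC-compatible compiler on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical runtime threshold in bytes (2KB, the default discussed here). */
static size_t rep_stosb_threshold = 2048;

/* Fill n bytes with c using "rep stosb" on x86; fall back to memset elsewhere. */
static void fill_rep_stosb(void *dst, int c, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)   /* destination and count registers */
                     : "a"(c)               /* fill byte is taken from AL */
                     : "memory");
#else
    memset(dst, c, n);
#endif
}

/* Stand-in for the store loop used at or below the threshold. */
static void fill_store_loop(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)c;
}

/* Returns 1 if the "rep stosb" path was taken, 0 otherwise. */
static int sketch_memset(void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);
    if (n > rep_stosb_threshold) {
        fill_rep_stosb(dst, c, n);
        return 1;
    }
    fill_store_loop(dst, c, n);
    return 0;
}
```

With a fixed threshold the crossover point is the same on every machine; making rep_stosb_threshold adjustable at run time is what lets a workload such as the one measured below pick its own value.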
On 2020/5/28 19:56, H.J. Lu wrote:
> On Fri, May 22, 2020 at 9:37 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
>>>
>>> Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimizes the implementation of memset
>>> and sets the default value of the macro REP_STOSB_THRESHOLD to 2KB. When the input length is less than 2KB, the data flow is unchanged; when the input length is larger than 2KB,
>>> this API uses STOSB instead of MOVQ.
>>>
>>> But when I tested this API on an x86_64 platform,
>>> I found that this default value is not appropriate for some input lengths. Here are the environment and the results:
>>>
>>> test suite: libMicro-0.4.0
>>> ./memset -E -C 200 -L -S -W -N "memset_4k" -s 4k -I 250
>>> ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k -u -I 400
>>> ./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
>>> ./memset -E -C 200 -L -S -W -N "memset_10m" -s 10m -I 2000000
>>>
>>> hardware platform:
>>> Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
>>> L1d cache: 32KB
>>> L1i cache: 32KB
>>> L2 cache: 1MB
>>> L3 cache: 60MB
>>>
>>> The result is that when the input length is between the processor's L1 data cache size and L2 cache size, REP_STOSB_THRESHOLD=2KB reduces performance.
>>>
>>>                before this commit    after this commit
>>>                cycle                 cycle
>>> memset_4k      249                   96
>>> memset_10k     657                   185
>>> memset_36k     2773                  3767
>>> memset_100k    7594                  10002
>>> memset_500k    37678                 52149
>>> memset_1m      86780                 108044
>>> memset_10m     1307238               1148994
>>>
>>>                before this commit       after this commit
>>>                MLC cache miss(10sec)    MLC cache miss(10sec)
>>> memset_4k      1,09,33,823              1,01,79,270
>>> memset_10k     1,23,78,958              1,05,41,087
>>> memset_36k     3,61,64,244              4,07,22,429
>>> memset_100k    8,25,33,052              9,31,81,253
>>> memset_500k    37,32,55,449             43,56,70,395
>>> memset_1m      75,16,28,239             88,29,90,237
>>> memset_10m     9,36,61,67,397           8,96,69,49,522
>>>
>>>
>>> Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
>>> I think the default value may not be the best one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:
>>>
>>>
>>>
>>> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
>>> From: liqingqing <liqingqing3@huawei.com>
>>> Date: Thu, 21 May 2020 11:23:06 +0800
>>> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
>>> The default value of the macro REP_STOSB_THRESHOLD reduces memset performance when the input length is between the processor's L1 data cache size and L2 cache size,
>>> so update the default value to eliminate the slowdown.
>>>
>>
>> There is no single threshold value which is good for all workloads.
>> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
>> On the other hand, the fixed threshold isn't flexible. Please try this
>> patch to see if you can set the threshold for your specific workload.
>>
>
> Any comments, objections?
>
> https://sourceware.org/pipermail/libc-alpha/2020-May/114281.html
>
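To make the quoted cycle counts easier to compare, the relative change per size can be computed as (after - before) / before. The small program below is a sketch that does this for the figures in the cycle table; the struct and function names are hypothetical, and the numbers are copied from the quoted message rather than re-measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the quoted cycle table. */
struct cycle_row {
    const char *name;
    double before;   /* cycles before the commit */
    double after;    /* cycles after the commit */
};

/* Values copied from the quoted benchmark results. */
static const struct cycle_row cycle_rows[] = {
    {"memset_4k",   249,     96},
    {"memset_10k",  657,     185},
    {"memset_36k",  2773,    3767},
    {"memset_100k", 7594,    10002},
    {"memset_500k", 37678,   52149},
    {"memset_1m",   86780,   108044},
    {"memset_10m",  1307238, 1148994},
};

/* Relative change in percent; positive means more cycles (slower). */
static double percent_change(double before, double after)
{
    assert(before > 0);
    return (after - before) / before * 100.0;
}

/* Print one line per table row with its relative change. */
void print_changes(void)
{
    size_t count = sizeof cycle_rows / sizeof cycle_rows[0];
    for (size_t i = 0; i < count; i++)
        printf("%-12s %+.1f%%\n", cycle_rows[i].name,
               percent_change(cycle_rows[i].before, cycle_rows[i].after));
}
```

Calling print_changes() shows negative values for the 4k, 10k and 10m cases and positive values for the sizes from 36k to 1m, which is the range the original report identifies as regressing.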