pthread_cond performance Discussion
liqingqing
liqingqing3@huawei.com
Sat May 23 04:04:54 GMT 2020
Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimizes the implementation of memset
and sets the default value of the macro REP_STOSB_THRESHOLD to 2KB. When the input length is less than 2KB the data flow is unchanged, and when the input length is larger than 2KB,
memset uses STOSB instead of MOVQ.
However, when I tested memset on an x86_64 platform,
I found that this default value is not appropriate for some input lengths. Here are the environment and the results.
test suite: libMicro-0.4.0
./memset -E -C 200 -L -S -W -N "memset_4k" -s 4k -I 250
./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k -u -I 400
./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
./memset -E -C 200 -L -S -W -N "memset_10m" -s 10m -I 2000000
hardware platform:
Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
L1d cache: 32KB
L1i cache: 32KB
L2 cache: 1MB
L3 cache: 60MB
The result is that when the input length is between the processor's L1 data cache size and its L2 cache size, REP_STOSB_THRESHOLD=2KB reduces performance.
              before this commit   after this commit
              cycles               cycles
memset_4k     249                  96
memset_10k    657                  185
memset_36k    2773                 3767
memset_100k   7594                 10002
memset_500k   37678                52149
memset_1m     86780                108044
memset_10m    1307238              1148994

              before this commit      after this commit
              MLC cache miss(10sec)   MLC cache miss(10sec)
memset_4k     1,09,33,823             1,01,79,270
memset_10k    1,23,78,958             1,05,41,087
memset_36k    3,61,64,244             4,07,22,429
memset_100k   8,25,33,052             9,31,81,253
memset_500k   37,32,55,449            43,56,70,395
memset_1m     75,16,28,239            88,29,90,237
memset_10m    9,36,61,67,397          8,96,69,49,522
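
To make it easier to see which of the measured sizes change path, here is a rough C sketch of the threshold decision only. It is a simplification and not the glibc code: the real selection is done in assembly in memset-vec-unaligned-erms.S, and small lengths are handled separately there. The enum and the helper choose_memset_path are hypothetical names used only for illustration. I am assuming that a length strictly above the threshold takes the REP STOSB path.

```c
#include <stddef.h>

/* Same override pattern as the macro in the patch: a value passed at
   build time with -DREP_STOSB_THRESHOLD=... replaces the default.  */
#ifndef REP_STOSB_THRESHOLD
# define REP_STOSB_THRESHOLD 2048
#endif

/* Hypothetical labels for the two store strategies discussed here.  */
enum memset_path
{
  PATH_VECTOR_LOOP,   /* ordinary store loop */
  PATH_REP_STOSB      /* Enhanced REP STOSB */
};

/* Hypothetical helper: report which path a given length would take.
   Assumption: lengths strictly above the threshold use REP STOSB.  */
static enum memset_path
choose_memset_path (size_t len, size_t threshold)
{
  return len > threshold ? PATH_REP_STOSB : PATH_VECTOR_LOOP;
}
```

With a threshold of 2048, every size in the tables above (4k to 10m) would take the REP STOSB path. With 1048576, only memset_10m would; memset_1m is exactly 1048576 bytes, so under the strict comparison it would stay on the store loop.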
Although REP_STOSB_THRESHOLD can be changed at build time with -DREP_STOSB_THRESHOLD=xxx,
I think the current default value is not a good one, because the L2 cache of most processors is larger than 2KB. So I am submitting the patch below:
>From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
From: liqingqing <liqingqing3@huawei.com>
Date: Thu, 21 May 2020 11:23:06 +0800
Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
macro REP_STOSB_THRESHOLD's value will reduce memset performance when input length is between processor's L1 data cache and L2 cache.
so update the default value to eliminate the decrement.
---
sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index dcd63c92..92c08eed 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -65,7 +65,7 @@
Enhanced REP STOSB. Since the stored value is fixed, larger register
size has minimal impact on threshold. */
#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD 2048
+# define REP_STOSB_THRESHOLD 1048576
#endif
#ifndef SECTION
--
2.19.1