Bug 26091 - strcpy cost more time in glibc-2.31
Summary: strcpy cost more time in glibc-2.31
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: string
Version: 2.31
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-08 02:04 UTC by JinhuiGuo
Modified: 2020-06-22 20:28 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
I reduced the bug to a stand-alone test case, now attached. (470 bytes, text/x-csrc)
2020-06-08 02:04 UTC, JinhuiGuo

Description JinhuiGuo 2020-06-08 02:04:15 UTC
Created attachment 12602
I reduced the bug to a stand-alone test case, now attached.

When I use strcpy to copy ten bytes of data, it takes 70 ns with glibc-2.31 but 53 ns with glibc-2.29. I found that the difference is related to the address of strcpy itself: when strcpy starts at a 32-byte-aligned address, each call takes less time than when it is only 16-byte aligned.

------------------------------------------------------------------------
testcase              address    alignment (bytes)    time (ns)
------------------------------------------------------------------------
strcpy_10_libmicro    0x95AF0    16                   70.48611
strcpy_10_libmicro    0x95C90    16                   69.54695
strcpy_10_libmicro    0x95C10    16                   69.0097
strcpy_10_libmicro    0x95AE0    32                   53.42931
strcpy_10_libmicro    0x95B00    32                   53.28875
strcpy_10_libmicro    0x95B20    32                   53.29308
strcpy_10_libmicro    0x95B40    32                   53.31686
strcpy_10_libmicro    0x95B60    32                   53.28691
------------------------------------------------------------------------

Should strcpy therefore be aligned to 32 bytes?

diff --git a/sysdeps/powerpc/powerpc32/strcpy.S b/sysdeps/powerpc/powerpc32/strcpy.S
index 0067e76..7a8badd 100644
--- a/sysdeps/powerpc/powerpc32/strcpy.S
+++ b/sysdeps/powerpc/powerpc32/strcpy.S
@@ -22,7 +22,7 @@
 
 /* char * [r3] strcpy (char *dest [r3], const char *src [r4])  */
 
-EALIGN (strcpy, 4, 0)
+EALIGN (strcpy, 5, 0)
 
 #define rTMP   r0
 #define rRTN   r3      /* incoming DEST arg preserved as result */
--
2.12.3
Comment 1 JinhuiGuo 2020-06-08 02:05:31 UTC
test case

I reduced the bug to a stand-alone test case, now attached.
Comment 2 JinhuiGuo 2020-06-08 02:06:02 UTC
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <string.h>

int s = 10;
int unaligned = 0;

void init_str(char *str)
{
	static char *demo =
		"The quick brown fox jumps over the lazy dog.";
	int l = strlen(demo);
	int i;
	for (i = 0; i < s; i++) {
		str[i] = demo[i % l];
	}

	str[s] = 0;
}

int main(void)
{
	int i;
	struct timespec tv;
	struct timespec tv1;
	
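	char *src2 = (char *)malloc(s + 1);
	char *src = (char *)malloc(s + 1 + unaligned);
	if (src == NULL || src2 == NULL)
		return 1;
	init_str(src2);
	src2 += unaligned;	/* optionally start the copy at an offset into the source */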
	
	clock_gettime(CLOCK_MONOTONIC, &tv);
	
	for (i = 0; i < 1100000; i += 10) {
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
		(void) strcpy(src, src2);
	}
	
	clock_gettime(CLOCK_MONOTONIC, &tv1);
	long long tmp = ((long long)tv1.tv_sec - (long long)tv.tv_sec) * 1000000000LL
		+ ((long long)tv1.tv_nsec - (long long)tv.tv_nsec);
	/* After the loop, i equals the total number of strcpy calls. */
	printf("cost: %f ns\n", ((double)tmp) / i);
	
	src2 -= unaligned;
	free(src);
	free(src2);
	
	return 0;
}
Comment 3 Adhemerval Zanella 2020-06-22 20:28:01 UTC
(In reply to JinhuiGuo from comment #0)
> Created attachment 12602 [details]
> I reduced the bug to a stand-alone test case, now attached.
> 
> When I use strcpy to copy ten byte of data, it takes 70ns in glibc-2.31
> while 53ns in glibc-2.29. I found it related to the address of strcpy. When
> the address of strcpy is 32-byte alignment, it takes less time than 16-byte
> alignment.
> 
> ------------------------------------------------------------------------
> testcase                 address           alignment          time(ns)
> ------------------------------------------------------------------------
> strcpy_10_libmicro       0x95AF0           16	              70.48611
> strcpy_10_libmicro	 0x95C90	   16	              69.54695
> strcpy_10_libmicro	 0x95C10	   16	              69.0097
> strcpy_10_libmicro	 0x95AE0	   32	              53.42931
> strcpy_10_libmicro	 0x95B00	   32	              53.28875
> strcpy_10_libmicro	 0x95B20	   32	              53.29308
> strcpy_10_libmicro	 0x95B40	   32	              53.31686
> strcpy_10_libmicro	 0x95B60	   32	              53.28691
> ------------------------------------------------------------------------

I am seeing the opposite on gcc203 (POWER8), where changing the alignment to 32 bytes (EALIGN (..., 5, 0)) increases the cost from ~11.64 to ~12.41 per call. This is using the provided benchmark.

In fact this is really micro-architecture dependent: icache alignment might or might not cause performance issues. GCC also seems to use different alignments depending on the target processor (-mcpu=xxx), and the default for powerX is '.p2align 4,,15'.

So before actually changing the default alignment, I would like to check that this is not a pessimization on generic powerpc32, as it seems to be on POWER.

> 
> Thus, should it be 32-byte alignment?
> 
>  14 diff --git a/sysdeps/powerpc/powerpc32/strcpy.S
> b/sysdeps/powerpc/powerpc32/strcpy.S
>  15 index 0067e76..7a8badd 100644
>  16 --- a/sysdeps/powerpc/powerpc32/strcpy.S
>  17 +++ b/sysdeps/powerpc/powerpc32/strcpy.S
>  18 @@ -22,7 +22,7 @@
>  19
>  20  /* char * [r3] strcpy (char *dest [r3], const char *src [r4])  */
>  21
>  22 -EALIGN (strcpy, 4, 0)
>  23 +EALIGN (strcpy, 5, 0)
>  24
>  25  #define rTMP   r0
>  26  #define rRTN   r3      /* incoming DEST arg preserved as result */
>  27 --
>  28 2.12.3