This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] powerpc: Optimized st{r,p}ncpy for POWER8/PPC64


On 06-01-2015 16:53, Ondřej Bílka wrote:
> On Thu, Jan 01, 2015 at 05:14:52PM -0200, Adhemerval Zanella wrote:
>> This patch adds an optimized POWER8 st{r,p}ncpy using unaligned accesses.
>> It shows 10%-80% improvement over the optimized POWER7 one that uses
>> only aligned accesses, especially on unaligned inputs.
>>
> This could hurt overall performance as strncpy/stpncpy are rarely used
> and should be avoided where performance is a concern. Optimizing for cache
> usage would be better; I would not be surprised if the runtime cost were
> dominated by the time it takes to read these functions from disk and get
> them into cache.

I disagree, since the current powerpc code does byte-by-byte accesses
in unaligned cases, which drags down performance a lot there (and I am
also aware that these are not the dominant cases).  I am likewise aware
that this function should not be used where performance matters (mainly
because of the final zero-pad); however, without any real usage data showing
that cache is the dominant factor, I would like to include this optimization.

>
> Also I would compare this with generic strnlen+memcpy to call power8
> versions when possible.

Yes, I checked against a simple one:

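#include <string.h>

/* Reference version: strnlen + memcpy, then zero-pad the tail.  */
char *
__strncpy_test (char *d, const char *s, size_t n)
{
  size_t size = strnlen (s, n);      /* First pass: length, capped at n.  */
  memcpy (d, s, size);               /* Second pass: copy the bytes.  */
  if (size < n)
    memset (d + size, 0, n - size);  /* Zero-pad up to n bytes.  */
  return d;
}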

And the patch's implementation is indeed faster, mainly because it avoids accessing
the stream twice: it does the copy while checking for the null byte and the length.
The strnlen+memcpy version only shows better performance when the zero-pad
is very large (more than 512 bytes) compared to the string size. This is mainly because
the POWER7/POWER8 memset uses VSX instructions instead of just doubleword loads/stores.
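
To make the one-pass versus two-pass distinction concrete, here is a minimal
sketch in portable C. It is only an illustration of the idea and not the POWER8
assembly from the patch: the one-pass variant copies byte by byte for clarity,
whereas the real code works on wider units. The names strncpy_two_pass,
strncpy_one_pass and same_result are hypothetical and exist only for this
example.

```c
#define _POSIX_C_SOURCE 200809L  /* for strnlen */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two-pass reference: strnlen scans the source, then memcpy reads it
   a second time.  */
static char *
strncpy_two_pass (char *d, const char *s, size_t n)
{
  size_t size = strnlen (s, n);
  memcpy (d, s, size);
  if (size < n)
    memset (d + size, 0, n - size);
  return d;
}

/* One-pass sketch: each source byte is tested for NUL and copied in the
   same loop, so the source is read only once.  Byte-at-a-time for
   clarity; this is an assumption-laden illustration, not the patch.  */
static char *
strncpy_one_pass (char *d, const char *s, size_t n)
{
  size_t i = 0;
  while (i < n && s[i] != '\0')
    {
      d[i] = s[i];
      i++;
    }
  if (i < n)
    memset (d + i, 0, n - i);  /* Zero-pad the remainder.  */
  return d;
}

/* Hypothetical helper: returns 1 when both variants fill the first n
   bytes of their destination identically (n must be at most 64).  */
static int
same_result (const char *s, size_t n)
{
  char a[64], b[64];
  assert (n <= sizeof a);
  memset (a, 'x', sizeof a);  /* Different fill bytes, so a match */
  memset (b, 'y', sizeof b);  /* cannot come from leftover data.  */
  strncpy_two_pass (a, s, n);
  strncpy_one_pass (b, s, n);
  return memcmp (a, b, n) == 0;
}
```

Both variants end with the same memset for the zero-pad, which is where a
vectorized memset can pay off once the pad is large relative to the string.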

>
> Main question is why there is no power8 memcpy using unaligned loads yet?
>
> Memcpy is called about a hundred times more often than strcpy (and there
> are no strncpy calls) on my computer, so the possible gains are bigger, and
> with an optimized memcpy a generic strncpy will be faster as well.

Mainly because powerpc still triggers kernel traps when issuing VMX/VSX instructions
on non-cacheable memory. That's why I pushed commit
87868c2418fb74357757e3b739ce5b76b17a8929, by the way.

Although this is not really an issue in 99% of cases, where memory is cacheable,
some code (especially libdrm and xorg) uses memcpy (and possibly memset) on DMA-mapped
memory.  That is why memcpy/memset for POWER8 still use aligned accesses all
the time.

