faster memset

Mon May 26 23:24:00 GMT 2008

Eric Blake wrote:
> Aaron J. Grier <aaron <at> frye.com> writes:
> 
>> On Thu, May 22, 2008 at 04:56:54PM +0000, Eric Blake wrote:
>>> My patched assembly is no longer sensitive to alignment, and always
>>> gets the speed of 8-byte alignment.  This clinches it - for memset,
>>> x86 assembly is noticeably faster than C.
>> have you done comparisons with the builtin memset() in recent versions
>> of gcc?
>>
> 
> I was testing with gcc 3.4.4, which does have __builtin_memset.  But my 
> understanding is that __builtin_memset defers to the library function in cases 
> it cannot optimize at compile time.  At any rate, my test app called the 
> library function via a function pointer - does __builtin_memset even have an 
> address that can be used via a function pointer?
> 
> If I understand it correctly, __builtin_memset(ptr,0,8) is a good example of 
> where the compiler optimization helps (it is faster to open-code two 32-bit 
> writes than to call a function), in which case that is faster than anything I 
> can code in assembly.  But __builtin_memset(ptr,0,1000), even though 1000 is 
> constant, needs so many open-coded assignments that the compiler probably 
> falls back to the library routine anyway, trusting that the library knows 
> more architecture tricks for efficiency than can be represented generically 
> in gcc's builtin definition table.  Finally, __builtin_memset(ptr,0,len) 
> cannot be optimized this way, since len is not known at compile time, so the 
> compiler must fall back on the library.
> 
> In other words, by comparing against __builtin_memset, wouldn't I merely be 
> comparing against my own implementation for most of the interesting cases?
> 
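
To make the three cases above concrete, here is a minimal C sketch.  The 
helper names (clear8, clear1000, clear_n, clear_via_pointer) are made up for 
illustration and do not come from the test app discussed in this thread.  
Whether gcc open-codes any of these or calls the library depends on the gcc 
version, target and flags, so the comments describe what may happen; 
inspecting the output of gcc -S is the way to find out.

```c
#include <stddef.h>
#include <string.h>

/* Small constant length: the compiler may open-code this as a couple of
   stores instead of emitting a call. */
void clear8(void *p)
{
    memset(p, 0, 8);
}

/* Large constant length: the compiler may still decide that a call to the
   library routine is the better choice. */
void clear1000(void *p)
{
    memset(p, 0, 1000);
}

/* Length known only at run time: nothing to unroll at compile time. */
void clear_n(void *p, size_t n)
{
    memset(p, 0, n);
}

/* Taking the address of memset yields the library routine.  The pointer is
   volatile so the compiler has to load it and make an indirect call rather
   than treat the call as a builtin. */
static void *(*volatile memset_ptr)(void *, int, size_t) = memset;

void clear_via_pointer(void *p, size_t n)
{
    memset_ptr(p, 0, n);
}
```

All four helpers produce the same bytes; only the generated code may differ, 
which is why a benchmark that goes through a function pointer measures the 
library routine and not the builtin expansion.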

gcc for i386 uses __builtin_memset where it sees an opportunity to 
optimize code size.  In its default configuration, gcc for x86_64 calls 
the library function, except in the few cases you mention, where a small 
number of int stores is enough.  Only recently did glibc gain a memset() 
that performs well on long strings, agreed upon by developers from both 
AMD and Intel.  So it would be interesting to compare against that 
implementation.
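
One way such a comparison could be set up is sketched below in C.  This is 
only a rough harness under stated assumptions: time_memset and the memset_fn 
typedef are hypothetical names, clock() has coarse resolution, and a serious 
measurement would need many repetitions, a range of lengths and alignments, 
and some care about cache effects.  Any candidate with the memset signature 
(the library routine or a hand-written assembly version) can be passed in.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical type for any routine with the memset signature. */
typedef void *(*memset_fn)(void *, int, size_t);

/* Time `reps` calls of fn on buf+offset for len bytes; returns CPU seconds.
   `offset` lets the caller vary the alignment of the destination. */
double time_memset(memset_fn fn, unsigned char *buf, size_t offset,
                   size_t len, int c, long reps)
{
    /* Volatile pointer: forces an indirect call on every iteration so the
       compiler cannot inline the fill or drop the repeated stores. */
    memset_fn volatile f = fn;
    clock_t start = clock();
    long i;

    for (i = 0; i < reps; i++)
        f(buf + offset, c, len);

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling it as time_memset(memset, buf, 3, 1000, 0, 1000000) and again with 
offset 0 would show whether a given implementation is sensitive to alignment.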