Bug 27227 - Memory corruption for altivec unaligned load / store
Summary: Memory corruption for altivec unaligned load / store
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: malloc
Version: 2.32
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-22 17:45 UTC by Adam Stylinski
Modified: 2021-01-22 22:23 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Description Adam Stylinski 2021-01-22 17:45:36 UTC
I'm not 100% sure whether this is a bug in assumptions made by the allocator or a data hazard I've always had.  I'm seeing memory corruption in the following scenario (I may try to produce a minimal test case, but if somebody can tell me what I'm doing is in fact dumb before then, I won't have to):

__vector float loadu_f32(float *v)
{
    __vector unsigned char permute = vec_lvsl(0, (unsigned char*)v);
    __vector unsigned char lo = vec_ld(0, (unsigned char*)v);
    __vector unsigned char hi = vec_ld(16, (unsigned char*)v);
    return (__vector float)vec_perm(lo, hi, permute);
}

float *ptr3d = (float*)_mm_malloc(3 * sizeof(float) * someSizeNotModulusOf4, 16);
size_t numIterations = someSizeNotModulusOf4 / 4;

/* assume ptr3d was populated at some point, in some scalar loop */

for (size_t i = 0; i < numIterations; ++i) {
    __vector float x = vec_ld(0, &ptr3d[4*i]);                            /* aligned column */
    __vector float y = loadu_f32(&ptr3d[someSizeNotModulusOf4 + 4*i]);    /* misaligned */
    __vector float z = loadu_f32(&ptr3d[2*someSizeNotModulusOf4 + 4*i]);  /* misaligned */
}

/* remainder peeling scalar loop goes here */


This loop structure seems to cause memory corruption, with assertions being thrown from malloc.  As far as I can tell, this is _the_ convention to use for unaligned loads on AltiVec-enabled powerpc machines that don't have VSX:
http://mirror.informatimago.com/next/developer.apple.com/hardware/ve/alignment.html

According to that document, so long as at least some bytes of the second half being loaded (hi) fall within the heap allocation, the load should be safe given the heap's alignment.  Now, this is Linux with glibc, not Mac OS X, but I swear this code worked cleanly before and doesn't seem to now.  Even weirder, if I drop in a malloc replacement that oversizes allocations by 16 bytes, it won't crash, but I still see some evidence of corruption after FFTW runs some AltiVec-enabled kernels.  When compiling with ASan, and without the oversized allocations, it catches the out-of-bounds heap load and complains loudly.  I could see that being a false positive for the second half of the load, but it doesn't explain why the unaligned loads and stores still cause corruption later when the allocations are oversized.
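
For reference, the variant I believe the Apple document describes for the very last vector of a run looks roughly like this; it's only a sketch for illustration (the function name is mine, not code from my program).  The idea, as I read it, is that vec_ld truncates its effective address to a 16-byte boundary, so using a byte offset of 15 instead of 16 for the hi load keeps the access inside the block that holds the last byte you actually need, even when the pointer happens to be aligned:

__vector float loadu_f32_last(float *v)
{
    /* Sketch only: same permute/merge scheme as loadu_f32 above, but the
       second aligned load uses offset 15 so it never touches bytes past
       v[3] when v happens to be 16-byte aligned. */
    __vector unsigned char permute = vec_lvsl(0, (unsigned char*)v);
    __vector unsigned char lo = vec_ld(0, (unsigned char*)v);
    __vector unsigned char hi = vec_ld(15, (unsigned char*)v);
    return (__vector float)vec_perm(lo, hi, permute);
}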

Strangely, LD_PRELOAD'ing Intel's TBB pool allocator proxy for malloc and free makes these symptoms disappear entirely (it's probably oversizing _all_ calls to malloc/posix_memalign).
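
The oversized-allocation drop-in I mentioned looks roughly like this (a sketch only; the wrapper name is made up for the example):

#include <stdlib.h>

/* Pad every 16-byte-aligned allocation by one extra quadword so an
   aligned 16-byte load that starts inside the block can never read
   past the end of it. */
static void *xmalloc_padded(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, size + 16) != 0)
        return NULL;
    return p;
}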

Did some behavior change with regard to this, or was this way of doing misaligned loads always hazardous and I just managed to get lucky?
Comment 1 Andreas Schwab 2021-01-22 18:11:08 UTC
There is not a single memory write in your example, thus any memory corruption must have happened somewhere else.  In any case, nothing here looks related to glibc.  Try valgrind.
Comment 2 Adam Stylinski 2021-01-22 18:21:59 UTC
(In reply to Andreas Schwab from comment #1)
> There is not a single memory write in your example, thus any memory
> corruption must have happened somewhere else.  In any case, nothing here
> looks related to glibc.  Try valgrind.

Sorry, that was definitely not a complete example; I was simply showing the access pattern.  The memory writes occur in similar loops, where the unaligned stores mirror the loads and are handled exactly as in Apple's documentation:

void StoreUnaligned( vector unsigned char src, void *target )
{
    vector unsigned char MSQ, LSQ;
    vector unsigned char mask, align, zero, neg1;

    MSQ = vec_ld(0, target);              // most significant quadword
    LSQ = vec_ld(16, target);             // least significant quadword
    align = vec_lvsr(0, target);          // create alignment vector
    zero = vec_splat_u8( 0 );             // Create vector full of zeros
    neg1 = vec_splat_s8( -1 );            // Create vector full of -1
    mask = vec_perm(zero, neg1, align);   // Create select mask
    src = vec_perm(src, src, align);      // Right rotate stored data
    MSQ = vec_sel( MSQ, src, mask );      // Insert data into MSQ part
    LSQ = vec_sel( src, LSQ, mask );      // Insert data into LSQ part
    vec_st( MSQ, 0, target );             // Store the MSQ part
    vec_st( LSQ, 16, target );            // Store the LSQ part
}
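
The call sites look roughly like this (illustrative only; the names and shapes here are made up for the example, reusing loadu_f32 and StoreUnaligned from above):

/* One iteration of the store side: the first column is 16-byte aligned
   and goes through vec_st, the other two columns are misaligned and go
   through the read-modify-write StoreUnaligned above. */
void store_columns(float *ptr3d, size_t n /* not a multiple of 4 */, size_t i,
                   __vector float x, __vector float y, __vector float z)
{
    vec_st(x, 0, &ptr3d[4*i]);                                   /* aligned column */
    StoreUnaligned((vector unsigned char)y, &ptr3d[n + 4*i]);    /* misaligned */
    StoreUnaligned((vector unsigned char)z, &ptr3d[2*n + 4*i]);  /* misaligned */
}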

I'm asking first if my access pattern for both loads and stores is faulty to begin with, or if something changed in glibc recently for the POWER ABI with regard to alignment.
Comment 3 Florian Weimer 2021-01-22 20:36:52 UTC
Is your program multi-threaded? The Apple document actually contains a warning regarding that.

The powerpc heap layout changed significantly due to the fix for bug 6527; maybe that's why your program appeared to work reliably before.
Comment 4 Adam Stylinski 2021-01-22 22:20:23 UTC
It can be, but not in the current test scenario I have.  Doing vec_ld(15, ptr) for the second half essentially means that the load will never span outside the heap in the event that the address just so happens to be aligned, no?

Basically the final unaligned loads in the last column are triggering ASan and in general are causing issues.  I suspect the logic for the unaligned stores is causing similar grief.  I _thought_ I had been doing exactly what Apple had mentioned here:

> Typically this means that a looping function will have to stop one loop iteration before it reaches the end of the data run, and handle the last few bytes in special case code. 

Is this now broken or had I been doing the remainder peeling incorrectly all along?
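
For concreteness, the loop shape I had in mind, with the remainder peeled into scalar code, is roughly this (a sketch with made-up names, reusing loadu_f32 and StoreUnaligned from above):

/* Sketch: stop the vector loop one full vector early, so the unaligned
   load/store at element i can touch the aligned quadword 16 bytes past
   it without leaving the allocation, then finish the remaining elements
   in scalar code. */
void copy_unaligned(float *dst, float *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 4) {
        __vector float v = loadu_f32(&src[i]);
        StoreUnaligned((vector unsigned char)v, &dst[i]);
    }
    for (; i < n; ++i)   /* scalar remainder */
        dst[i] = src[i];
}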
Comment 5 Adam Stylinski 2021-01-22 22:23:51 UTC
Apologies for my ignorance, by the way; I've been mostly spoiled lately by x86 and aarch64, which handle this for you to some degree in the microarchitecture.  I wrote this code a long time ago when I was just getting familiar with AltiVec, so I'm well aware that what I was doing could have been completely wrong and/or stupid.  At the very least I could have done more to pipeline the unaligned loads so that the previous vector load for the second half was reused (I don't think GCC can do this for me).

I'm hoping people who are intimately familiar with the ABI and heap layout on big-endian POWER4 can clarify whether I'm wrong, the library is wrong, or both.