Scheduling x86 dispatch windows

Jeff Law
Thu Jun 10 20:03:00 GMT 2010

On 06/10/10 13:52, H.J. Lu wrote:
> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
> <>  wrote:
>> Cross-posting Reza's call for feedback to the binutils list since it
>> is relevant -
>> see the last few paragraphs regarding how to "solve the alignment problem".
>> Original thread:
>> Not sure if followups should occur on one list or both.
>> --
>> Quentin Neill
>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<>  wrote:
>>> Hi,
>>> We are in the process of adding a feature to GCC to take advantage
>>> of a new hardware feature in the latest AMD micro processor. This
>>> feature requires a certain mix, ordering and alignments in
>>> instruction sequences to obtain the expected hardware performance.
>>> I am asking the community to review this high level implementation
>>> design and give me direction or advice.
>>> The new hardware issues two windows of the size N bytes of
>>> instructions in every cycle. It goes into accelerate mode if the
>>> windows have the right combination of instructions or alignments. Our
>>> goal is to maximize the IPC by proper instruction scheduling and
>>> alignments.
>>> Here is a summary of the most important requirements:
>>> a) Maximum of N instructions per window.
>>> b) An instruction may cross the first window.
>>> c) Each window can have maximum of x memory loads and y memory
>>>     stores .
>>> d) The total number of immediate constants in the instructions
>>>     of a window should not exceed k.
>>> e) The first window must be aligned on 16 byte boundary.
>>> f) A Window set terminates when a branch exists in a window.
>>> g) The number of allowed prefixes varies for instructions.
>>> h) A window set needs to be padded by prefixes in instructions
>>>     or terminated by nops to ensure adherence to the rules.
>>> We have the following implementation plan for GCC:
>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>     instructions for the two dispatch windows. The scheduler is called
>>>     once before and once after register allocation as usual. In both
>>>     cases it performs dispatch scheduling along with its normal job of
>>>     instruction scheduling.
>>> The advantage of doing it before register allocation is avoiding
>>> extra dependencies caused by register allocation which may become
>>> an obstacle to movement of instructions.  The advantage of doing
>>> it after register allocation is a consideration for spilling code
>>> which may be generated by the register allocator.
>>> The algorithm we use is:
>>> a) Considering the current dispatch window set, choose the first
>>>     instruction from ready queue that does not violate dispatch rules.
>>> b) When an instruction is selected and scheduled, inform the
>>>     dispatcher code about the instruction. This step keeps track of the
>>>     instruction content of windows for future evaluation. It also manages
>>>     the window set by closing and opening new virtual dispatch windows.
>>> 2) Insertion of alignment code.
>>> In x86 alignment is done by inserting prefixes or by generating
>>> nops. As the object code is generated by the assembler in GCC, some
>>> information such as sizes of branches are unknown until assembly or
>>> link time. To do alignments related to dispatch correctly in GCC,
>>> we need to iteratively compute prefixes and branch sizes until
>>> its convergence. This pass currently does not exist in GCC, but it
>>> exists in the assembler.
>>> There are two possible approaches to solve alignment problem.
>>> a)  Let the assembler performs the alignments and padding needed
>>>      to adhere with the new machine dispatching rules and avoid an extra
>>>      pass in GCC.
>>> b)  Add a new pass to mimic what assembler does before generating
>>>      the assembly listing in GCC and insert the required alignments.
>>> I appreciate your comments on the proposed implementation procedure
>>> and the choices a or b above.
> I don't this should be done in assembler. Assembler should just assemble
> the assembly input.
That adds quite a bit of complication to the compiler though -- getting 
the instruction lengths right (and thus proper packing & alignment) can 
be extremely difficult.  I did some experiments with this on a target 
with *fixed* instruction lengths a while back and even though the port 
tried hard to get lengths right, it would routinely miss something.  
Ultimately I decided that it forcing the compiler to know instruction 
lengths with a very high degree of accuracy wasn't a sane thing to 
do.    Dealing with variable instruction lengths just adds yet another 
complexity to the situation.  Then add the complication of needing to 
add specific prefixes or nops and it just gets downright ugly.

I'd probably approach this by having the compiler emit a directive which 
states what the desired alignment at a particular point should be, then 
allow the assembler to select the best method to get the desired alignment.


More information about the Binutils mailing list