Scheduling x86 dispatch windows

Thu Jun 10 20:59:00 GMT 2010

On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
> On 06/10/10 13:52, H.J. Lu wrote:
>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>> <quentin.neill.gnu@gmail.com>  wrote:
>>> Cross-posting Reza's call for feedback to the binutils list since it
>>> is relevant - s ee the last few paragraphs regarding how to
>>> "solve the alignment problem".
>>>
>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>
>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>  wrote:
>>>> Hi,
>>>>
>>>> We are in the process of adding a feature to GCC to take advantage
>>>> of a new hardware feature in the latest AMD micro processor. This
>>>> feature requires a certain mix, ordering and alignments in
>>>> instruction sequences to obtain the expected hardware performance.
>>>>
>>>> I am asking the community to review this high level implementation
>>>> design and give me direction or advice.
>>>>
>>>> The new hardware issues two windows of the size N bytes of
>>>> instructions in every cycle. It goes into accelerate mode if the
>>>> windows have the right combination of instructions or alignments. Our
>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>> alignments.
>>>>
>>>> Here is a summary of the most important requirements:
>>>>
>>>> a) Maximum of N instructions per window.
>>>> b) An instruction may cross the first window.
>>>> c) Each window can have maximum of x memory loads and y memory
>>>>    stores .
>>>> d) The total number of immediate constants in the instructions
>>>>    of a window should not exceed k.
>>>> e) The first window must be aligned on 16 byte boundary.
>>>> f) A Window set terminates when a branch exists in a window.
>>>> g) The number of allowed prefixes varies for instructions.
>>>> h) A window set needs to be padded by prefixes in instructions
>>>>    or terminated by nops to ensure adherence to the rules.
>>>>
>>>> We have the following implementation plan for GCC:
>>>>
>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>    once before and once after register allocation as usual. In both
>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>    instruction scheduling.
>>>>
>>>> The advantage of doing it before register allocation is avoiding
>>>> extra dependencies caused by register allocation which may become
>>>> an obstacle to movement of instructions.  The advantage of doing
>>>> it after register allocation is a consideration for spilling code
>>>> which may be generated by the register allocator.
>>>>
>>>> The algorithm we use is:
>>>>
>>>> a) Considering the current dispatch window set, choose the first
>>>>    instruction from ready queue that does not violate dispatch rules.
>>>> b) When an instruction is selected and scheduled, inform the
>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>    instruction content of windows for future evaluation. It also manages
>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>
>>>> 2) Insertion of alignment code.
>>>>
>>>> In x86 alignment is done by inserting prefixes or by generating
>>>> nops. As the object code is generated by the assembler in GCC, some
>>>> information such as sizes of branches are unknown until assembly or
>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>> we need to iteratively compute prefixes and branch sizes until
>>>> its convergence. This pass currently does not exist in GCC, but it
>>>> exists in the assembler.
>>>>
>>>> There are two possible approaches to solve alignment problem.
>>>>
>>>> a)  Let the assembler performs the alignments and padding needed
>>>>     to adhere with the new machine dispatching rules and avoid an extra
>>>>     pass in GCC.
>>>> b)  Add a new pass to mimic what assembler does before generating
>>>>     the assembly listing in GCC and insert the required alignments.
>>>>
>>>> I appreciate your comments on the proposed implementation procedure
>>>> and the choices a or b above.
>>>>
>>
>> I don't this should be done in assembler. Assembler should just assemble
>> the assembly input.
>
> That adds quite a bit of complication to the compiler though -- getting the
> instruction lengths right (and thus proper packing & alignment) can be
> extremely difficult.  I did some experiments with this on a target with
> *fixed* instruction lengths a while back and even though the port tried hard
> to get lengths right, it would routinely miss something.  Ultimately I
> decided that it forcing the compiler to know instruction lengths with a very
> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
> instruction lengths just adds yet another complexity to the situation.  Then
> add the complication of needing to add specific prefixes or nops and it just
> gets downright ugly.
>
> I'd probably approach this by having the compiler emit a directive which
> states what the desired alignment at a particular point should be, then
> allow the assembler to select the best method to get the desired alignment.

Jeff,

This is exactly part of our binutils side of the proposal, which I'll
outline now

1. Allow multiple prefixes for ADDR and DS (and possibly others)
a) multiple prefixes are benign in certain modes and are thus chosen for padding
b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
more explicit (and they don't trigger a call to
md_flush_pending_output)

2. Add new pseudo-op to delineate alignment boundaries.  This is
needed to signal any dispatch engine (below) to pad.  Here are my top
two candidates, any feedback is appreciated:
a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
b) ".padalign" which calla a new "md_pad_align()"

3. Add dispatch optimization infrastructure which
a) is guarded by -mtune flag (and possibly other -f style flags)
b) tracks assembled instruction attributes and their fragments
c) can pad (insert benign prefixes) into previously assembled fragments
d) maintains dispatch engine state (according to some subset of Reza's rules)

Discussion:

The flags in 3a) should guard against these changes affecting current behavior.

The assembly tracking in 3b) is for bookkeeping only; the padding in
3c) would only occur when a compiler uses the pseudo-op in 2) or when
the dispatch engine in 3d) signals.

For compilers that know exactly how to pad for the new processor, the
ability to
pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.

For assembly programs and/or compilers that don't choose to do any
dispatch optimization, it's anticipated that the engine in 3d) would
be useful for optimizing for -mtune=bdver1

I'll post patches for these soon.
-- 
Quentin Neill