Scheduling x86 dispatch windows

Thu Jun 10 21:08:00 GMT 2010

On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
>> On 06/10/10 13:52, H.J. Lu wrote:
>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>> <quentin.neill.gnu@gmail.com>  wrote:
>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>> is relevant - s ee the last few paragraphs regarding how to
>>>> "solve the alignment problem".
>>>>
>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>
>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>>  wrote:
>>>>> Hi,
>>>>>
>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>> feature requires a certain mix, ordering and alignments in
>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>
>>>>> I am asking the community to review this high level implementation
>>>>> design and give me direction or advice.
>>>>>
>>>>> The new hardware issues two windows of the size N bytes of
>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>> windows have the right combination of instructions or alignments. Our
>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>> alignments.
>>>>>
>>>>> Here is a summary of the most important requirements:
>>>>>
>>>>> a) Maximum of N instructions per window.
>>>>> b) An instruction may cross the first window.
>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>    stores .
>>>>> d) The total number of immediate constants in the instructions
>>>>>    of a window should not exceed k.
>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>> f) A Window set terminates when a branch exists in a window.
>>>>> g) The number of allowed prefixes varies for instructions.
>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>
>>>>> We have the following implementation plan for GCC:
>>>>>
>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>    once before and once after register allocation as usual. In both
>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>    instruction scheduling.
>>>>>
>>>>> The advantage of doing it before register allocation is avoiding
>>>>> extra dependencies caused by register allocation which may become
>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>> it after register allocation is a consideration for spilling code
>>>>> which may be generated by the register allocator.
>>>>>
>>>>> The algorithm we use is:
>>>>>
>>>>> a) Considering the current dispatch window set, choose the first
>>>>>    instruction from ready queue that does not violate dispatch rules.
>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>    the window set by closing and opening new virtual dispatch windows.
>>>>>
>>>>> 2) Insertion of alignment code.
>>>>>
>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>> information such as sizes of branches are unknown until assembly or
>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>> exists in the assembler.
>>>>>
>>>>> There are two possible approaches to solve alignment problem.
>>>>>
>>>>> a)  Let the assembler performs the alignments and padding needed
>>>>>     to adhere with the new machine dispatching rules and avoid an extra
>>>>>     pass in GCC.
>>>>> b)  Add a new pass to mimic what assembler does before generating
>>>>>     the assembly listing in GCC and insert the required alignments.
>>>>>
>>>>> I appreciate your comments on the proposed implementation procedure
>>>>> and the choices a or b above.
>>>>>
>>>
>>> I don't this should be done in assembler. Assembler should just assemble
>>> the assembly input.
>>
>> That adds quite a bit of complication to the compiler though -- getting the
>> instruction lengths right (and thus proper packing & alignment) can be
>> extremely difficult.  I did some experiments with this on a target with
>> *fixed* instruction lengths a while back and even though the port tried hard
>> to get lengths right, it would routinely miss something.  Ultimately I
>> decided that it forcing the compiler to know instruction lengths with a very
>> high degree of accuracy wasn't a sane thing to do.    Dealing with variable
>> instruction lengths just adds yet another complexity to the situation.  Then
>> add the complication of needing to add specific prefixes or nops and it just
>> gets downright ugly.
>>
>> I'd probably approach this by having the compiler emit a directive which
>> states what the desired alignment at a particular point should be, then
>> allow the assembler to select the best method to get the desired alignment.
>
> Jeff,
>
> This is exactly part of our binutils side of the proposal, which I'll
> outline now
>
> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
> a) multiple prefixes are benign in certain modes and are thus chosen for padding
> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
> more explicit (and they don't trigger a call to
> md_flush_pending_output)
>
> 2. Add new pseudo-op to delineate alignment boundaries.  This is
> needed to signal any dispatch engine (below) to pad.  Here are my top
> two candidates, any feedback is appreciated:
> a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
> b) ".padalign" which calla a new "md_pad_align()"
>
> 3. Add dispatch optimization infrastructure which
> a) is guarded by -mtune flag (and possibly other -f style flags)
> b) tracks assembled instruction attributes and their fragments
> c) can pad (insert benign prefixes) into previously assembled fragments
> d) maintains dispatch engine state (according to some subset of Reza's rules)
>
> Discussion:
>
> The flags in 3a) should guard against these changes affecting current behavior.
>
> The assembly tracking in 3b) is for bookkeeping only; the padding in
> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
> the dispatch engine in 3d) signals.
>
> For compilers that know exactly how to pad for the new processor, the
> ability to
> pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.
>
> For assembly programs and/or compilers that don't choose to do any
> dispatch optimization, it's anticipated that the engine in 3d) would
> be useful for optimizing for -mtune=bdver1
>
> I'll post patches for these soon.

Can you do it with directives only?

-- 
H.J.