Bug 33943 - gas: .prefalign directive for body-size-dependent function alignment
Status: NEW
Alias: None
Product: binutils
Classification: Unclassified
Component: gas
Version: unspecified
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2026-03-02 04:51 UTC by Fangrui Song
Modified: 2026-03-12 17:26 UTC

Description Fangrui Song 2026-03-02 04:51:59 UTC
Compilers emit `.p2align 4` (or similar) before functions to align them to a preferred boundary (e.g. 16 bytes on x86-64). This is good for large functions but wasteful for small ones: a 3-byte function padded to a 16-byte boundary wastes up to 15 bytes — 500% overhead.

LLVM recently introduced a new assembler directive, `.prefalign`, which computes an alignment value based on the size of the section.

```
#### current syntax with surprising behavior; to be revised
.section .text.f,"ax",@progbits
.p2align 2
# 3-byte function body: aligned to std::bit_ceil(3) = 4
.prefalign 16
f:
  ret
  ret
  ret

.section .text.g,"ax",@progbits
.p2align 2
# 16-byte function body: no-op
.prefalign 16
g:
  .space 16
```

The implementation does not actually align the current location, but simply increases the section alignment (ELF sh_addralign).
I find this behavior surprising and propose the following revision:

.prefalign <pref_align>, <end_sym>, nop
.prefalign <pref_align>, <end_sym>, <fill_byte>

- `pref_align`: the preferred (maximum) alignment, must be a power of 2
- `end_sym`: a symbol marking the end of the code body
- Third operand: `nop` for target-appropriate (variable-size) NOP fill, or an integer byte value `[0, 255]`

The assembler computes `body_size = end_sym - (directive_location + padding)` during relaxation and determines the alignment:

- `body_size < pref_align`: align to `std::bit_ceil(body_size)` (the smallest integral power of two that is not smaller than `body_size`). The alignment is 1 for a body_size of 0 or 1.
- `body_size >= pref_align`: align to `pref_align`
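The two cases can be restated as a small Python sketch. The names `bit_ceil` and `prefalign_alignment` are illustrative only and are not part of gas or LLVM; `bit_ceil` is meant to mirror C++20 `std::bit_ceil`.

```python
def bit_ceil(n: int) -> int:
    """Smallest power of two >= n; 1 for n <= 1 (mirrors C++20 std::bit_ceil)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def prefalign_alignment(body_size: int, pref_align: int) -> int:
    """Alignment selected by the proposed rule (hypothetical helper, not a real API)."""
    if pref_align <= 0 or pref_align & (pref_align - 1):
        raise ValueError("pref_align must be a power of 2")
    if body_size < pref_align:
        # Small body: round the size up to a power of two.
        return bit_ceil(body_size)
    # Large body: cap at the preferred (maximum) alignment.
    return pref_align
```

Under this sketch a 3-byte body with `pref_align` 16 gets alignment 4, while 16-byte and 32-byte bodies both get alignment 16.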

The rationale for the small-body rule: if the cache block size is 64 and the goal is to minimize cache block crossings, aligning to `min(64, bit_ceil(body_size))` is the minimum alignment that prevents an unnecessary boundary crossing.
For example, a 12-byte function aligned to bit_ceil(12) = 16 cannot straddle a 64-byte boundary.
A 3-byte function aligned to bit_ceil(3) = 4 cannot straddle a 64-byte boundary.
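The claim can be checked by brute force. The following Python sketch assumes a 64-byte cache block, as in the text; the helper names are made up for illustration.

```python
CACHE_LINE = 64  # cache block size assumed in the text


def bit_ceil(n: int) -> int:
    """Smallest power of two >= n; 1 for n <= 1."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def crosses_boundary(start: int, size: int, boundary: int = CACHE_LINE) -> bool:
    """True if the bytes [start, start + size) touch two boundary-sized blocks."""
    return size > 0 and start // boundary != (start + size - 1) // boundary


def never_straddles(size: int) -> bool:
    """Try every suitably aligned start address within a few cache blocks."""
    align = min(CACHE_LINE, bit_ceil(size))
    return all(
        not crosses_boundary(start, size)
        for start in range(0, 4 * CACHE_LINE, align)
    )
```

Every body size from 0 to 64 passes `never_straddles`. A weaker alignment is not enough: a 12-byte body aligned only to 8 can start at offset 56 and cross the boundary at 64.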

To enforce a minimum alignment independently, users emit both `.p2align` and `.prefalign`.

**Prior art**

GCC's `-flimit-function-alignment` partially addresses this by capping the `.p2align` max-skip operand based on function size. However, the max-skip operand is evaluated at parse time, so it cannot reference a forward label:

```asm
# Not supported — forward reference in max-skip
.p2align 4, , end - start
start:
  nop
end:
```

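Even with a max-skip expression, `.p2align` is all-or-nothing: it either pads to the full boundary or does not align at all, so it cannot achieve size-proportional alignment. Evaluating max-skip at parse time also forces a premature size calculation by the compiler that ignores assembler-side adjustments (e.g., span-dependent instruction relaxation).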
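A minimal Python model of the documented max-skip behavior of `.p2align` (no alignment is performed when the required padding exceeds max-skip) shows the limitation. The function name and the max-skip value 2 used in the note below are arbitrary illustrations, not what GCC necessarily emits.

```python
from typing import Optional


def p2align_padding(location: int, log2_align: int,
                    max_skip: Optional[int] = None) -> int:
    """Padding inserted by `.p2align log2_align, , max_skip` at `location`."""
    align = 1 << log2_align
    padding = (-location) % align
    if max_skip is not None and padding > max_skip:
        # Too many bytes would be skipped: the directive does nothing.
        return 0
    return padding
```

At location 1, `.p2align 4` pads 15 bytes, while `.p2align 4, , 2` pads nothing and leaves the function at offset 1. The size-dependent rule proposed above would instead place a 3-byte body at offset 4.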


**Example**

```asm
.section .text.f,"ax",@progbits
.prefalign 16, .Lf1_end, nop
# 3-byte function body: aligned to std::bit_ceil(3) = 4
nop
nop
nop
.Lf1_end:
.prefalign 16, .Lf2_end, nop
# 32-byte function body: aligned to 16
...
.Lf2_end:
```
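The resulting layout can be simulated in Python. This sketch assumes the section starts at offset 0 and that body sizes are fixed (no span-dependent relaxation inside the bodies); `layout` is a hypothetical helper, not assembler code.

```python
def bit_ceil(n: int) -> int:
    """Smallest power of two >= n; 1 for n <= 1."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def layout(body_sizes, pref_align, start=0):
    """Return one (padding, start_offset, alignment) tuple per body."""
    loc = start
    placed = []
    for size in body_sizes:
        align = pref_align if size >= pref_align else bit_ceil(size)
        padding = (-loc) % align  # bytes needed to reach the next multiple
        loc += padding
        placed.append((padding, loc, align))
        loc += size  # the next directive starts right after this body
    return placed
```

For the example above, `layout([3, 32], 16)` places the 3-byte body at offset 0 with alignment 4 and the 32-byte body at offset 16 after 13 bytes of padding.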

**Implementation notes**

- The directive creates a new fragment type whose size is determined iteratively during the relaxation loop, with the body-size-dependent rule.
- The fill operand is required to make the intent explicit (NOP fill for code, zero/byte fill for data).
- For targets with linker relaxation (e.g. RISC-V), `.prefalign` padding is fully resolved at assembly time and does not require `R_RISCV_ALIGN`-style relocations.

**References**

- LLVM RFC: https://discourse.llvm.org/t/rfc-enhancing-function-alignment-attributes/88019
- Blog post with detailed analysis: https://maskray.me/blog/2025-08-24-understanding-alignment-from-source-to-object-file
Comment 1 Jan Beulich 2026-03-12 07:48:01 UTC
(In reply to Fangrui Song from comment #0)
> Compilers emit `.p2align 4` (or similar) before functions to align them to a
> preferred boundary (e.g. 16 bytes on x86-64). This is good for large
> functions but wasteful for small ones: a 3-byte function padded to a 16-byte
> boundary wastes up to 15 bytes — 500% overhead.

Hmm, for me 16 - 3 = 13.

As you're considering compilers, wouldn't such very small functions generally best be inlined? And even if not, as that's not possible when e.g. a library has to provide an implementation, won't compilers' size estimates for extremely small functions generally be correct?

> .prefalign <pref_align>, <end_sym>, nop
> .prefalign <pref_align>, <end_sym>, <fill_byte>
> 
> - `pref_align`: the preferred (maximum) alignment, must be a power of 2

Why not make it .p2align-like, requiring the power of 2 to be specified?

> - `end_sym`: a symbol marking the end of the code body
> - Third operand: `nop` for target-appropriate (variable-size) NOP fill, or
> an integer byte value `[0, 255]`

This looks x86-centric. The padding in code may want to be something else than NOP, yet that can't be specified by <fill_byte> if insn size / granularity is larger than a byte. This would need to be a fill pattern of (at least) insn granularity size.

> To enforce a minimum alignment independently, users emit both `.p2align` and
> `.prefalign`.

I consider the need to use two directives as problematic. What if they're not sitting back to back?

And then - how is this going to work for targets aiming mainly at link-time relaxation (RISC-V for example)?
Comment 2 Fangrui Song 2026-03-12 17:26:04 UTC
(In reply to Jan Beulich from comment #1)
> (In reply to Fangrui Song from comment #0)
> > Compilers emit `.p2align 4` (or similar) before functions to align them to a
> > preferred boundary (e.g. 16 bytes on x86-64). This is good for large
> > functions but wasteful for small ones: a 3-byte function padded to a 16-byte
> > boundary wastes up to 15 bytes — 500% overhead.
> 
> Hmm, for me 16 - 3 = 13.

If the current location is 1 mod 16, `.p2align 4` advances the location to 0 mod 16, requiring 15 bytes of padding.
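As a quick arithmetic check in Python (variable names are illustrative), the padding for each possible residue of the current location modulo 16 is:

```python
ALIGN = 16  # boundary requested by `.p2align 4`

# Padding needed for each residue of the current location modulo ALIGN.
padding_by_residue = {r: (-r) % ALIGN for r in range(ALIGN)}

# Residue that needs the most padding.
worst = max(padding_by_residue, key=padding_by_residue.get)
```

Here `worst` is 1 with 15 bytes of padding. The 13-byte figure corresponds to residue 3, i.e. a location right after a 3-byte function that itself started on a 16-byte boundary.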

> As you're considering compilers, wouldn't such very small functions
> generally best be inlined? And even if not, as that's not possible when e.g.
> a library has to provide an implementation, won't compilers' size estimates
> for extremely small functions generally be correct?

Small functions exist in practice more often than one might expect.
For example, C++ virtual tables take function addresses, making many small functions necessary.

The original author of the .prefalign directive in LLVM also designed LLVM's LTO-based Control-flow Integrity. CFI can result in many more small functions.

To enforce integrity, an indirect call like `call *fptr` is replaced by a protective stub. This stub performs two primary actions:

* verifies that the target address is a member of the valid target set and converts that address into a small index.
* dereferences the specific jump table entry and jumps to the target.

This produces many stubs, appearing either as simple relays (foo_jt: jmp foo) or as fully inlined versions of the target function (foo_jt: <inlined-body-of-foo>).

Why foo_jt can't just be foo: foo's address is not guaranteed to fall inside the jump table's contiguous region. For example, foo can appear in multiple indirect call sites.

This proliferation of jump stubs was the primary motivation for the author to introduce the ld.lld --branch-to-branch optimization https://github.com/llvm/llvm-project/pull/138366

> > .prefalign <pref_align>, <end_sym>, nop
> > .prefalign <pref_align>, <end_sym>, <fill_byte>
> > 
> > - `pref_align`: the preferred (maximum) alignment, must be a power of 2
> 
> Why not make it .p2align-like, requiring the power of 2 to be specified?

I'm fine with

.prefalign <log2_align>, <end_sym>, nop

to save a logarithm in the assembler and be consistent with the popular .p2align directive.

> 
> > - `end_sym`: a symbol marking the end of the code body
> > - Third operand: `nop` for target-appropriate (variable-size) NOP fill, or
> > an integer byte value `[0, 255]`
> 
> This looks x86-centric. The padding in code may want to be something else
> than NOP, yet that can't be specified by <fill_byte> if insn size /
> granularity is larger than a byte. This would need to be a fill pattern of
> (at least) insn granularity size.

The byte fill is for x86 int3 and for non-code uses (almost always 0).

We can require

.prefalign 4, end, 00
.prefalign 4, end, cc

Then if multi-byte fills are ever needed,

.prefalign 4, end, 0001
.prefalign 4, end, 00010203
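A Python sketch of how such a hex pattern might be expanded into padding bytes. The helper `expand_fill` and the rule that the padding must be a whole multiple of the pattern size are assumptions made for illustration; nothing in this proposal specifies that case yet.

```python
def expand_fill(pattern_hex: str, padding: int) -> bytes:
    """Repeat a hex fill pattern (e.g. "cc" or "00010203") over `padding` bytes."""
    pattern = bytes.fromhex(pattern_hex)
    if not pattern:
        raise ValueError("empty fill pattern")
    if padding % len(pattern):
        # Assumption: partial patterns are rejected rather than truncated.
        raise ValueError("padding is not a multiple of the pattern size")
    return pattern * (padding // len(pattern))
```

With this model, `expand_fill("cc", 3)` yields three `0xcc` bytes and `expand_fill("0001", 4)` yields `00 01 00 01`.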

> > To enforce a minimum alignment independently, users emit both `.p2align` and
> > `.prefalign`.
> 
> I consider the need to use two directives as problematic. What if they're
> not sitting back to back?

The two are composable: .p2align guarantees a floor, .prefalign adds size-proportional alignment on top.
Merging them into one directive would either duplicate .p2align's functionality or complicate the semantics.
The separation mirrors how the latest LLVM thinks about alignment: align(N) (minimum, mandatory) vs prefalign(N) (preferred, size-dependent).

> And the - how is this going to work for targets aiming at mainly link-time
> relaxation (RISC-V for example)?

.prefalign's forward reference to end_sym is resolved within the assembler's own iterative relaxation, not deferred to the linker.
If the linker later shrinks instructions (e.g. auipc+jalr -> jal), the assembler-inserted padding remains.
I think RISC-V and LoongArch derive slightly less benefit from the .prefalign directive.