This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] nptl: change default stack guard size of threads


On 12/12/2017 12:49 AM, Jeff Law wrote:

No worries.  Richard E. can give you the background on the AArch64 side
of things.  I'll try to channel Richard's request. If I over-simplify or
mis-state something, Richard's view should be taken as "the truth".

[snip description of architectures where probing is “cheap”]

The closest thing we have on aarch64 is that the caller must have saved
LR into its stack.  But we have no idea the difference between where the
caller saved LR and the value of the stack pointer in the callee.

Thus to fully protect AArch64 we would have to emit many more probes
than on other architectures because we have to make worst case
assumptions at function entry.  This cost was deemed too high.

The key is that the outgoing argument space sits between the caller's LR
slot and the callee's frame.  So it was decided that certain
requirements would be placed on the caller and that the callee would be
able to make certain assumptions WRT whether or not the caller would
write into the outgoing argument area.

Thanks for posting this summary.

After analysis (of spec2k6 I believe) and review of the kernel's
guarantees WRT the guard size it was decided that the size of the guard
on aarch64 would be 64k.  That corresponds to 2 pages for a RHEL kernel.

1 page on Red Hat Enterprise Linux, I think. At least “getconf PAGE_SIZE” returns 65536, and I hope that's accurate. 8-)

(There are also some folks who assume that changing the kernel page size wouldn't be an ABI change at this point, something which I do not agree with.)

Or do you suggest a 128 KiB guard area to make the current aarch64 probing sequence safe in the presence of asynchronous signals?

It corresponds to 16 pages on a Fedora kernel.

Right.
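
To double-check the page counts above, here is a small sketch of the arithmetic. The helper name guard_pages is made up for illustration, and the 4 KiB figure for Fedora is inferred from "64k corresponds to 16 pages", not measured:

```python
import mmap

def guard_pages(guard_bytes, page_bytes):
    """Number of whole pages needed to cover guard_bytes (rounded up)."""
    return -(-guard_bytes // page_bytes)

GUARD = 64 * 1024  # the 64 KiB guard size discussed in this thread

# 64 KiB pages, as reported by "getconf PAGE_SIZE" on the RHEL aarch64 kernel
print(guard_pages(GUARD, 64 * 1024))   # 1
# 4 KiB pages, inferred for the Fedora kernel
print(guard_pages(GUARD, 4 * 1024))    # 16
# Page size of whatever machine this runs on
print(guard_pages(GUARD, mmap.PAGESIZE))
```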

And we can't switch to 64 KiB pages there because apparently, video memory sharing on the Raspberry Pi 3 would then need 512 MiB of RAM (not just address space), which is not an option for such a small device.

The caller would be responsible for ensuring that it always would
write/probe within 1k of the limit of its stack.  Thus the callee would
be able to allocate up to 63k without probing.  This essentially brings
the cost of probing down into the noise on AArch64.

(I assume this has been re-assessed after the change not to probe below the stack pointer (which might have added some extra cost).)

I'm a bit surprised that the 1K/3K split wouldn't achieve that (i.e., pushing the cost of probing into the noise). Is this because GCC wasn't built for this and has no way to recognize implicit probes which occur throughout the regular execution of a function? Or is the concern that we might skip the guard region if a signal arrives at an inopportune moment? (But gcc-4.8.5-25.el7.aarch64 is still not async-signal-safe because it decreases SP by 67 KiB before starting probing.)

Once probing, we probe at 4k intervals.  That fits nicely into the 12bit
shifted immediates available on aarch64.  In theory a larger probing
interval would reduce the cost of probing, but you'd have to twiddle the
sequences in the target files to get a scratch register in places where
they don't right now.
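
The scheme described above (64k guard, caller stays within 1k of its limit, 4k probe interval) can be modelled roughly as follows. This is a toy model, not GCC's actual probing sequence: the function name explicit_probes and the rule "one probe per started 4k interval once probing is needed" are simplifying assumptions made for illustration.

```python
KIB = 1024

def explicit_probes(frame_bytes, guard=64 * KIB, caller_slack=1 * KIB,
                    interval=4 * KIB):
    """Explicit probes a callee emits for a frame of frame_bytes (toy model)."""
    # The caller has written/probed within caller_slack of its stack limit,
    # so the callee may allocate this much without touching the stack itself.
    unprobed_limit = guard - caller_slack  # 63 KiB with the defaults
    if frame_bytes <= unprobed_limit:
        return 0
    # Assumption: once probing is needed, one probe per started interval.
    return -(-frame_bytes // interval)

for size in (512, 63 * KIB, 63 * KIB + 1, 128 * KIB):
    print(size, explicit_probes(size))
```

Under these assumptions, every frame up to 63k needs no explicit probe at all, which is why the cost is expected to disappear into the noise for typical functions.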

We all agreed that there's a bit of a hole where unprotected code
calling protected code could leave the stack pointer somewhere in the
guard page on aarch64 and be a vector for attack.  However, it's more
likely that in those scenarios the attacker has enough control on
the caller side that they'd just jump the guard in the caller.

Such assumptions still make sense to me.

So that's why things are the way they are.  Again, if I've gotten
something wrong, I apologize to you and Richard :-)


Likewise, I have no real desire for us to emit a bunch of extra operations
if we're not required to for glibc.
Agreed.  I think we all want probing to be low enough overhead that we
just enable it by default everywhere to get protected and nobody notices.

If assuming that 64k probes are sufficient on AArch64 is not going to allow
us a correct implementation, then we can't assume 64k probes on AArch64. My
understanding was that we were safe in this as the kernel was giving us a
generous 1MB to play with, and we could modify glibc to also give us 64k
(I admit, I had not considered ILP32, where you've rightly pointed out we
will eat lots of address space if we make this decision).

Richard E. explicitly took ILP32 off the table during our discussion.

The patch which started this libc-alpha thread applied the 64 KiB gap size to 32-bit architectures as well. This is the primary reason why I'm objecting strongly to it.

If aarch64 wants to do their own thing in 64-bit mode, I won't complain that much.

GCC needs to emit probe intervals for the smallest supported page size
on the target architecture.  If it does not do that, we end up in
trouble on the glibc side.

This is where I may have a misunderstanding, why would it require probing
at the smallest page size, rather than probing at a multiple of the guard
size? It is very likely I'm missing something here as I don't know the glibc
side of this at all.

I'm not sure where that statement comes from either.  I guess the
concern is someone could boot a kernel with a smaller page size and
perhaps the kernel/glibc create their guards based on # pages rather
than absolute size.  Thus booting a kernel with a smaller pagesize would
run code with less protection.

The existing ecosystem offers only one page size. The larger main stack guard region size provided by the kernel was a stop-gap measure, initially with the intent to patch nothing else, but it did not work out that way. I don't think 64 KiB would make a significant dent in terms of practical exploitability (all published exploits used larger frames or alloca-based frames anyway).

If GCC assumes more than one guard page, a lot of things need to change, and it is difficult to communicate under what conditions a binary has been properly hardened. If this is the cost for enabling -fstack-clash-protection by default on aarch64, then so be it, but we should not make these changes on architectures where they do not bring any tangible benefit or actually hurt (due to address space consumption).

Thanks,
Florian

