This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH resend] MIPS: Allow FPU emulator to use non-stack area.


On 10/06/2014 01:54 PM, Rich Felker wrote:
On Mon, Oct 06, 2014 at 01:23:30PM -0700, David Daney wrote:
From: David Daney <david.daney@cavium.com>

In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.

We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.

Background:

MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU be emulated by the kernel.  Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions.  Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.

Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack.  It is de facto,
because it is not written anywhere that this must be the case, but it
is nevertheless a requirement.

Problem:

How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?

Since userspace desires to change the ABI, put some of the onus on the
userspace code.  Any userspace thread desiring a non-executable stack
must allocate a 4-byte aligned area at least 8 bytes long that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.
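
The requirements stated above (4-byte alignment, at least 8 bytes,
read/write/execute permissions) can be sketched as a small check. This is
only an illustration of the stated rules, not code from the patch: the
function name xol_area_problems is hypothetical, and the protection bits
are the usual Linux mmap values.

```python
# Linux mmap protection bits (numeric values as in <sys/mman.h> on Linux).
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4

XOL_ALIGN = 4    # from the patch description: 4-byte aligned
XOL_MIN_LEN = 8  # from the patch description: at least 8 bytes long


def xol_area_problems(addr, length, prot):
    """Hypothetical helper: list the stated rules that an area violates."""
    problems = []
    if addr % XOL_ALIGN != 0:
        problems.append("address is not 4-byte aligned")
    if length < XOL_MIN_LEN:
        problems.append("area is shorter than 8 bytes")
    needed = PROT_READ | PROT_WRITE | PROT_EXEC
    if (prot & needed) != needed:
        problems.append("area is not read/write/execute")
    return problems
```

An empty list means the area meets all three stated conditions; whether
the proposed system call would perform exactly these checks is an
assumption.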

This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.

Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.

That is certainly one way of looking at it. It is really opinion rather than fact, though.

glibc is full of code (see ld.so) that in earlier incarnations of Unix/Linux was in kernel space and was moved to userspace. Given that there is a partitioning of code between kernel space and userspace, I think it is not totally unreasonable to consider doing some of this in userspace.

Even on systems with a hardware FPU, the architecture specification allows for/requires emulation of certain cases (denormals, etc.). So it is already a requirement that userspace cooperate by always having free space below $SP for use by the kernel. So the current situation is that userspace is providing services for the kernel FPU emulator.

My suggestion is to change the nature of the way these services are provided by the userspace program.

The kernel is perfectly capable of mapping
an appropriate page. The mapping should happen at exec time, and at
clone time with CLONE_VM

Why? This adds overhead for threads that don't use the FPU. So this suggestion adds at least one page of memory overhead for each thread in the system (unless I misunderstand what you are saying).

unless the kernel is going to handle mutual
exclusion so that only one thread can be using the page at a time.
(Using one page for the whole process, and excluding simultaneous
execution of fpu emulation in multiple threads, may be the more
practical approach.)
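
The one-page-per-process idea with mutual exclusion can be modelled in a
few lines. This is a userspace toy, assuming a single 8-byte slot guarded
by a lock; the class and function names are hypothetical and nothing here
reflects how the kernel would actually implement it.

```python
import threading


class SharedXolSlot:
    """Toy model: one emulation slot shared by all threads of a process."""

    def __init__(self, size=8):
        self._lock = threading.Lock()
        self._buf = bytearray(size)
        self._active = 0
        self.max_concurrent = 0  # highest number of simultaneous users seen

    def emulate(self, insn_pair):
        # Only one thread may place its instructions in the slot at a time.
        with self._lock:
            self._active += 1
            self.max_concurrent = max(self.max_concurrent, self._active)
            self._buf[:] = insn_pair    # "place" the instruction pair
            result = bytes(self._buf)   # stand-in for executing it
            self._active -= 1
            return result


def run_threads(slot, n):
    """Run n threads through the slot; report whether each saw its own data."""
    ok = [False] * n

    def worker(i):
        pair = i.to_bytes(4, "big") * 2  # two fake 4-byte instructions
        ok[i] = slot.emulate(pair) == pair

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ok
```

The cost of this design is that emulation in one thread blocks emulation
in every other thread of the process for as long as the slot is held.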

As an alternative, if the space of possible instructions with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.


Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus you need a way to exit after the instruction has executed, which would require another instruction. So you would need 32GB of memory to hold all those instructions, larger than the 32-bit virtual address space.
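
The arithmetic behind that figure can be checked directly. The sketch
below assumes "GB" is read as 2^30 bytes and that every 32-bit pattern
gets its own slot of two 4-byte instructions, as described above; the
variable names are illustrative only.

```python
INSTRUCTION_BYTES = 4                 # each MIPS instruction is 4 bytes
SLOT_BYTES = 2 * INSTRUCTION_BYTES    # the instruction plus one to exit
POSSIBLE_INSTRUCTIONS = 2 ** 32       # every 32-bit pattern

GIB = 2 ** 30
table_bytes = POSSIBLE_INSTRUCTIONS * SLOT_BYTES
address_space_bytes = 2 ** 32         # 32-bit virtual address space

table_gib = table_bytes // GIB
address_space_gib = address_space_bytes // GIB
print(table_gib, address_space_gib)
```

Under these assumptions the table is eight times the size of the whole
32-bit virtual address space.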

Rich


