[PATCH 2/3] Linux: Use rseq in sched_getcpu if available (v9)
Mathieu Desnoyers
mathieu.desnoyers@efficios.com
Mon Jul 6 17:33:46 GMT 2020
----- On Jul 6, 2020, at 10:49 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> ----- On Jul 6, 2020, at 9:59 AM, Florian Weimer fweimer@redhat.com wrote:
>
>> * Mathieu Desnoyers:
>>
>>> When available, use the cpu_id field from __rseq_abi on Linux to
>>> implement sched_getcpu(). Fall-back on the vgetcpu vDSO if
>>> unavailable.
>>
>> I've pushed this to glibc master, but unfortunately it looks like this
>> exposes a kernel bug related to affinity mask changes.
>>
>> After building and testing glibc, this
>>
>> for x in {1..2000} ; do posix/tst-affinity-static & done
>>
>> produces some “error:” lines for me:
>>
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>>
>> “expected 0” is a result of how the test has been written, it bails out
>> on the first failure, which happens with CPU ID 0.
>>
>> Smaller systems can use a smaller count than 2000 to reproduce this. It
>> also happens sporadically when running the glibc test suite itself
>> (which is why it took further testing to reveal this issue).
>>
>> I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
>> Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
>> 4.18.0-193.el8 (all x86_64).
>>
>> As to the cause, I'd guess that the exit path in the sched_setaffinity
>> system call fails to update the rseq area, so that userspace can observe
>> the outdated CPU ID there.
>
> Hi Florian,
>
> We have a similar test in Linux, see tools/testing/selftests/rseq/basic_test.c.
> That test does not trigger this issue, even when executed repeatedly.
>
> I'll investigate further what is happening within the glibc test.
Here are the missing kernel bits. Just adding those in wake_up_new_task() is
enough to make the problem go away, but to err on the safe side I'm planning to
add an rseq_migrate within sched_fork:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1757381cabd4..ff83f790a9b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3043,6 +3043,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
* so use __set_task_cpu().
*/
__set_task_cpu(p, smp_processor_id());
+ rseq_migrate(p);
if (p->sched_class->task_fork)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -3103,6 +3104,7 @@ void wake_up_new_task(struct task_struct *p)
*/
p->recent_used_cpu = task_cpu(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+ rseq_migrate(p);
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
I'm not sure why your test catches it fairly quickly but ours does not. That's a good
catch.
Now we need to discuss how to introduce this fix in a way that allows user-space
to trust the content of the __rseq_abi.cpu_id field.
The usual approach to kernel bug fixing is to push the fix, mark it for
stable kernels, and expect everyone to pick it up. I wonder how comfortable
glibc would be replacing its sched_getcpu implementation with a broken-until-fixed
kernel rseq implementation, without any mechanism in place to know whether it can
trust the value of the cpu_id field. I am extremely reluctant to do so.
One possible way to introduce this fix would be to use the rseq flags argument to let
glibc query whether the kernel implements an rseq with a cpu_id field it can trust.
So glibc could do, for registration:
ret = INTERNAL_SYSCALL_CALL (rseq, &__rseq_abi, sizeof (struct rseq),
RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID,
RSEQ_SIG);
(I'm open to bike-shedding the actual flag name)
So if the kernel does not implement the fix, the registration would return EINVAL.
When getting EINVAL like this, we could then re-issue the rseq syscall:
ret = INTERNAL_SYSCALL_CALL (rseq, NULL, 0, RSEQ_FLAG_RELIABLE_CPU_ID, 0);
to check whether the EINVAL was caused by other invalid system call parameters,
or really because the RSEQ_FLAG_RELIABLE_CPU_ID flag is unknown. Being able
to distinguish between invalid parameters and unknown flags like this would end
up requiring one extra system call per failed registration attempt on kernels
whose rseq implementation lacks the fix.
This also means glibc would only really activate rseq on kernels implementing
this fix.
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com