runtime: fix panics when polling on the control channel while unloading
When the stapio pselect() runs while the given stap module is unloading,
there's a use-after-free opportunity in do_select(). This occurs because
the control channel's poll function, _stp_ctl_poll_cmd(), passes a
pointer to a global variable along to do_select(), which can then
dereference the pointer after the stap module is unloaded.
Normally, this wouldn't be a problem because do_select() uses get_file()
and fput(), which respectively grab and release references to the module
owner specified in `file->f_op->owner`. However, procfs doesn't provide
any interface to pass in a module owner, and instead all procfs files
use an internal `struct file_operations` declared in fs/proc/inode.c.
As a result, we cannot bolster procfs files with module reference
count protection through any normal means, so we must inject a module
owner the hard way.
A module owner is now patched into the control channel's file ops when
the file is opened by making a copy of the existing file ops and then
setting the module owner inside the copy, which then replaces the old
`file->f_op` pointer. This neatly fixes the race because procfs *does*
guarantee that none of the procfs callback functions are still running
after an entry is removed, and because _stp_ctl_poll_cmd() cannot be
reached without first passing through _stp_ctl_open_cmd().
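
The copy-and-patch step described above can be modelled in userspace roughly as follows. This is only a sketch: the struct definitions are simplified stand-ins for the kernel types, `inject_owner()` and `proc_ops_model` are hypothetical names, and allocating the copy with malloc() on each open is an assumption rather than a description of what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified userspace stand-ins for the kernel types (not the real ones). */
struct module {
    const char *name;
};

struct file_operations {
    struct module *owner;
    int (*poll)(void);
};

struct file {
    const struct file_operations *f_op;
};

static int dummy_poll(void)
{
    return 0;
}

/* Models the shared procfs ops table, which carries no module owner. */
static const struct file_operations proc_ops_model = {
    .owner = NULL,
    .poll = dummy_poll,
};

/*
 * Hypothetical helper: copy the file's current ops, set the owner inside
 * the copy, and point the file at the copy. The shared table is left
 * untouched. Returns 0 on success, -1 if the allocation fails.
 */
static int inject_owner(struct file *file, struct module *owner)
{
    struct file_operations *copy;

    assert(file != NULL && file->f_op != NULL);

    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;

    memcpy(copy, file->f_op, sizeof(*copy));
    copy->owner = owner;
    file->f_op = copy;
    return 0;
}
```

In this model the caller owns the copy and must free it when the file is released; the shared table keeps its NULL owner, so other files using it are unaffected.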
Since delete_module() can now return EWOULDBLOCK, we must make staprun
aware that it's not a fatal error and that the module deletion should
be retried. EWOULDBLOCK simply indicates that a pselect() on the control
channel has yet to finish, so it will go away after a brief wait.
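
The retry behaviour can be sketched in plain C as below. This is a minimal illustration, not the staprun change itself: `fake_delete_module()` is a hypothetical stand-in for the real syscall, and the 1 ms delay and the retry limit are assumed values.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Number of calls that will still fail while the simulated pselect() runs. */
static int busy_polls_left = 3;

/*
 * Hypothetical stand-in for delete_module(): fails with EWOULDBLOCK until
 * the simulated pselect() on the control channel has finished.
 */
static int fake_delete_module(const char *name)
{
    (void)name;
    if (busy_polls_left > 0) {
        busy_polls_left--;
        errno = EWOULDBLOCK;
        return -1;
    }
    return 0;
}

/*
 * Retry the removal while it reports EWOULDBLOCK; any other error is
 * treated as fatal. Returns 0 on success, -1 with errno set otherwise.
 */
static int remove_module_with_retry(const char *name, int max_tries)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000 * 1000 }; /* 1 ms */
    int i;

    assert(max_tries > 0);

    for (i = 0; i < max_tries; i++) {
        if (fake_delete_module(name) == 0)
            return 0;
        if (errno != EWOULDBLOCK)
            return -1;          /* genuine failure, do not retry */
        nanosleep(&delay, NULL);
    }

    errno = EWOULDBLOCK;        /* still busy after max_tries attempts */
    return -1;
}
```

Bounding the number of attempts keeps the caller from spinning forever if the reference is never dropped.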
This fixes the following panic:
BUG: unable to handle kernel paging request at ffffffffc0914030
PGD 79820c067 P4D 79820c067 PUD 79820e067 PMD 3f9ee6067 PTE 0
Oops: 0002 [#1] SMP PTI
CPU: 6 PID: 1636475 Comm: stapio Kdump: loaded Tainted: G OE 4.19.91-22.2.al7.x86_64 #1
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
RSP: 0018:ffffb9fb0e45f980 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffb9fb0e45faf0 RDI: ffffffffc0914030
RBP: ffffffffc0914030 R08: 0000000000000001 R09: ffff973fa8924000
R10: 0000000000000104 R11: 0000000000000041 R12: 0000000000000000
R13: ffffb9fb0e45fab0 R14: 000000000000000f R15: 000000000000000f
FS: 00007effdcf53740(0000) GS:ffff97409fb80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc0914030 CR3: 0000000522d42003 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
remove_wait_queue+0x14/0x60
poll_freewait+0x37/0xa0
do_select+0x650/0x740
? compat_poll_select_copy_remaining+0x110/0x110
? kvm_sched_clock_read+0xd/0x20
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xa0
? select_idle_sibling+0x28/0x400
? account_entity_enqueue+0x9c/0xd0
? enqueue_entity+0x71f/0xc80
? __switch_to_asm+0x35/0x70
? enqueue_task_fair+0xd2/0x9b0
? remove_entity_load_avg+0x27/0x70
? check_preempt_curr+0x6b/0x90
? ttwu_do_wakeup+0x19/0x150
? try_to_wake_up+0x219/0x580
core_sys_select+0x1e2/0x320
? audit_filter_inodes+0x1f/0xf0
? audit_filter_syscall.constprop.11+0x8c/0xd0
? __audit_syscall_exit+0x1fd/0x290
? kvm_clock_get_cycles+0xd/0x10
? ktime_get_ts64+0x46/0xf0
__se_sys_pselect6+0xf6/0x1b0
do_syscall_64+0x5b/0x1b0
entry_SYSCALL_64_after_hwframe+0x44/0xa9