Bug 30777 - Systemtap modules unable to run on systemtap supporting Intel CET IBT
Summary: Systemtap modules unable to run on systemtap supporting Intel CET IBT
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-17 15:37 UTC by William Cohen
Modified: 2023-08-29 15:53 UTC
CC List: 0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
A sample module demonstrating a workaround approach (1009 bytes, application/gzip)
2023-08-22 15:26 UTC, William Cohen
Partial implementation for the kallsyms_* indirect calls in runtime/sym.c (1.35 KB, patch)
2023-08-24 19:10 UTC, William Cohen
This patch works well enough on Intel IBT machine to allow testing of the examples (3.05 KB, patch)
2023-08-27 23:40 UTC, William Cohen

Description William Cohen 2023-08-17 15:37:51 UTC
When attempting to run systemtap on Intel 11th generation processors where the kernel has IBT (Indirect Branch Tracking) support, the kernel will trap the instrumentation's call to kallsyms_lookup_name and the module will fail to run.  One can recreate this with the trivial:

stap -ve 'probe begin{printf("hello\n")}'

One will then see messages in the dmesg output about the "Missing ENDBR" in kallsyms_lookup_name.  One can disable the IBT support by adding "ibt=off" or "clearcpuid=596" to the kernel boot parameters, which allows the systemtap scripts to run.  Below is the dmesg output for the reproducer above.

[72701.193840] kallsyms_lookup_name is ffffffff81206980
[72701.193844] traps: Missing ENDBR: kallsyms_lookup_name+0x0/0xd0
[72701.193850] ------------[ cut here ]------------
[72701.193850] kernel BUG at arch/x86/kernel/traps.c:257!
[72701.193854] invalid opcode: 0000 [#2] PREEMPT SMP NOPTI
[72701.193855] CPU: 4 PID: 31078 Comm: stapio Tainted: P      D    OE      6.4.10-200.fc38.x86_64 #1
[72701.193857] Hardware name: LENOVO 20Y4S1QE00/20Y4S1QE00, BIOS N40ET41W (1.23 ) 05/11/2023
[72701.193858] RIP: 0010:exc_control_protection+0xb8/0xc0
[72701.193861] Code: 48 8b 93 80 00 00 00 be fe 00 00 00 48 c7 c7 86 37 85 82 e8 1a 47 17 ff e9 7b ff ff ff 48 c7 43 50 00 00 00 00 e9 6e ff ff ff <0f> 0b 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[72701.193879] RSP: 0018:ffffb466a81d7d08 EFLAGS: 00010002
[72701.193881] RAX: 0000000000000033 RBX: ffffb466a81d7d28 RCX: 0000000000000027
[72701.193882] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff99191f521540
[72701.193883] RBP: 0000000000000003 R08: 0000000000000000 R09: ffffb466a81d7bb0
[72701.193883] R10: 0000000000000003 R11: ffffffff83146508 R12: 0000000000000000
[72701.193884] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[72701.193885] FS:  00007f2ff1fcb040(0000) GS:ffff99191f500000(0000) knlGS:0000000000000000
[72701.193886] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[72701.193887] CR2: 0000000000417850 CR3: 000000033905e006 CR4: 0000000000f70ee0
[72701.193888] PKRU: 55555554
[72701.193889] Call Trace:
[72701.193891]  <TASK>
[72701.193892]  ? die+0x36/0x90
[72701.193894]  ? do_trap+0xda/0x100
[72701.193895]  ? exc_control_protection+0xb8/0xc0
[72701.193897]  ? do_error_trap+0x6a/0x90
[72701.193898]  ? exc_control_protection+0xb8/0xc0
[72701.193899]  ? exc_invalid_op+0x50/0x70
[72701.193900]  ? exc_control_protection+0xb8/0xc0
[72701.193901]  ? asm_exc_invalid_op+0x1a/0x20
[72701.193905]  ? exc_control_protection+0xb8/0xc0
[72701.193906]  ? exc_control_protection+0x6e/0xc0
[72701.193907]  asm_exc_control_protection+0x26/0x30
[72701.193909] RIP: 0010:kallsyms_lookup_name+0x0/0xd0
[72701.193912] Code: 79 0a 48 f7 d0 48 03 05 d6 41 5b 01 c3 cc cc cc cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <66> 0f 1f 00 0f 1f 44 00 00 53 48 83 ec 10 65 48 8b 04 25 28 00 00
[72701.193913] RSP: 0018:ffffb466a81d7dd0 EFLAGS: 00010246
[72701.193915] RAX: ffffffff81206980 RBX: ffffffffc193c4ed RCX: 0000000000000000
[72701.193916] RDX: 0000000000000000 RSI: ffff99191f521540 RDI: ffffffffc193c4b3
[72701.193917] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffb466a81d7c80
[72701.193918] R10: 0000000000000003 R11: ffffffff83146508 R12: 0000000000000000
[72701.193919] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[72701.193923]  ? __pfx_kallsyms_lookup_name+0x10/0x10
[72701.193926]  kallsyms_lookup_name+0x38/0x80 [stap_e6e398b6d5dd95e72807a5b0212b03_31078]
[72701.193933]  _stp_ctl_write_cmd+0x462/0xb70 [stap_e6e398b6d5dd95e72807a5b0212b03_31078]
[72701.193937]  ? inode_security+0x22/0x60
[72701.193940]  proc_reg_write+0x57/0xa0
[72701.193943]  vfs_write+0xe5/0x3f0
[72701.193946]  ? __x64_sys_rt_sigprocmask+0x83/0xe0
[72701.193948]  ? syscall_exit_to_user_mode+0x1b/0x40
[72701.193951]  ? do_syscall_64+0x6c/0x90
[72701.193953]  ? __fget_light+0x99/0x100
[72701.193956]  ksys_write+0x6f/0xf0
[72701.193957]  do_syscall_64+0x5d/0x90
[72701.193959]  ? exc_page_fault+0x7f/0x180
[72701.193961]  entry_SYSCALL_64_after_hwframe+0x77/0xe1
[72701.193963] RIP: 0033:0x7f2ff20cd19d
[72701.193983] Code: e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 f8 78 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 4f 79 f8 ff 48 8b
[72701.193984] RSP: 002b:00007ffcab001750 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[72701.193985] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f2ff20cd19d
[72701.193986] RDX: 000000000000000c RSI: 00007ffcab001780 RDI: 0000000000000004
[72701.193986] RBP: 00007ffcab001770 R08: 0000000000000000 R09: 00007ffcab000947
[72701.193987] R10: 0000000000000008 R11: 0000000000000293 R12: 00007ffcab001be0
[72701.193988] R13: 0000000000000000 R14: 0000000000000001 R15: 00007ffcab001c64
[72701.193989]  </TASK>
[72701.193990] Modules linked in: stap_e6e398b6d5dd95e72807a5b0212b03_31078(OE) hellokernel(POE+) tls rfcomm snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc vfat fat snd_ctl_led snd_soc_skl_hda_dsp snd_soc_intel_hda_dsp_common snd_soc_hdac_hdmi snd_sof_probes iwlmvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_soc_dmic mac80211 snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof libarc4 snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp snd_compress ac97_bus kvm_intel snd_pcm_dmaengine snd_hda_intel kvm snd_intel_dspcfg
[72701.194017]  snd_intel_sdw_acpi snd_hda_codec uvcvideo iwlwifi mei_pxp iTCO_wdt btusb mei_hdcp snd_hda_core mei_wdt btrtl uvc videobuf2_vmalloc btbcm videobuf2_memops intel_pmc_bxt videobuf2_v4l2 snd_hwdep irqbypass btintel videobuf2_common btmtk rapl ee1004 snd_seq thinkpad_acpi iTCO_vendor_support intel_rapl_msr intel_cstate videodev cfg80211 mei_me snd_seq_device processor_thermal_device_pci_legacy ledtrig_audio bluetooth snd_pcm mc intel_uncore think_lmi processor_thermal_device firmware_attributes_class pcspkr platform_profile mei i2c_i801 processor_thermal_rfim thunderbolt snd_timer i2c_smbus idma64 wmi_bmof processor_thermal_mbox rfkill processor_thermal_rapl intel_rapl_common intel_soc_dts_iosf snd int3403_thermal soundcore int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad acpi_tad joydev loop zram dm_crypt i915 nvme rtsx_pci_sdmmc i2c_algo_bit drm_buddy mmc_core drm_display_helper nvme_core cec crct10dif_pclmul ucsi_acpi crc32_pclmul hid_multitouch crc32c_intel polyval_clmulni polyval_generic
[72701.194050]  ghash_clmulni_intel rtsx_pci typec_ucsi sha512_ssse3 ttm typec nvme_common i2c_hid_acpi i2c_hid video wmi pinctrl_tigerlake serio_raw ip6_tables ip_tables fuse
[72701.194056] Unloaded tainted modules: hellokernel(POE):2 [last unloaded: hellokernel(POE)]
[72701.194059] ---[ end trace 0000000000000000 ]---
[72701.194060] RIP: 0010:exc_control_protection+0xb8/0xc0
[72701.194061] Code: 48 8b 93 80 00 00 00 be fe 00 00 00 48 c7 c7 86 37 85 82 e8 1a 47 17 ff e9 7b ff ff ff 48 c7 43 50 00 00 00 00 e9 6e ff ff ff <0f> 0b 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[72701.194062] RSP: 0018:ffffb4668064fc18 EFLAGS: 00010002
[72701.194063] RAX: 0000000000000038 RBX: ffffb4668064fc38 RCX: 0000000000000000
[72701.194064] RDX: 0000000000000000 RSI: ffff99191f521540 RDI: ffff99191f521540
[72701.194065] RBP: 0000000000000003 R08: 0000000000000000 R09: ffffb4668064fac0
[72701.194065] R10: 0000000000000003 R11: ffffffff83146508 R12: 0000000000000000
[72701.194066] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[72701.194067] FS:  00007f2ff1fcb040(0000) GS:ffff99191f500000(0000) knlGS:0000000000000000
[72701.194068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[72701.194068] CR2: 0000000000417850 CR3: 000000033905e006 CR4: 0000000000f70ee0
[72701.194069] PKRU: 55555554
[72701.194070] note: stapio[31078] exited with irqs disabled
Comment 1 William Cohen 2023-08-17 15:48:11 UTC
When comparing the objdump output for /usr/lib/debug/lib/modules/6.4.10-200.fc38.x86_64/vmlinux with the dmesg "Code:" line, it appears that the ENDBR64 and the call to __fentry__ have been overwritten with multibyte nops:

From objdump -d /usr/lib/debug/lib/modules/6.4.10-200.fc38.x86_64/vmlinux:

ffffffff81206980 <kallsyms_lookup_name>:
kallsyms_lookup_name():
/usr/src/debug/kernel-6.4.10/linux-6.4.10-200.fc38.x86_64/kernel/kallsyms.c:271
ffffffff81206980:	f3 0f 1e fa          	endbr64
ffffffff81206984:	e8 27 dc e7 ff       	call   ffffffff810845b0 <__fentry__>
ffffffff81206989:	53                   	push   %rbx
ffffffff8120698a:	48 83 ec 10          	sub    $0x10,%rsp

From the dmesg output:

Code: 79 0a 48 f7 d0 48 03 05 d6 41 5b 01 c3 cc cc cc cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <66> 0f 1f 00 0f 1f 44 00 00 53 48 83 ec 10 65 48 8b 04 25 28 00 00

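Working through to match up objdump and dmesg output:

<66> 0f 1f 00                 nopw   (%rax)             # *doesn't match original code* (was endbr64)
0f 1f 44 00 00                nopl   0x0(%rax,%rax,1)   # *doesn't match original code* (was call __fentry__)
53                            push   %rbx
48 83 ec 10                   sub    $0x10,%rsp
65 48 8b 04 25 28 00 00       mov    %gs:0x28,%rax      # (instruction bytes truncated in the dmesg dump)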

Something has changed the entry for kallsyms_lookup_name.  Internal kernel calls to the function still work because those are not using indirect calls (call *%reg).
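
The byte comparison above can also be done mechanically.  The following is a minimal user-space sketch, not code from systemtap or the kernel; the names classify_entry, entry_kind, vmlinux_entry and running_entry are made up for illustration.  It assumes only the two encodings visible in this report: ENDBR64 (f3 0f 1e fa) from the objdump listing and the 4-byte nop (66 0f 1f 00) found at the trap address in the dmesg "Code:" line.

```c
#include <string.h>

/* What the first four bytes of a function entry look like. */
enum entry_kind { ENTRY_ENDBR64, ENTRY_NOP4, ENTRY_OTHER };

/* Entry bytes of kallsyms_lookup_name as listed by objdump on vmlinux:
 * endbr64 followed by the call to __fentry__. */
const unsigned char vmlinux_entry[9] = {
    0xf3, 0x0f, 0x1e, 0xfa,             /* endbr64 */
    0xe8, 0x27, 0xdc, 0xe7, 0xff        /* call __fentry__ */
};

/* Entry bytes of the running kernel, taken from the dmesg "Code:" line
 * starting at the <66> marker. */
const unsigned char running_entry[9] = {
    0x66, 0x0f, 0x1f, 0x00,             /* 4-byte nop */
    0x0f, 0x1f, 0x44, 0x00, 0x00        /* 5-byte nop */
};

/* Classify the first four bytes at p (p must point to at least 4 bytes). */
enum entry_kind classify_entry(const unsigned char *p)
{
    static const unsigned char endbr64[4] = { 0xf3, 0x0f, 0x1e, 0xfa };
    static const unsigned char nop4[4]    = { 0x66, 0x0f, 0x1f, 0x00 };

    if (memcmp(p, endbr64, sizeof endbr64) == 0)
        return ENTRY_ENDBR64;
    if (memcmp(p, nop4, sizeof nop4) == 0)
        return ENTRY_NOP4;
    return ENTRY_OTHER;
}
```

With these inputs, classify_entry(vmlinux_entry) reports ENTRY_ENDBR64 while classify_entry(running_entry) reports ENTRY_NOP4, which is the mismatch that makes the indirect call trap.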
Comment 2 William Cohen 2023-08-22 15:26:59 UTC
Created attachment 15077 [details]
A sample module demonstrating a workaround approach

flippy_ibt is a really basic module that surrounds the indirect call with an ibt_save and an ibt_restore operation.  To test it out on a machine that supports IBT:

tar xvf flippy_ibt.tar.gz
cd flippy_ibt
make -C /usr/src/kernels/$(uname -r) M=$PWD V=2 
sudo insmod flippy_ibt.ko
sudo rmmod flippy_ibt

You should see success (no trap for the missing ENDBR) in output of:

dmesg 

One can check that the kernel IBT support is actually active with the same module by installing with:

sudo insmod flippy_ibt.ko ibt_disable=0

You will see a message on the terminal about a segmentation fault, and there will be a trap for the missing ENDBR in the output of dmesg.

That
Comment 3 William Cohen 2023-08-24 19:10:13 UTC
Created attachment 15085 [details]
Partial implementation for the kallsyms_* indirect calls in runtime/sym.c

This patch addresses the indirect calls in runtime/sym.c.  It is not complete.  I searched the runtime for additional lookups and indirect calls.  Below is a list of the other variables used for indirect calls and where the indirect calls are implemented.

  kallsyms_copy_to_kernel_nofault
    magic macro in runtime/linux/loc2c-runtime.h
  kallsyms_task_user_regset_view
    magic macro in runtime/linux/loc2c-runtime.h
  kallsyms_uprobe_register
    magic macro in linux/uprobes-inode.c
  kallsyms_uprobe_unregister
    magic macro in linux/uprobes-inode.c
  kallsyms_uprobe_get_swbp_addr
    magic macro in linux/uprobes-inode.c
  kallsyms_get_mm_exe_file
    task_finder_vma.c (looks workable)
  kallsyms_task_work_add
    magic macro in stp_task_work.c
  kallsyms_task_work_cancel
    magic macro in stp_task_work.c
  kallsyms_udelay_simple
    magic macro in linux/runtime.h
  stack_trace_save_regs_fn
    stack.c
  kallsyms_wake_up_state
    magic macro in stp_utrace.c
  kallsyms_signal_wake_up_state
    magic macro in stp_utrace.c
  kallsyms_signal_wake_up
    magic macro in stp_utrace.c
  kallsyms___lock_task_sighand
    magic macro in stp_utrace.c

Some of these are implemented with a "magic macro" that just replaces the function name with an indirect call through the variable.  The macros don't deal with the function arguments.  The arguments might change between different versions of the kernel, and the macros avoid those details by just changing the function name, which might simplify code generation.  However, wrappers like the ones in this initial patch would need to include the arguments and deal with any changes in them (as with kallsyms_lookup_name).
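
The difference between the two styles can be sketched in user space.  This is an illustration only, not the code in the attached patch: every name below (demo_target, kallsyms_demo_target, demo_ibt_save, demo_ibt_restore and so on) is hypothetical, and the IBT state is simulated with a plain flag rather than the real ibt_save and ibt_restore operations mentioned in comment 2.

```c
#include <stdbool.h>

/* Simulated IBT state.  In this sketch the flag stands in for the
 * hardware check that is applied to indirect calls. */
static bool demo_ibt_enabled = true;

/* Hypothetical stand-in for ibt_save: turn tracking off, return old state. */
static bool demo_ibt_save(void)
{
    bool saved = demo_ibt_enabled;
    demo_ibt_enabled = false;
    return saved;
}

/* Hypothetical stand-in for ibt_restore: put the old state back. */
static void demo_ibt_restore(bool saved)
{
    demo_ibt_enabled = saved;
}

/* Stand-in for a kernel function whose ENDBR has been removed.  It
 * returns -1 to simulate the trap when called while tracking is on. */
static int demo_target(int a, int b)
{
    if (demo_ibt_enabled)
        return -1;              /* simulated "Missing ENDBR" trap */
    return a + b;
}

/* Function pointer filled in at runtime, like the kallsyms_* variables. */
static int (*kallsyms_demo_target)(int, int) = demo_target;

/* "Magic macro" style: only the name is substituted, so the macro never
 * needs to know the argument list, but it also cannot wrap the call. */
#define demo_target_magic (*kallsyms_demo_target)

/* Wrapper style: the argument list and return type must be spelled out,
 * which is what lets the call be bracketed by save and restore. */
static int demo_target_wrapped(int a, int b)
{
    bool saved = demo_ibt_save();
    int ret = (*kallsyms_demo_target)(a, b);

    demo_ibt_restore(saved);
    return ret;
}
```

In this simulation demo_target_magic(2, 3) hits the simulated trap and returns -1, while demo_target_wrapped(2, 3) returns 5 and leaves the simulated IBT state enabled afterwards.  The cost is that each wrapper has to track the target's signature across kernel versions.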

Another concern is that the IBT wrapper is going to slow operations down.  This might be noticeable for kallsyms_copy_to_kernel_nofault.
Comment 4 William Cohen 2023-08-27 23:40:00 UTC
Created attachment 15088 [details]
This patch works well enough on Intel IBT machine to allow testing of the examples

This is the updated patch that wraps the indirect calls to allow them to work on IBT-enabled machines.  I was able to run the systemtap testsuite with the patch applied:

sudo make installcheck RUNTESTFLAGS="-debug systemtap.examples/check.exp"

The tests of the examples ran to completion and had the following number of passes and failures:

		=== systemtap Summary ===

# of expected passes		364
# of unexpected failures	15
# of untested testcases		29
Comment 5 William Cohen 2023-08-29 14:09:14 UTC
The 2023-08-27 patch has been tried out on aarch64 f38 and x86 rhel8/9, and systemtap also works with the patch installed on those systems.
Comment 6 William Cohen 2023-08-29 15:53:24 UTC
The following commit has been added to the upstream systemtap git repo to address the issue:

commit f70b8278cf39848c75a8bdcf4a41d7463422b666 (HEAD -> master, origin/master, origin/HEAD)
Author: William Cohen <wcohen@redhat.com>
Date:   Tue Aug 29 11:33:41 2023 -0400

    PR30777: Allow systemtap to work on Intel machines with IBT enabled
    
    Intel 11th gen processors include Indirect Branch Target (IBT)
    support.  Systemtap needs to take some additional steps to work in
    this environment.  For kernels that do not have CONFIG_X86_KERNEL_IBT
    set these steps are turned into NOPS.