Bug 23031 - [arm] SIGILL in memchr
Summary: [arm] SIGILL in memchr
Status: RESOLVED INVALID
Alias: None
Product: glibc
Classification: Unclassified
Component: libc (show other bugs)
Version: 2.26
: P2 normal
Target Milestone: 2.28
Assignee: Adhemerval Zanella
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-05 08:18 UTC by Jef Driesen
Modified: 2018-04-16 11:23 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2018-04-11 00:00:00
fweimer: security-


Attachments
Yocto script used to execute the configure script (2.67 KB, application/x-shellscript)
2018-04-05 08:18 UTC, Jef Driesen
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jef Driesen 2018-04-05 08:18:24 UTC
Created attachment 10934 [details]
Yocto script used to execute the configure script

I build a Yocto image for an ARM board with a Freescale i.MX6 Dualcore CPU. After upgrading from Yocto Pyro (glibc 2.25, commit db0242e3023436757bbc7c488a779e6e3343db04) to Yocto Rocko (glibc 2.26, commit d300041c533a3d837c9f37a099bcc95466860e98), my board doesn't boot anymore. The problem appears to be a crash in sysvinit. By booting a working image and chroot'ing into the bad image, I discovered the problem is located in the new glibc. Several application are crashing with almost identical stracktraces:

# bash

Core was generated by `/bin/sh -i'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x0006179e in show_shell_version ()
(gdb) bt
#0  0x0006179e in show_shell_version ()
#1  <signal handler called>
#2  memchr () at ../sysdeps/arm/armv7/multiarch/memchr_impl.S:91
#3  0x76e93198 in _IO_getdelim (lineptr=0x7ea90910, lineptr@entry=0x1, n=n@entry=0x7ea90914, delimiter=delimiter@entry=10, fp=fp@entry=0x197a8a8) at /usr/src/debug/glibc/2.26-r0/git/libio/iogetdelim.c:85
#4  0x76e8f87c in __getline (lineptr=lineptr@entry=0x1, n=n@entry=0x7ea90914, stream=stream@entry=0x197a8a8) at /usr/src/debug/glibc/2.26-r0/git/stdio-common/getline.c:32
#5  0x76f15810 in nss_parse_file (fname=0x76f432c9 "/etc/nsswitch.conf") at /usr/src/debug/glibc/2.26-r0/git/nss/nsswitch.c:565
#6  __GI___nss_database_lookup (database=0x76f15810 <__GI___nss_database_lookup+508> "", alternate_name=0x0, defconfig=0x76f432f9 "compat [NOTFOUND=return] files", ni=0x76f5e4b0 <__nss_passwd_database>)
    at /usr/src/debug/glibc/2.26-r0/git/nss/nsswitch.c:125
#7  0x76f16e9c in __GI___nss_passwd_lookup2 (ni=0x7ea90968, ni@entry=0x7ea90960, fct_name=0x76f41201 "getpwuid_r", fct2_name=0x0, fctp=0x7ea9096c, fctp@entry=0x7ea90964) at /usr/src/debug/glibc/2.26-r0/git/nss/XXX-lookup.c:69
#8  0x76ec9428 in __getpwuid_r (uid=1995814948, uid@entry=0, resbuf=resbuf@entry=0x76f5cb1c <resbuf.13126>, buffer=0x197a4a0 "\370\273\365v\370\273\365v\230\244\227\001\230\244\227\001\366\t\f", buflen=1024, result=0x7ea909a4,
    result@entry=0x7ea9099c) at /usr/src/debug/glibc/2.26-r0/git/nss/getXXbyYY_r.c:270
#9  0x76ec8d4c in getpwuid (uid=0) at /usr/src/debug/glibc/2.26-r0/git/nss/getXXbyYY.c:134
#10 0x0002d8b8 in get_current_user_info ()
#11 0x00028c2c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

# dmesg

Core was generated by `dmesg'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x76ebee7a in strerror_l (errnum=11, loc=0x199d5) at /usr/src/debug/glibc/2.25-r0/git/string/strerror_l.c:56
(gdb) bt
#0  memchr () at ../sysdeps/arm/armv7/multiarch/memchr_impl.S:91
#1  0x76ebaa04 in __GI___memmem (haystack_start=0x99a810, haystack_len=22, needle_start=0x199d5, needle_len=11) at /usr/src/debug/glibc/2.26-r0/git/string/memmem.c:66
#2  0x00013fb4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

# python

Core was generated by `python'.
Program terminated with signal SIGILL, Illegal instruction.

(gdb) bt
#0  memchr () at ../sysdeps/arm/armv7/multiarch/memchr_impl.S:91
#1  0x76d578c0 in two_way_short_needle (needle_len=<optimized out>, needle=0x76f1058d "set tabsize=", haystack_len=1, haystack=0x7ed0731e "sitecustomize.py (C) 2002-2008 Michael 'Mickey' Lauer <mlauer@van")
    at /usr/src/debug/glibc/2.26-r0/git/string/str-two-way.h:293
#2  __GI_strstr (haystack_start=<optimized out>, needle_start=<optimized out>) at /usr/src/debug/glibc/2.26-r0/git/string/strstr.c:84
#3  0x76e327e4 in PyTokenizer_Get () from /mnt/usr/lib/libpython2.7.so.1.0
Backtrace stopped: Cannot access memory at address 0x13bef0


There is only one relevant change in this area, and that's commit f8f72bc0c3da8ba039e6a1ed670ca576120b1f85 ("[ARM] Optimise memchr for NEON-enabled processors"). If I simply revert that commit, then the crash disappears. So that confirms the problem was introduced in that specific commit. Unfortunately I don't have any experience with assembler to debug this further, but of course I'll help as much as I can to get this fixed.

In the master branch there is one extra change, with commit a1a638dda91ed7739a066477908511e53840603b ("arm: Implement memchr ifunc selection in C"). But I also tested with glibc 2.27 (commit 23158b08a0908f381459f273a984c6fd328363cb), and the crash is still present.


Configure options (see also the attached Yocto configure script for all environment variables)

--build=x86_64-linux
--host=arm-poky-linux-gnueabi
--target=arm-poky-linux-gnueabi
--prefix=/usr
--exec_prefix=/usr
--bindir=/usr/bin
--sbindir=/usr/sbin
--libexecdir=/usr/libexec
--datadir=/usr/share
--sysconfdir=/etc
--sharedstatedir=/com
--localstatedir=/var
--libdir=/usr/lib
--includedir=/usr/include
--oldincludedir=/usr/include
--infodir=/usr/share/info
--mandir=/usr/share/man
--disable-silent-rules
--disable-dependency-tracking
--with-libtool-sysroot=/home/jdi/develop/yocto/build/tmp/work/armv7ahf-neon-poky-linux-gnueabi/glibc/2.27-r0/recipe-sysroot
--enable-kernel=3.2.0
--without-cvs
--disable-profile
--disable-debug
--without-gd
--enable-clocale=gnu
--enable-add-ons=libidn
--with-headers=/home/jdi/develop/yocto/build/tmp/work/armv7ahf-neon-poky-linux-gnueabi/glibc/2.27-r0/recipe-sysroot/usr/include
--without-selinux
--enable-obsolete-rpc
--enable-obsolete-nsl
--enable-tunables
--enable-bind-now
--enable-stack-protector=strong
--enable-stackguard-randomization
--enable-nscd
--disable-werror
--disable-static

$ arm-poky-linux-gnueabi-gcc --version
arm-poky-linux-gnueabi-gcc (GCC) 7.3.0

$ arm-poky-linux-gnueabi-ld --version
GNU ld (GNU Binutils) 2.29.1.20170915

# uname -a
Linux FP00112A21E7A7 4.7.0-gb6c623e #1 SMP Tue Mar 27 16:03:52 CEST 2018 armv7l GNU/Linux

# cat /proc/cpuinfo 
processor	: 0
model name	: ARMv7 Processor rev 10 (v7l)
BogoMIPS	: 3.00
Features	: half fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0xc09
CPU revision	: 10

processor	: 1
model name	: ARMv7 Processor rev 10 (v7l)
BogoMIPS	: 3.00
Features	: half fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0xc09
CPU revision	: 10

Hardware	: Freescale i.MX6 Quad/DualLite (Device Tree)
Revision	: 0000
Serial		: 0000000000000000
Comment 1 Andreas Schwab 2018-04-05 09:14:39 UTC
memchr_impl.S:91:	vdup.8	vrepchr, chrin	/* Duplicate char across all lanes. */

Does your device not support Advanced SIMD and floating-point?
Comment 2 Jef Driesen 2018-04-06 06:51:05 UTC
The Freescale i.MX6 Duallite is a cortex-a9 cpu, which supports vfp and neon. The /proc/cpuinfo confirms this:

# cat /proc/cpuinfo | grep Features
Features	: half fastmult vfp edsp neon vfpv3 tls vfpd32
Comment 3 Adhemerval Zanella 2018-04-06 13:57:12 UTC
Did you enable NEON on the kernel (CONFIG_NEON)? GLIBC 2.26 packs with an optimized memchr for NEON-enabled processors (f8f72bc0c3) and the logic to enable is as:

  - if compiler supports NEON as default (by defining the builtin __ARM_NEON__) then the memchr with neon support will be used as default.

  - otherwise memchr will be a IFUNC which will select either a NEON or NONEON depending of the hwcap support.

If your kernel is indeed not configured to provide neon support, I think a better option would to enable memchr as default.
Comment 4 Jef Driesen 2018-04-06 14:12:22 UTC
Our kernel defconfig file contains:

CONFIG_VFP=y
CONFIG_VFPv3=y
CONFIG_NEON=y
CONFIG_KERNEL_MODE_NEON=y

So unless it gets disabled elsewhere (*), the kernel is build with neon enabled.

(*) I'll double check the build procedure to rule this out. Unfortunately CONFIG_IKCONFIG is not enabled.
Comment 5 Adhemerval Zanella 2018-04-06 20:17:26 UTC
The i.MX6 default kernel config (arch/arm/configs/imx_v6_v7_defconfig) indeed have CONFIG_NEON set as default, but I can't anything different that would trigger a SIGILL for a 'vdup.8'.

Could you check if NEON instructions are indeed running by checking with this simple tests:

---
#include <stdio.h>
#include <stdlib.h>
#include <arm_neon.h>

int main (int argc, char *argv[])
{
  int v = atoi (argv[1]);

  uint8x8_t x;
  asm volatile ("vdup.8 %0,%1\n" :
                "=w" (x) :
                "r" (v));

  for (int i=0; i<8; i++)
    printf("0x%02x, ", x[i]);
  printf ("\n");
  return 0;
}
---
Comment 6 Jef Driesen 2018-04-09 07:41:05 UTC
The test application works fine:

# ./neon 1
0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,

Compiled with:

$ $CC -o neon neon.c
$ echo $CC
arm-poky-linux-gnueabi-gcc -march=armv7-a -marm -mfpu=neon -mfloat-abi=hard --sysroot=/home/jdi/develop/yocto/build/tmp/work/armv7ahf-neon-poky-linux-gnueabi/meta-ide-support/1.0-r3/recipe-sysroot
Comment 7 Adhemerval Zanella 2018-04-09 17:25:59 UTC
Right, so neon does seems to be working.  The only thing I can think of is some unexpected iteration between thumb and neon (as reported one some cases [1]). Could you check if the patch below (again master) which disables thumb for neon memchr works on your environment:

---
diff --git a/sysdeps/arm/armv7/multiarch/memchr_neon.S b/sysdeps/arm/armv7/multiarch/memchr_neon.S
index 1b2ae75..dabf354 100644
--- a/sysdeps/arm/armv7/multiarch/memchr_neon.S
+++ b/sysdeps/arm/armv7/multiarch/memchr_neon.S
@@ -68,11 +68,7 @@
  * allows to identify exactly which byte has matched.
  */
 
-#ifndef NO_THUMB
-       .thumb_func
-#else
        .arm
-#endif
        .p2align 4,,15
 
 ENTRY(memchr)
@@ -132,12 +128,8 @@ ENTRY(memchr)
        /* The first block can also be the last */
        bls             .Lmasklast
        /* Have we found something already? */
-#ifndef NO_THUMB
-       cbnz            synd, .Ltail
-#else
        cmp             synd, #0
        bne             .Ltail
-#endif
 
 
 .Lloopintro:
@@ -176,17 +168,9 @@ ENTRY(memchr)
        vpadd.i8        vdata0_0, vdata0_0, vdata1_0
        vpadd.i8        vdata0_0, vdata0_0, vdata0_0
        vmov            synd, vdata0_0[0]
-#ifndef NO_THUMB
-       cbz             synd, .Lnotfound
-       bhi             .Ltail  /* Uses the condition code from
-                                  subs cntin, cntin, #32 above.  */
-#else
+       bhi             .Ltail
        cmp             synd, #0
        beq             .Lnotfound
-       cmp             cntin, #0
-       bhi             .Ltail
-#endif
-
 
 .Lmasklast:
        /* Clear the (-cntin) upper bits to avoid out-of-bounds matches. */
---

[1] https://community.nxp.com/thread/394789
Comment 8 Jef Driesen 2018-04-10 11:24:29 UTC
I tested a build (v2.27 instead of master) with your patch applied, and the problem is gone now! So it looks like you found the problem!
Comment 9 Peter Maydell 2018-04-10 14:29:35 UTC
Your /proc/cpuinfo Features line:
"Features	: half fastmult vfp edsp neon vfpv3 tls vfpd32"
does not include "thumb".

This probably means that your kernel was built with CONFIG_ARM_THUMB disabled. As the kernel documentation for that config option says:
"If this option is disabled, and you run userspace that switches to Thumb mode, signal handling will not work correctly, resulting in segmentation faults or illegal instruction aborts."

(More specifically, my guess about what is probably happening is that the first insn in memchr is the first one in the process to use Neon/FP, and so the kernel has the FPU disabled (for lazy context switching purposes). So you get an UNDEF on it, which the kernel is supposed to handle by enabling the FPU and restarting the insn. But because the kernel doesn't have Thumb userspace support compiled in, it doesn't recognise this as a Thumb FP insn, so it doesn't do that, and just sends a SIGILL instead.)

Can you check your kernel config and make sure it has CONFIG_ARM_THUMB set?
Comment 10 Jef Driesen 2018-04-10 14:43:13 UTC
The CONFIG_ARM_THUMB option is indeed not enabled in our kernel:

$ grep CONFIG_ARM_THUMB .config
# CONFIG_ARM_THUMB is not set
# CONFIG_ARM_THUMBEE is not set
Comment 11 Adhemerval Zanella 2018-04-10 15:28:48 UTC
The problem is the usage of the bogus NO_THUMB instead of the compiler provided __thumb__ preprocessor (as for other places).  Unfortunately the optimized neon memchr also has a bug in ARM code path.

I will prepare a patch to fix it upstream.
Comment 12 Peter Maydell 2018-04-10 15:39:12 UTC
I would recommend Jef fixes his kernel config to enable CONFIG_ARM_THUMB as well. It doesn't really cost anything, and it means that binaries that happen to use Thumb insns will just work rather than mysteriously and confusingly misbehaving.
Comment 13 Adhemerval Zanella 2018-04-11 21:18:36 UTC
I sent a fix upstream for review [1].  It fixes both armv7 memchr and strcmp build to not use thumb instructions if compilers is not configured to use emit them (-marm).

[1] https://sourceware.org/ml/libc-alpha/2018-04/msg00198.html
Comment 14 Jef Driesen 2018-04-12 14:54:05 UTC
Thanks for your support! I would never have found the root cause without your help! This was a long but interesting journey, going from a board that failed to boot, to suspecting some bug in glibc and finally discovering a problem in the glibc arm assembler code.

I'll probably backport your patches to glibc 2.26 for our builds.
Comment 15 Adhemerval Zanella 2018-04-13 20:16:02 UTC
The consensus on discussion upstream from thread noted in comment #13 is CONFIG_ARM_THUMB is expected to be set for CONFIG_CPU_V7, i.e, a kernel configured to *disable* thumb for ARMv7 is not supported with glibc.  As Phil Blundell has put the cost of enabling CONFIG_ARM_THUMB in your kernel if you're building for ARMv7 anyway is so close to zero as to be completely negligible. 
There is simply no rational reason to not have it included. 

I will close this as NOTABUG, however I would like to check with you why exactly you are disabling thumb in your kernel config.
Comment 16 Jef Driesen 2018-04-16 07:39:39 UTC
I'm not really sure why it's disabled. It looks like it has always been that way. Probably because we were not aware that mixing arm and thumb mode is possible, and hence we assumed it didn't need to be enabled? (And until now that was the case, because everything gets compiled with -marm.)

Another reason could be that we use the same kernel for several boards with a slightly different cpu. But they are all ARMV7 (cortex-a7/8/9), so I think enabling CONFIG_ARM_THUMB should be fine for us.
Comment 17 Adhemerval Zanella 2018-04-16 11:23:46 UTC
Close as invalid as per comment #15.