Context: OS: Linux saturn2-sfu-eng 4.14.172.saturn2-sfu-r2.2.1.3 #1 Sat May 11 08:47:16 UTC 2024 mips GNU/Linux; arch: MIPS 32-bit.

We use fork/execv to load an ELF file named MecMgr, and occasionally its main function is never entered. Tracing the startup path, we found that the process gets stuck in the __libc_csu_init function: it neither returns nor makes progress. If we add printf calls to that function to show where it is when the error occurs, the issue can no longer be reproduced.
Sorry, could you describe what you are doing in more detail? Thanks. We have not received a report of a similar issue, as far as I can recall.
(In reply to Florian Weimer from comment #1)
> Sorry, could you describe what you are doing in more detail? Thanks.
>
> We have not received a report of a similar issue, as far as I can recall.

Our program starts another process with the following code:

    if ((pid = fork()) < 0)
    {
        ASSERT(0);
    }
    // In child process
    if (0 == pid)
    {
        setpriority(0, 0, (INT32)priority - 20);
        sprintf_s(priStr, sizeof(priStr), "%d", priority);
        sprintf_s(stackStr, sizeof(stackStr), "%d", stackSize);
        paraList[0] = (CHAR*)execFileName;
        paraList[1] = "-p";
        paraList[2] = priStr;
        paraList[3] = "-s";
        paraList[4] = stackStr;
        paraList[5] = NULL;
        execve(execFileName, paraList, env);
    }

Occasionally the new process cannot be found afterwards. Tracing the startup path shows that execve has transferred control to ld.so, and that LIBC_START_MAIN has reached the init function (here, __libc_csu_init) and stopped there: it neither returns nor makes progress. We conclude this because the "initialize program:" message is printed but "transferring control:" is not:

    364:
    364: calling init: /lib/libm.so.6
    364:
    364:
    364: --ljh--initialize program start: /usr/bin/MecMgr
    364: Starting Application: 0x00003000, /usr/bin/MecMgr................

Next, we added some printf calls to __libc_csu_init to find out where it is when the error occurs, but unfortunately the issue could not be reproduced with them in place. We also tried printing the contents of __libc_csu_init from LIBC_START_MAIN before it is executed, and again the issue could not be reproduced. It seems that any modification in or before __libc_csu_init makes the issue disappear.
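One way to narrow down why the child "cannot be found" is to have the parent wait for it and decode the wait status, which distinguishes a normal exit from termination by a signal. The sketch below is an assumption about how this could be done, not the reporter's code; spawn_and_wait and describe_status are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, execve the given program, and wait for it.
 * Returns the raw wait status, or -1 if fork or waitpid failed. */
int spawn_and_wait(const char *path, char *const argv[], char *const envp[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execve(path, argv, envp);
        /* Only reached if execve itself failed. */
        _exit(127);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}

/* Hypothetical helper: turn a wait status into a short description. */
void describe_status(int status, char *buf, size_t len)
{
    if (WIFSIGNALED(status))
        snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    else if (WIFEXITED(status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    else
        snprintf(buf, len, "unrecognized status 0x%x", (unsigned)status);
}
```

A child that dies during startup would show up here as "killed by signal N" rather than silently vanishing from ps and /proc.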
If the new process is hanging, it should be easy enough to attach GDB to it with “gdb -p PID”, and figure out what is going on.
(In reply to Florian Weimer from comment #3)
> If the new process is hanging, it should be easy enough to attach GDB it
> with “gdb -p PID”, and figure out what is going on.

Thanks for your advice. Unfortunately, GDB has been removed from our image because of size limits. When the issue happens, the process is not listed by the "ps" command and is not present under the /proc directory either. Could you kindly recommend a third-party tool for inspecting process memory that is small and easy to obtain? Thanks a lot.
If the process does not show up on /proc, it doesn't exist, so this must be something else.
Further investigation shows that when __libc_csu_init ran to its end and returned to its caller, __libc_start_main, through the instruction jr ra, the CPU raised an RI (reserved instruction) exception. The kernel then sent a SIGILL signal to the process and stopped it.

(gdb) info all-registers
          zero       at       v0       v1       a0       a1       a2       a3
 R0   00000000 00000001 00000000 00000000 556c3084 00000000 7f7bb9bc 00000000
            t0       t1       t2       t3       t4       t5       t6       t7
 R8   00000063 00000000 00000000 8446bd30 8572eeb0 77568520 00000000 7755f0bc
            s0       s1       s2       s3       s4       s5       s6       s7
 R16  00000000 556a0f60 00000000 7fd270e4 55602168 77f3d000 77fabac8 bc8958d1
            t8       t9       k0       k1       gp       sp       s8       ra
 R24  77f5b4a0 00000000 00000000 00000000 556cb080 7f7bb898 55769190 773eb198
            sr       lo       hi      bad    cause       pc
      00009c0c 00000000 00000000 556a1000 00000028 556a1000
          fcsr      fir
      00000000 00000000

0007ef60 <__libc_csu_init@@Base>:
   7ef60: 3c1c0003  lui   gp,0x3
   7ef64: 279ca120  addiu gp,gp,-24288
   7ef68: 0399e021  addu  gp,gp,t9
   7ef6c: 27bdffc8  addiu sp,sp,-56
   7ef70: afbf0034  sw    ra,52(sp)
   7ef74: afb50030  sw    s5,48(sp)
   7ef78: afb4002c  sw    s4,44(sp)
   7ef7c: afb30028  sw    s3,40(sp)
   7ef80: afb20024  sw    s2,36(sp)
   7ef84: afb10020  sw    s1,32(sp)
   7ef88: afb0001c  sw    s0,28(sp)
   7ef8c: 00809825  move  s3,a0
   7ef90: 8f998c1c  lw    t9,-29668(gp)
   7ef94: 00a0a025  move  s4,a1
   7ef98: afbc0010  sw    gp,16(sp)
   7ef9c: 0320f809  jalr  t9
   7efa0: 00c0a825  move  s5,a2
   7efa4: 8fbc0010  lw    gp,16(sp)
   7efa8: 8f908c20  lw    s0,-29664(gp)
   7efac: 8f928c24  lw    s2,-29660(gp)
   7efb0: 02509023  subu  s2,s2,s0
   7efb4: 00129083  sra   s2,s2,0x2
   7efb8: 1240000a  beqz  s2,7efe4 <__libc_csu_init@@Base+0x84>
   7efbc: 00008825  move  s1,zero
   7efc0: 8e190000  lw    t9,0(s0)
   7efc4: 02a03025  move  a2,s5
   7efc8: 02802825  move  a1,s4
   7efcc: 26310001  addiu s1,s1,1
   7efd0: 02602025  move  a0,s3
   7efd4: 0320f809  jalr  t9
   7efd8: 26100004  addiu s0,s0,4
   7efdc: 1651fff8  bne   s2,s1,7efc0 <__libc_csu_init@@Base+0x60>
   7efe0: 00000000  nop
   7efe4: 8fbf0034  lw    ra,52(sp)
   7efe8: 8fb50030  lw    s5,48(sp)
   7efec: 8fb4002c  lw    s4,44(sp)
   7eff0: 8fb30028  lw    s3,40(sp)
   7eff4: 8fb20024  lw    s2,36(sp)
   7eff8: 8fb10020  lw    s1,32(sp)
   7effc: 8fb0001c  lw    s0,28(sp)
   7f000: 03e00008  jr    ra          // the RI exception is raised here
   7f004: 27bd0038  addiu sp,sp,56
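The values in this dump can be cross-checked by hand. The sketch below, which is an illustration and not part of the report, extracts the ExcCode field (bits 6..2) from the cause register and checks whether an instruction word encodes "jr rs". The helper names mips_exc_code, mips_jr_source and mips_gpr_name are hypothetical; the register names follow the gdb dump above.

```c
#include <assert.h>
#include <stdint.h>

/* ExcCode occupies bits 6..2 of the MIPS32 CP0 Cause register. */
unsigned mips_exc_code(uint32_t cause)
{
    return (cause >> 2) & 0x1fu;
}

/* If insn encodes "jr rs" (SPECIAL opcode 0, function field 0x08, rt and rd
 * zero), return the source register number; otherwise return -1.
 * The hint field (bits 10..6) is deliberately ignored. */
int mips_jr_source(uint32_t insn)
{
    if ((insn >> 26) != 0)
        return -1;
    if ((insn & 0x3fu) != 0x08u)
        return -1;
    if (((insn >> 11) & 0x3ffu) != 0)
        return -1;
    return (int)((insn >> 21) & 0x1fu);
}

/* General-purpose register names in the order used by the gdb dump. */
const char *mips_gpr_name(unsigned reg)
{
    static const char *const names[32] = {
        "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
        "t0",   "t1", "t2", "t3", "t4", "t5", "t6", "t7",
        "s0",   "s1", "s2", "s3", "s4", "s5", "s6", "s7",
        "t8",   "t9", "k0", "k1", "gp", "sp", "s8", "ra",
    };
    assert(reg < 32);
    return names[reg];
}
```

With the values above, mips_exc_code(0x00000028) gives 10, the Reserved Instruction code, and mips_jr_source(0x03e00008) gives 31, which mips_gpr_name maps to ra. The word at 7f000 therefore decodes as a well-formed jr ra, in line with the disassembly.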
This could be stack corruption by an ELF constructor, overwriting the stored return address.
(In reply to Florian Weimer from comment #7)
> This could be stack corruption by an ELF constructor, overwriting the stored
> return address.

When the exception occurred, the address stored in ra and the instruction stored at that address were the same as in a normal run. The return address is 0x773b65e4, and the word stored at 0x773b65e4 is 0x8fbc0010 (lw gp,16(sp)), a valid MIPS32 instruction. We also dumped the in-memory contents of __libc_csu_init, and nothing is wrong with them.

[13:12:220][ 44.775721] do_ri:1219 send sigill(st:-1 -1 -1 -1 -1 -1) cpu(1 1 1 1)
[13:12:220][ 44.782016] do_ri:1241 send sigill status:-1 cause:0x00000028 badvaddr:0x55664000 cp-st:0x00009c0c lo-0x04674ed1 hi-0x00000002 last-0x00000000
[13:12:220][ 44.794849] 32Reg:00000000 00000001 00000000 00000000 55686084 00000000 7fd651dc ffffffff 7752ce50 7752ce50 00000000 00000000 7fd64f58 0000000b 00000000 77f2f000
[13:12:221][ 44.794849] 55663f60 00000000 00000000 7f8b8b2c 555c4124 77f05000 77f73ac8 b17eee51 00000000 00000000 00000010 00000000 5568e080 7fd65020 555d6190 773b65e4
[13:12:235][ 44.823122] do_ri:1252 send sigill epc:0x55664000 r31:0x773b65e4(0x8fbc0010 0x8f8294e8 0x8c5400c8 0x16800032)
[13:12:235]sno:4 Fault address:0 s-code:128 eno:0
[13:12:457]/lib/libc.so.6(+0x3179a160) [0x773b6160]
[13:12:458]linux-vdso.so.1(+0x920) [0x7ff67920]
[13:12:458]/usr/bin/MecMgr(__libc_csu_init+0xa2) [0x55664002]
[13:12:459]txt(0x55664002):0x08 00 e0 03 38 00 bd 27 0800e003 00000000
[13:12:460]00000000 3400bf8f 3000b58f 2c00b48f
[13:12:460]2800b38f 2400b28f 2000b18f 1c00b08f
[13:12:461]start address:0x55663f04
[13:12:462]00:03e00008 00a21023 0082102b 14400007
[13:12:462]01:8f838c14 00042602 24050008 00642021
[13:12:462]02:90820000 03e00008 00a21023 00042402
[13:12:463]03:24050010 00642021 90820000 03e00008
[13:12:463]04:00a21023 00042202 24050018 00642021
[13:12:464]05:90820000 03e00008 00a21023 3c1c0003
[13:12:464]06:279ca120 0399e021 27bdffc8 afbf0034
[13:12:465]07:afb50030 afb4002c afb30028 afb20024
[13:12:465]08:afb10020 afb0001c 00809825 8f998c1c
[13:12:466]09:00a0a025 afbc0010 0320f809 00c0a825
[13:12:466]10:8fbc0010 8f908c20 8f928c24 02509023
[13:12:467]11:00129083 1240000a 00008825 8e190000
[13:12:467]12:02a03025 02802825 26310001 02602025
[13:12:468]13:0320f809 26100004 1651fff8 00000000
[13:12:469]14:8fbf0034 8fb50030 8fb4002c 8fb30028
[13:12:469]15:8fb20024 8fb10020 8fb0001c 03e00008
[13:12:470]16:27bd0038 03e00008 00000000 8f998010
[13:12:470]17:03e07825 0320f809 241803dc 8f998010
[13:12:476]18:03e07825 0320f809 241803db 8f998010
[13:12:477]19:03e07825 0320f809 241803da 8f998010
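To relate the rows of the word dump to addresses, the following sketch may help. It assumes, from the layout of the dump, that each numbered row holds four consecutive 32-bit words beginning at "start address". The helper names dump_word_address and page_offset are hypothetical, and the 4 KiB page size used in the note afterwards is an assumption that the thread does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the word at (row, col) in a dump that prints four 32-bit
 * words per row, starting at address `start`. */
uint32_t dump_word_address(uint32_t start, unsigned row, unsigned col)
{
    assert(col < 4);
    return start + (row * 4u + col) * 4u;
}

/* Offset of addr within its page; page_size must be a power of two. */
uint32_t page_offset(uint32_t addr, uint32_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1u)) == 0);
    return addr & (page_size - 1u);
}
```

With start address 0x55663f04, row 15 column 3 (the word 03e00008, jr ra) maps to 0x55664000, which matches the reported epc. Row 05 column 3 (3c1c0003, the first instruction of __libc_csu_init) maps to 0x55663f60, which is 0xa0 below epc. Under the 4 KiB assumption, page_offset(0x55664000, 4096) is 0, so the faulting word would be the first word of a page. This is an observation about the numbers only, not a diagnosis.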
#################
[13:12:458]/usr/bin/MecMgr(__libc_csu_init+0xa2) [0x55664002]
##################

The call stack printed by glibc's __backtrace function looks abnormal: the instruction address (offset 0xa2) is not aligned, whereas the value stored in the register is correct (offset 0xa0). We ignored this clue on the assumption that the register value is right, and did not investigate it further.
Hi fweimer, what is your opinion? What would you suggest we do? Nothing seems wrong with the "illegal instruction":

    jr ra
    addiu sp,sp,56

The following instruction (addiu sp,sp,56) is also perfectly valid, even taking the branch/jump delay slot into account. What should our next step be? Could you kindly help us with this? Thanks a lot.
It seems to be an issue tied to the MIPS architecture, along with the kernel used. This kind of problem is really hard to debug without prior knowledge of the architecture or access to the hardware itself (for instance, see BZ 31394, a SPARC issue that is proving really hard to debug). Is this issue reproducible with qemu-system? If so, it would be easier to check; otherwise, I think you will need to figure out why your MIPS box is trapping on what seems to be a valid instruction.