31732 – ld.so is trapped in __libc_csu_init function, no return or forward when load a normal ELF process

Status:	RESOLVED INVALID

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	dynamic-link
Version:	2.30

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2024-05-13 03:06 UTC by Liao Zhicai
Modified:	2024-07-16 12:39 UTC
CC List:	2 users

Last reconfirmed:	2024-05-14 00:00:00

Flags:	fweimer: security-

Description Liao Zhicai 2024-05-13 03:06:54 UTC

context:
OS:Linux saturn2-sfu-eng 4.14.172.saturn2-sfu-r2.2.1.3 #1 Sat May 11 08:47:16 UTC 2024 mips GNU/Linux
arch:mips 32

   We use fork/execve to load an ELF executable named MecMgr, and occasionally its main function is never entered. Tracing the startup path shows that the process is stuck in the __libc_csu_init function, neither returning nor making progress. If we add printf statements to that function to show where it is when the error occurs, the issue can no longer be reproduced.

Comment 1 Florian Weimer 2024-05-14 09:58:13 UTC

Sorry, could you describe what you are doing in more detail? Thanks.

We have not received a report of a similar issue, as far as I can recall.

Comment 2 Liao Zhicai 2024-05-16 03:12:07 UTC

(In reply to Florian Weimer from comment #1)
> Sorry, could you describe what you are doing in more detail? Thanks.
> 
> We have not received a report of a similar issue, as far as I can recall.

We try to launch another process from our program, with code as follows:
    if ((pid = fork()) < 0) 
    {
        ASSERT(0);
    }

    // In child process
    if (0 == pid) 
    {
        setpriority(0, 0, (INT32)priority - 20);
        sprintf_s(priStr, sizeof(priStr), "%d", priority);
        sprintf_s(stackStr, sizeof(stackStr), "%d", stackSize);
        paraList[0] = (CHAR*)execFileName;
        paraList[1] = "-p";
        paraList[2] = priStr;
        paraList[3] = "-s";
        paraList[4] = stackStr;
        paraList[5] = NULL;
        execve(execFileName, paraList, env);
    }
As a result, the new process occasionally cannot be found. Tracing the startup path, we found that execve has transferred control to ld.so, and that LIBC_START_MAIN has reached the init function (here init is __libc_csu_init) and is stuck there, neither returning nor making progress: we see the output message "initialize program:" but no "transferring control:", as follows:
       364:	
       364:	calling init: /lib/libm.so.6
       364:	
       364:	
       364:	--ljh--initialize program start: /usr/bin/MecMgr
       364:	
Starting Application: 0x00003000, /usr/bin/MecMgr................

Next, we added some printf statements to __libc_csu_init to find out where it is when the error occurs, but unfortunately the issue could no longer be reproduced.

We also tried printing the contents of the function __libc_csu_init in LIBC_START_MAIN before it is executed; again, the issue could no longer be reproduced.

It seems that if we make any modification in or before the function __libc_csu_init, the issue disappears.
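
A minimal sketch, not the reporter's actual code, of how a parent could record how such a child ends. It assumes the parent is allowed to block in waitpid; the names spawn_and_wait and struct child_result are hypothetical. A child killed by a signal such as SIGILL disappears from "ps" and /proc once it has been reaped, so the wait status is one place where the cause remains visible.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* How one child ended (hypothetical helper type, not from the report). */
struct child_result {
    int signaled;   /* 1 if killed by a signal, 0 if it exited normally */
    int code;       /* signal number if signaled, otherwise exit status */
};

/* Fork, exec `path` in the child, and wait for it in the parent.
 * Returns 0 and fills *out on success, -1 if fork or waitpid failed. */
static int spawn_and_wait(const char *path, char *const argv[],
                          char *const envp[], struct child_result *out)
{
    assert(path != NULL && argv != NULL && out != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execve(path, argv, envp);
        /* Only reached if execve failed: leave the child immediately
         * instead of falling through into the parent's code path. */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    if (WIFSIGNALED(status)) {
        out->signaled = 1;
        out->code = WTERMSIG(status);    /* e.g. SIGILL */
    } else if (WIFEXITED(status)) {
        out->signaled = 0;
        out->code = WEXITSTATUS(status); /* 127 here means execve failed */
    } else {
        return -1;
    }
    return 0;
}
```

Called with the same execFileName, paraList and env as above, a result with signaled == 1 and code == SIGILL would distinguish "killed during startup" from "never started". A long-running parent would more likely use a SIGCHLD handler or waitpid with WNOHANG than a blocking wait.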

Comment 3 Florian Weimer 2024-05-16 11:47:04 UTC

If the new process is hanging, it should be easy enough to attach GDB to it with “gdb -p PID”, and figure out what is going on.

Comment 4 Liao Zhicai 2024-05-17 02:34:55 UTC

(In reply to Florian Weimer from comment #3)
> If the new process is hanging, it should be easy enough to attach GDB to it
> with “gdb -p PID”, and figure out what is going on.

Thanks for your advice.
Unfortunately, because of size limits, the GDB component has been removed from our image.
When the issue happens, the process cannot be seen with the "ps" command and is not present under the /proc directory either.
Could you kindly recommend any small, easy-to-obtain third-party tool for peeking at process memory? Thanks a lot.

Comment 5 Florian Weimer 2024-05-17 08:55:06 UTC

If the process does not show up on /proc, it doesn't exist, so this must be something else.

Comment 6 Liao Zhicai 2024-06-26 08:09:35 UTC

Further investigation shows that when __libc_csu_init ran to its end and returned to its caller, __libc_start_main, through the instruction jr ra, the CPU raised an RI (reserved instruction) exception. The kernel then sent a SIGILL signal to the process and stopped it.

(gdb) info all-registers 
          zero       at       v0       v1       a0       a1       a2       a3
 R0   00000000 00000001 00000000 00000000 556c3084 00000000 7f7bb9bc 00000000 
            t0       t1       t2       t3       t4       t5       t6       t7
 R8   00000063 00000000 00000000 8446bd30 8572eeb0 77568520 00000000 7755f0bc 
            s0       s1       s2       s3       s4       s5       s6       s7
 R16  00000000 556a0f60 00000000 7fd270e4 55602168 77f3d000 77fabac8 bc8958d1 
            t8       t9       k0       k1       gp       sp       s8       ra
 R24  77f5b4a0 00000000 00000000 00000000 556cb080 7f7bb898 55769190 773eb198 
            sr       lo       hi      bad    cause       pc
      00009c0c 00000000 00000000 556a1000 00000028 556a1000 
          fcsr      fir
      00000000 00000000 


0007ef60 <__libc_csu_init@@Base>:
   7ef60:       3c1c0003        lui     gp,0x3
   7ef64:       279ca120        addiu   gp,gp,-24288
   7ef68:       0399e021        addu    gp,gp,t9
   7ef6c:       27bdffc8        addiu   sp,sp,-56
   7ef70:       afbf0034        sw      ra,52(sp)
   7ef74:       afb50030        sw      s5,48(sp)
   7ef78:       afb4002c        sw      s4,44(sp)
   7ef7c:       afb30028        sw      s3,40(sp)
   7ef80:       afb20024        sw      s2,36(sp)
   7ef84:       afb10020        sw      s1,32(sp)
   7ef88:       afb0001c        sw      s0,28(sp)
   7ef8c:       00809825        move    s3,a0
   7ef90:       8f998c1c        lw      t9,-29668(gp)
   7ef94:       00a0a025        move    s4,a1
   7ef98:       afbc0010        sw      gp,16(sp)
   7ef9c:       0320f809        jalr    t9
   7efa0:       00c0a825        move    s5,a2
   7efa4:       8fbc0010        lw      gp,16(sp)
   7efa8:       8f908c20        lw      s0,-29664(gp)
   7efac:       8f928c24        lw      s2,-29660(gp)
   7efb0:       02509023        subu    s2,s2,s0
   7efb4:       00129083        sra     s2,s2,0x2
   7efb8:       1240000a        beqz    s2,7efe4 <__libc_csu_init@@Base+0x84>
   7efbc:       00008825        move    s1,zero
   7efc0:       8e190000        lw      t9,0(s0)
   7efc4:       02a03025        move    a2,s5
   7efc8:       02802825        move    a1,s4
   7efcc:       26310001        addiu   s1,s1,1
   7efd0:       02602025        move    a0,s3
   7efd4:       0320f809        jalr    t9
   7efd8:       26100004        addiu   s0,s0,4
   7efdc:       1651fff8        bne     s2,s1,7efc0 <__libc_csu_init@@Base+0x60>
   7efe0:       00000000        nop
   7efe4:       8fbf0034        lw      ra,52(sp)
   7efe8:       8fb50030        lw      s5,48(sp)
   7efec:       8fb4002c        lw      s4,44(sp)
   7eff0:       8fb30028        lw      s3,40(sp)
   7eff4:       8fb20024        lw      s2,36(sp)
   7eff8:       8fb10020        lw      s1,32(sp)
   7effc:       8fb0001c        lw      s0,28(sp)
   7f000:       03e00008        jr      ra // an RI exception is raised when execution reaches here
   7f004:       27bd0038        addiu   sp,sp,56
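
The cause value in the register dump above can be decoded with a short helper. This is a sketch assuming the standard MIPS32 CP0 Cause layout (ExcCode in bits 6..2, BD in bit 31); the function names are hypothetical. For cause = 0x00000028 it gives ExcCode 10 (RI) with BD clear, which under that layout means the exception is attributed to the instruction at the reported pc rather than to an instruction in a branch delay slot.

```c
#include <assert.h>
#include <stdint.h>

#define CAUSE_BD    (UINT32_C(1) << 31)  /* exception taken in a delay slot */
#define EXCCODE_RI  10u                  /* reserved instruction */

/* Extract the ExcCode field (bits 6..2) of the CP0 Cause register. */
static unsigned cause_exccode(uint32_t cause)
{
    return (cause >> 2) & 0x1fu;
}

/* Non-zero if the BD bit says the faulting instruction was in a delay slot. */
static int cause_in_delay_slot(uint32_t cause)
{
    return (cause & CAUSE_BD) != 0;
}

/* Short mnemonic for the common exception codes; "?" for the rest. */
static const char *exccode_name(unsigned code)
{
    static const char *const names[] = {
        "Int", "Mod", "TLBL", "TLBS", "AdEL", "AdES", "IBE",
        "DBE", "Sys", "Bp", "RI", "CpU", "Ov", "Tr"
    };
    assert(code < 32u);
    return code < sizeof names / sizeof names[0] ? names[code] : "?";
}

/* Check the value from the register dump in this comment. */
static void check_cause_from_report(void)
{
    uint32_t cause = UINT32_C(0x00000028);
    assert(cause_exccode(cause) == EXCCODE_RI);
    assert(!cause_in_delay_slot(cause));
}
```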

Comment 7 Florian Weimer 2024-06-26 08:12:54 UTC

This could be stack corruption by an ELF constructor, overwriting the stored return address.

Comment 8 Liao Zhicai 2024-06-27 02:05:48 UTC

(In reply to Florian Weimer from comment #7)
> This could be stack corruption by an ELF constructor, overwriting the stored
> return address.

When the exception occurred, the address stored in ra and the value stored at that address were the same as in a normal run.
The return address is 0x773b65e4, and the value stored at 0x773b65e4 is 0x8fbc0010 (lw gp,16(sp)), a normal MIPS32 instruction.

We also printed the in-memory contents of the function __libc_csu_init, and found nothing wrong with it.

[13:12:220][   44.775721] do_ri:1219 send sigill(st:-1 -1 -1 -1 -1 -1) cpu(1 1 1 1)
[13:12:220][   44.782016] do_ri:1241 send sigill status:-1 cause:0x00000028 badvaddr:0x55664000 cp-st:0x00009c0c lo-0x04674ed1 hi-0x00000002 last-0x00000000
[13:12:220][   44.794849] 32Reg:00000000 00000001 00000000 00000000 55686084 00000000 7fd651dc ffffffff 7752ce50 7752ce50 00000000 00000000 7fd64f58 0000000b 00000000 77f2f000
[13:12:221][   44.794849] 55663f60 00000000 00000000 7f8b8b2c 555c4124 77f05000 77f73ac8 b17eee51 00000000 00000000 00000010 00000000 5568e080 7fd65020 555d6190 773b65e4
[13:12:235][   44.823122] do_ri:1252 send sigill epc:0x55664000 r31:0x773b65e4(0x8fbc0010 0x8f8294e8 0x8c5400c8 0x16800032)
[13:12:235]sno:4 Fault address:0 s-code:128 eno:0
[13:12:457]/lib/libc.so.6(+0x3179a160) [0x773b6160]
[13:12:458]linux-vdso.so.1(+0x920) [0x7ff67920]
[13:12:458]/usr/bin/MecMgr(__libc_csu_init+0xa2) [0x55664002]
[13:12:459]txt(0x55664002):0x08 00 e0 03 38 00 bd 27 0800e003 00000000
[13:12:460]00000000 3400bf8f 3000b58f 2c00b48f
[13:12:460]2800b38f 2400b28f 2000b18f 1c00b08f
[13:12:461]start address:0x55663f04
[13:12:462]00:03e00008 00a21023 0082102b 14400007
[13:12:462]01:8f838c14 00042602 24050008 00642021
[13:12:462]02:90820000 03e00008 00a21023 00042402
[13:12:463]03:24050010 00642021 90820000 03e00008
[13:12:463]04:00a21023 00042202 24050018 00642021
[13:12:464]05:90820000 03e00008 00a21023 3c1c0003
[13:12:464]06:279ca120 0399e021 27bdffc8 afbf0034
[13:12:465]07:afb50030 afb4002c afb30028 afb20024
[13:12:465]08:afb10020 afb0001c 00809825 8f998c1c
[13:12:466]09:00a0a025 afbc0010 0320f809 00c0a825
[13:12:466]10:8fbc0010 8f908c20 8f928c24 02509023
[13:12:467]11:00129083 1240000a 00008825 8e190000
[13:12:467]12:02a03025 02802825 26310001 02602025
[13:12:468]13:0320f809 26100004 1651fff8 00000000
[13:12:469]14:8fbf0034 8fb50030 8fb4002c 8fb30028
[13:12:469]15:8fb20024 8fb10020 8fb0001c 03e00008
[13:12:470]16:27bd0038 03e00008 00000000 8f998010
[13:12:470]17:03e07825 0320f809 241803dc 8f998010
[13:12:476]18:03e07825 0320f809 241803db 8f998010
[13:12:477]19:03e07825 0320f809 241803da 8f998010
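
The instruction words quoted in this comment can be checked mechanically. The following is a sketch of a field decoder for the classic MIPS32 encodings only (pre-Release 6), with hypothetical names; it confirms that 0x8fbc0010 is lw gp,16(sp), that 0x03e00008 is jr ra, and that the delay-slot word 0x27bd0038 is addiu sp,sp,56.

```c
#include <assert.h>
#include <stdint.h>

/* Fields shared by the I-type and R-type MIPS32 instruction formats. */
struct mips_fields {
    unsigned opcode;  /* bits 31..26 */
    unsigned rs;      /* bits 25..21 */
    unsigned rt;      /* bits 20..16 */
    unsigned funct;   /* bits 5..0, meaningful when opcode == 0 (SPECIAL) */
    int32_t  imm;     /* bits 15..0, sign-extended */
};

static struct mips_fields mips_decode(uint32_t word)
{
    struct mips_fields f;
    f.opcode = word >> 26;
    f.rs     = (word >> 21) & 0x1fu;
    f.rt     = (word >> 16) & 0x1fu;
    f.funct  = word & 0x3fu;
    f.imm    = (int32_t)(word & 0xffffu);
    if (f.imm & 0x8000)
        f.imm -= 0x10000;   /* sign-extend the 16-bit immediate */
    return f;
}

/* jr: SPECIAL opcode 0, funct 0x08, with bits 20..11 zero. */
static int mips_is_jr(uint32_t word)
{
    return (word >> 26) == 0u && (word & 0x3fu) == 0x08u &&
           ((word >> 11) & 0x3ffu) == 0u;
}

/* Check the words quoted in this comment (registers: 28=gp, 29=sp, 31=ra). */
static void check_words_from_report(void)
{
    struct mips_fields lw = mips_decode(UINT32_C(0x8fbc0010));
    assert(lw.opcode == 0x23u && lw.rs == 29u && lw.rt == 28u && lw.imm == 16);

    assert(mips_is_jr(UINT32_C(0x03e00008)));
    assert(mips_decode(UINT32_C(0x03e00008)).rs == 31u);

    struct mips_fields ai = mips_decode(UINT32_C(0x27bd0038));
    assert(ai.opcode == 0x09u && ai.rs == 29u && ai.rt == 29u && ai.imm == 56);
}
```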

Comment 9 Liao Zhicai 2024-06-27 02:22:41 UTC

#################
[13:12:458]/usr/bin/MecMgr(__libc_csu_init+0xa2) [0x55664002]
##################
We found that the call stack printed by the __backtrace function in the glibc library is abnormal: the instruction address (offset 0xa2) is not aligned, although the value stored in the register is correct (offset 0xa0).

We had ignored this clue, assuming that the value stored in the register was correct, and did not investigate it further.
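
The address arithmetic behind this observation can be written out as a small sketch. The helper names are hypothetical and the 4 KiB page size is an assumption that the report does not state. Besides the 0xa2 versus 0xa0 mismatch, the same arithmetic shows that the faulting pc in both dumps (0x556a1000 in comment 6 and 0x55664000 in comment 8) is a multiple of 4096. Whether that is relevant is not established in this report.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages, not stated in the report */

/* MIPS32 instructions are 4 bytes long and must be 4-byte aligned. */
static int is_word_aligned(uint32_t addr)
{
    return (addr & 3u) == 0u;
}

/* Offset of addr within its page; page_size must be a power of two. */
static uint32_t page_offset(uint32_t addr, uint32_t page_size)
{
    assert(page_size != 0u && (page_size & (page_size - 1u)) == 0u);
    return addr & (page_size - 1u);
}

/* Check the addresses quoted in comments 6, 8 and 9. */
static void check_addresses_from_report(void)
{
    uint32_t epc       = UINT32_C(0x55664000);  /* epc from the kernel log */
    uint32_t backtrace = UINT32_C(0x55664002);  /* address printed by the backtrace */
    uint32_t func      = epc - 0xa0u;           /* start of __libc_csu_init */

    assert(is_word_aligned(epc));
    assert(!is_word_aligned(backtrace));
    assert(backtrace - func == 0xa2u);

    /* The memory dump starts at 0x55663f04; the function's first word
     * (3c1c0003, lui gp) is the 24th word of that dump (index 23). */
    assert(UINT32_C(0x55663f04) + 23u * 4u == func);

    /* Under the 4 KiB assumption, both faulting addresses start a page. */
    assert(page_offset(epc, ASSUMED_PAGE_SIZE) == 0u);
    assert(page_offset(UINT32_C(0x556a1000), ASSUMED_PAGE_SIZE) == 0u);
}
```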

Comment 10 Liao Zhicai 2024-07-15 02:06:03 UTC

Hi fweimer,
What's your opinion?
What would you suggest we do?

There seems to be nothing wrong with the "illegal instruction":
jr ra
addiu   sp,sp,56

The following instruction (addiu sp,sp,56) is also perfectly valid, even taking the branch/jump delay slot into account.

What should our next step be?
Could you kindly help us with this? Thanks a lot.

Comment 11 Adhemerval Zanella 2024-07-16 12:39:06 UTC

It seems to be an issue tied to the MIPS architecture, along with the kernel used. This kind of problem is really hard to debug without prior knowledge of the architecture and/or access to the hardware itself (for instance, see BZ 31394, which appears to be a similarly hard-to-debug SPARC issue).

Is this issue reproducible with qemu-system? If so, it would be easier to check; otherwise, I think you will need to figure out why your MIPS box is trapping on what seems to be a valid instruction.