26038 – [core_dump] call getpwnam_r() appears core_dump

Bug 26038 - [core_dump] call getpwnam_r() appears core_dump

Summary: [core_dump] call getpwnam_r() appears core_dump

Status:	UNCONFIRMED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	nss (show other bugs)
Version:	2.31

Importance:	P2 critical
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-05-25 09:25 UTC by liaichun
Modified:	2020-08-10 12:55 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
nss-make-sure-startp_initialized-do-first (442 bytes, patch) 2020-08-10 12:55 UTC, liaichun	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description liaichun 2020-05-25 09:25:19 UTC

When I repeatedly call getpwnam_r (), there will be a small probability of coredump. I analyze the cause of the problem，please see the description below.

nss/getXXbyYY_r.c

libc_hidden_proto (DB_LOOKUP_FCT)
int
INTERNAL (REENTRANT_NAME) (ADD_PARAMS, LOOKUP_TYPE *resbuf, char *buffer,
			   size_t buflen, LOOKUP_TYPE **result H_ERRNO_PARM
			   EXTRA_PARAMS)
{
   ......
   static bool startp_initialized;
   static lookup_function start_fct;
   union
  {
    lookup_function l;
    void *ptr;
  } fct;
  ......
  if (! startp_initialized)
    {
      no_more = DB_LOOKUP_FCT (&nip, REENTRANT_NAME_STRING,
			       REENTRANT2_NAME_STRING, &fct.ptr);
      if (no_more)     /* first init, if no_more, start_fct would always 0x0 */
	{
         /* do something */  
	}
      else
	{
	  void *tmp_ptr = fct.l;
	  start_fct = tmp_ptr;
          /* do something */
	}
      atomic_write_barrier ();
      startp_initialized = true;
   }
  else
    {
      fct.l = start_fct;   /* if first no_more, start_fct is 0x0, fct.l would be always 0x0. */ 
      nip = startp;
  
      no_more = nip == (service_user *) -1l;
    }
  ......
  while (no_more == 0)
    {
#ifdef NEED_H_ERRNO
      any_service = true;
#endif
      status = DL_CALL_FCT (fct.l, (ADD_VARIABLES, resbuf, buffer, buflen,
				    &errno H_ERRNO_VAR EXTRA_VARIABLES));
    /* fct.l is 0x0, and here call function fct.1 would core dump */
  ......
}

1、when first call DB_LOOKUP_FCT() return -1, startp_initialized would be true, but start_fct stil is 0x0， and would never change.
2、Once start_fct is 0x0, fct.l would be always 0x0 n matter how many times getpwnam_r() is called.
3、if fct.l is 0x0, the next call DL_CALL_FCT(fct.l,xxx) would cause core_dump.

==============================================================================
I think if first startp_initialized err, next should call DB_LOOKUP_FCT() again, util successfully get, and only then is the initialization successful.

libc_hidden_proto (DB_LOOKUP_FCT)
int
INTERNAL (REENTRANT_NAME) (ADD_PARAMS, LOOKUP_TYPE *resbuf, char *buffer,
			   size_t buflen, LOOKUP_TYPE **result H_ERRNO_PARM
			   EXTRA_PARAMS)
{
   ......
   static bool startp_initialized;
   static lookup_function start_fct;
   union
  {
    lookup_function l;
    void *ptr;
  } fct;
  ......
  if (! startp_initialized)
    {
      no_more = DB_LOOKUP_FCT (&nip, REENTRANT_NAME_STRING,
			       REENTRANT2_NAME_STRING, &fct.ptr);
      if (no_more)     
	{
-	  void *tmp_ptr = (service_user *) -1l;
-#ifdef PTR_MANGLE
-	  PTR_MANGLE (tmp_ptr);
-#endif
-	  startp = tmp_ptr;
+#ifdef NEED_H_ERRNO
+	*h_errnop = NETDB_INTERNAL;
+#endif
+	*result = NULL;
+	return errno;
	}
      else
	{
	  void *tmp_ptr = fct.l;
	  start_fct = tmp_ptr;
          /* do something */
	}
      atomic_write_barrier ();
      startp_initialized = true;
   }

  ......
}

Comment 1 liaichun 2020-05-25 09:33:54 UTC

here is my debug code, i increased the judgment of fct, and triggered it
  else
    {
      fct.l = start_fct;   
      nip = startp;
if (fct.l == 0x0)            
   *(int *)0 = 0    
    }

#0  __getpwnam_r (name=0xffffbd62f940 <g_logOwner> "snasuser", resbuf=0xffffbd0257a8, buffer=0xffffbd0257d8 "", buflen=96, result=0x0) at ../nss/getXXbyYY_r.c:306
306             *(int *)0 = 0;
[Current thread is 1 (Thread 0xffffbd0261d0 (LWP 83654))]
Missing separate debuginfos, use: dnf debuginfo-install snas-cm-3.6.1.126270-1.aarch64
(gdb) bt
#0  __getpwnam_r (name=0xffffbd62f940 <g_logOwner> "snasuser", resbuf=0xffffbd0257a8, buffer=0xffffbd0257d8 "", buflen=96, result=0x0) at ../nss/getXXbyYY_r.c:306
#1  0x0000ffffbd46c7d4 in __LOG_LogSetOwner (fd=7) at base_log.c:1778
#2  0x0000ffffbd46b988 in Base_Log_FileNeedExport (iFd=3, pcPath=0xffffbd025938 "/var/log/snas_CM.log", moduleId=MODULE_CM) at base_log.c:942
#3  0x0000ffffbd46ca1c in __LOG_LogSelfTask (pArgv=0xffffbd638198 <g_acSubSys>) at base_log.c:1874
#4  0x0000ffffbe4538bc in start_thread (arg=0xffffc2be497f) at pthread_create.c:486
#5  0x0000ffffbd1fda1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb)

Comment 2 Andreas Schwab 2020-05-25 09:42:20 UTC

If no_more is nonzero then the loop is never entered.

Comment 3 liaichun 2020-05-25 09:46:51 UTC

(In reply to Andreas Schwab from comment #2)
> If no_more is nonzero then the loop is never entered.

---int no_more not static int no_more,so when i call getpwnam_r () again ,no_more would be zero, and into the loop

Comment 4 Andreas Schwab 2020-05-25 09:52:47 UTC

no_more is set in both branches.

Comment 5 liaichun 2020-05-25 10:00:39 UTC

(In reply to Andreas Schwab from comment #4)
> no_more is set in both branches.

----do you mean here:
      no_more = nip == (service_user *) -1l;
      while (no_more == 0)
  when call again, nip is 0x0,  and no_more is also zero;

Comment 6 Andreas Schwab 2020-05-25 10:06:50 UTC

nip cannot be 0.  It is either a valid function address or -1.

Comment 7 liaichun 2020-05-25 11:25:48 UTC

(In reply to Andreas Schwab from comment #6)
> nip cannot be 0.  It is either a valid function address or -1.

Sorry，I haven't watched all before， 
but it did happen fct.l is zero and nip is a valid function address .

(gdb) info local
startp_initialized = true
startp = 0x58b26143a952eb88
start_fct = 0x58b23416ed3e4548
nip = 0xffffac000b20
do_merge = 0
mergegrp = <optimized out>
mergebuf = 0x0
endptr = 0x0
fct = {l = 0x0, ptr = 0x0}
no_more = <optimized out>
err = <optimized out>
status = NSS_STATUS_UNAVAIL
nscd_status = <optimized out>
res = <optimized out>
(gdb) q

I guess this is caused by multi-threaded calls ?

Comment 8 Andreas Schwab 2020-05-25 12:20:28 UTC

What do 0x58b26143a952eb88 and 0x58b23416ed3e4548 demangle to?

Comment 9 Andreas Schwab 2020-05-25 12:29:29 UTC

nip should of course be a valid heap address or -1.

Comment 10 liaichun 2020-06-09 08:29:05 UTC

(In reply to Andreas Schwab from comment #9)
> nip should of course be a valid heap address or -1.

Sorry, i removed all my edits，and tried for a long time，it finally reappeared.

(gdb) p __pointer_chk_guard
$1 = 8256925845475092582
(gdb) p * __pointer_chk_guard
Cannot access memory at address 0x72967e4f3448e866
(gdb) p/x __pointer_chk_guard
$2 = 0x72967e4f3448e866
(gdb) bt
#0  0x72967e4f3448e866 in ?? ()
#1  0x0000ffff980683a4 in __getpwnam_r (name=0xffff984ca940 <g_logOwner> "snasuser", resbuf=0xffff975be868, buffer=0xffff975be898 "", buflen=96, result=0x0) at ../nss/getXXbyYY_r.c:315
#2  0x0000ffff983077d4 in __LOG_LogSetOwner (fd=7) at base_log.c:1778
#3  0x0000ffff98308cc0 in __LOG_Common_LogSelfTask (pArgv=0xffff991a1df8 <g_stLogUserAuditMgt>) at base_log.c:2588
#4  0x0000ffff992ee8bc in start_thread (arg=0xfffffcac7aff) at pthread_create.c:486
#5  0x0000ffff980989dc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) info local
startp_initialized = true
startp = 0x7296d4e5ca99a906
start_fct = 0x729681b0a29137c6
nip = 0x72967e4f3448e866
do_merge = 0
mergegrp = <optimized out>
mergebuf = 0xffff9841a658 "V100R001C01"
endptr = 0x0
fct = {l = 0x72967e4f3448e866, ptr = 0x72967e4f3448e866}
no_more = 0
err = <optimized out>
status = NSS_STATUS_UNAVAIL
nscd_status = <optimized out>
res = <optimized out>
(gdb) exit

nip is 0x72967e4f3448e866 after PTR_DEMANGLE, so i guess nip is zero before PTR_DEMANGLE?

Comment 11 Andreas Schwab 2020-06-09 08:57:48 UTC

Most likely you have some memory corruption.  Try valgrind.

Comment 12 liaichun 2020-06-09 09:43:16 UTC

(In reply to Andreas Schwab from comment #11)
> Most likely you have some memory corruption.  Try valgrind.

Thanks, I will check it. 
And there is also a message ,i tried add atomic_read_barrier() before nip assigned, it hasn't appeared in 2 weeks.If nothing changes, it can appear 4 or 5 times in 2 weeks.

--------------------------------------
libc_hidden_proto (DB_LOOKUP_FCT)
  else
    {
+     atomic_read_barrier();
      fct.l = start_fct;
      nip = startp;
#ifdef PTR_DEMANGLE
      PTR_DEMANGLE (fct.l);
      PTR_DEMANGLE (nip);
#endif
      no_more = nip == (service_user *) -1l;
    }

Comment 13 liaichun 2020-08-05 13:57:25 UTC

(In reply to liaichun from comment #12)
> (In reply to Andreas Schwab from comment #11)
> > Most likely you have some memory corruption.  Try valgrind.
> 
> Thanks, I will check it. 
> And there is also a message ,i tried add atomic_read_barrier() before nip
> assigned, it hasn't appeared in 2 weeks.If nothing changes, it can appear 4
> or 5 times in 2 weeks.
> 
> --------------------------------------
> libc_hidden_proto (DB_LOOKUP_FCT)
>   else
>     {
> +     atomic_read_barrier();
>       fct.l = start_fct;
>       nip = startp;
> #ifdef PTR_DEMANGLE
>       PTR_DEMANGLE (fct.l);
>       PTR_DEMANGLE (nip);
> #endif
>       no_more = nip == (service_user *) -1l;
>     }

================================================================
I Try valgrind, but nothing found.
And the same coredump occurs several times ... 
If i add atomic_read_barrier() before nip assigned, and that never happened again.

Comment 14 liaichun 2020-08-10 12:55:26 UTC

Created attachment 12757 [details]
nss-make-sure-startp_initialized-do-first