When I call getpwnam_r () repeatedly, there is a small probability of a core dump. I analyzed the cause of the problem; please see the description below.

nss/getXXbyYY_r.c:

    libc_hidden_proto (DB_LOOKUP_FCT)
    int
    INTERNAL (REENTRANT_NAME) (ADD_PARAMS, LOOKUP_TYPE *resbuf, char *buffer,
                               size_t buflen, LOOKUP_TYPE **result H_ERRNO_PARM
                               EXTRA_PARAMS)
    {
      ......
      static bool startp_initialized;
      static lookup_function start_fct;
      union
      {
        lookup_function l;
        void *ptr;
      } fct;
      ......
      if (! startp_initialized)
        {
          no_more = DB_LOOKUP_FCT (&nip, REENTRANT_NAME_STRING,
                                   REENTRANT2_NAME_STRING, &fct.ptr);
          if (no_more) /* On first init, if no_more is set, start_fct stays 0x0.  */
            {
              /* do something */
            }
          else
            {
              void *tmp_ptr = fct.l;
              start_fct = tmp_ptr;
              /* do something */
            }
          atomic_write_barrier ();
          startp_initialized = true;
        }
      else
        {
          fct.l = start_fct; /* If the first lookup set no_more, start_fct is 0x0, so fct.l is always 0x0.  */
          nip = startp;
          no_more = nip == (service_user *) -1l;
        }
      ......
      while (no_more == 0)
        {
    #ifdef NEED_H_ERRNO
          any_service = true;
    #endif
          status = DL_CALL_FCT (fct.l, (ADD_VARIABLES, resbuf, buffer, buflen,
                                        &errno H_ERRNO_VAR EXTRA_VARIABLES));
          /* fct.l is 0x0, so calling fct.l here would dump core.  */
          ......
        }

1. When the first call to DB_LOOKUP_FCT () returns -1, startp_initialized becomes true, but start_fct is still 0x0 and never changes.
2. Once start_fct is 0x0, fct.l is always 0x0, no matter how many times getpwnam_r () is called.
3. If fct.l is 0x0, the next DL_CALL_FCT (fct.l, xxx) causes a core dump.

==============================================================================

I think that if the first initialization fails, the next call should invoke DB_LOOKUP_FCT () again until it succeeds, and only then should the initialization be considered successful.

    libc_hidden_proto (DB_LOOKUP_FCT)
    int
    INTERNAL (REENTRANT_NAME) (ADD_PARAMS, LOOKUP_TYPE *resbuf, char *buffer,
                               size_t buflen, LOOKUP_TYPE **result H_ERRNO_PARM
                               EXTRA_PARAMS)
    {
      ......
      static bool startp_initialized;
      static lookup_function start_fct;
      union
      {
        lookup_function l;
        void *ptr;
      } fct;
      ......
      if (! startp_initialized)
        {
          no_more = DB_LOOKUP_FCT (&nip, REENTRANT_NAME_STRING,
                                   REENTRANT2_NAME_STRING, &fct.ptr);
          if (no_more)
            {
    -         void *tmp_ptr = (service_user *) -1l;
    -#ifdef PTR_MANGLE
    -         PTR_MANGLE (tmp_ptr);
    -#endif
    -         startp = tmp_ptr;
    +#ifdef NEED_H_ERRNO
    +         *h_errnop = NETDB_INTERNAL;
    +#endif
    +         *result = NULL;
    +         return errno;
            }
          else
            {
              void *tmp_ptr = fct.l;
              start_fct = tmp_ptr;
              /* do something */
            }
          atomic_write_barrier ();
          startp_initialized = true;
        }
      ......
    }
Here is my debug code. I added a check on fct, and it was triggered:

      else
        {
          fct.l = start_fct;
          nip = startp;
          if (fct.l == 0x0)
            *(int *)0 = 0;
        }

    #0  __getpwnam_r (name=0xffffbd62f940 <g_logOwner> "snasuser", resbuf=0xffffbd0257a8, buffer=0xffffbd0257d8 "", buflen=96, result=0x0) at ../nss/getXXbyYY_r.c:306
    306         *(int *)0 = 0;
    [Current thread is 1 (Thread 0xffffbd0261d0 (LWP 83654))]
    Missing separate debuginfos, use: dnf debuginfo-install snas-cm-3.6.1.126270-1.aarch64
    (gdb) bt
    #0  __getpwnam_r (name=0xffffbd62f940 <g_logOwner> "snasuser", resbuf=0xffffbd0257a8, buffer=0xffffbd0257d8 "", buflen=96, result=0x0) at ../nss/getXXbyYY_r.c:306
    #1  0x0000ffffbd46c7d4 in __LOG_LogSetOwner (fd=7) at base_log.c:1778
    #2  0x0000ffffbd46b988 in Base_Log_FileNeedExport (iFd=3, pcPath=0xffffbd025938 "/var/log/snas_CM.log", moduleId=MODULE_CM) at base_log.c:942
    #3  0x0000ffffbd46ca1c in __LOG_LogSelfTask (pArgv=0xffffbd638198 <g_acSubSys>) at base_log.c:1874
    #4  0x0000ffffbe4538bc in start_thread (arg=0xffffc2be497f) at pthread_create.c:486
    #5  0x0000ffffbd1fda1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
    (gdb)
If no_more is nonzero then the loop is never entered.
(In reply to Andreas Schwab from comment #2)
> If no_more is nonzero then the loop is never entered.

no_more is declared as "int no_more", not "static int no_more", so when I call getpwnam_r () again, no_more would be zero and the loop would be entered.
no_more is set in both branches.
(In reply to Andreas Schwab from comment #4)
> no_more is set in both branches.

Do you mean here?

      no_more = nip == (service_user *) -1l;
      while (no_more == 0)

When the function is called again, nip is 0x0, so no_more is also zero.
nip cannot be 0. It is either a valid function address or -1.
(In reply to Andreas Schwab from comment #6)
> nip cannot be 0. It is either a valid function address or -1.

Sorry, I had not looked at everything before, but it did happen that fct.l is zero while nip is a valid function address.

    (gdb) info local
    startp_initialized = true
    startp = 0x58b26143a952eb88
    start_fct = 0x58b23416ed3e4548
    nip = 0xffffac000b20
    do_merge = 0
    mergegrp = <optimized out>
    mergebuf = 0x0
    endptr = 0x0
    fct = {l = 0x0, ptr = 0x0}
    no_more = <optimized out>
    err = <optimized out>
    status = NSS_STATUS_UNAVAIL
    nscd_status = <optimized out>
    res = <optimized out>
    (gdb) q

I guess this is caused by multi-threaded calls?
What do 0x58b26143a952eb88 and 0x58b23416ed3e4548 demangle to?
nip should of course be a valid heap address or -1.
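The point made in the replies above (no_more is set in both branches, and the cached service pointer is either valid or -1) can be illustrated with a minimal, single-threaded sketch of a sentinel-based cache. This is not the glibc source: `service`, `lookup_fn`, `resolve`, `cached_lookup`, and `NO_SERVICE` are hypothetical stand-ins, and pointer mangling and barriers are left out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for service_user and lookup_function.  */
typedef struct service { const char *name; } service;
typedef int (*lookup_fn) (const char *key);

/* Sentinel meaning "the first lookup found no service".  */
#define NO_SERVICE ((service *) -1L)

static bool cache_ready;          /* plays the role of startp_initialized */
static service *cached_service;   /* plays the role of startp */
static lookup_fn cached_fn;       /* plays the role of start_fct */

static service sample_service = { "files" };

int sample_lookup (const char *key) { return key != NULL ? 1 : 0; }

/* Hypothetical resolver: nonzero return means no service is available.  */
int resolve (bool available, service **sp, lookup_fn *fp)
{
  if (!available)
    return -1;
  *sp = &sample_service;
  *fp = sample_lookup;
  return 0;
}

/* Returns how many times the lookup function was actually called.  */
int cached_lookup (bool available, const char *key)
{
  service *nip = NULL;
  lookup_fn fct = NULL;
  int no_more;

  if (!cache_ready)
    {
      no_more = resolve (available, &nip, &fct);
      if (no_more)
        cached_service = NO_SERVICE;   /* remember the failure */
      else
        {
          cached_service = nip;
          cached_fn = fct;
        }
      cache_ready = true;
    }
  else
    {
      fct = cached_fn;
      nip = cached_service;
      no_more = nip == NO_SERVICE;     /* failure => loop is skipped */
    }

  int calls = 0;
  while (no_more == 0)
    {
      calls += fct (key);
      no_more = 1;                     /* one service only in this sketch */
    }
  return calls;
}
```

In this sketch a failed first lookup stores the sentinel, so every later call computes a nonzero no_more and never reaches the call through the (still null) cached function pointer. A null call would only be possible if a later caller saw the ready flag without also seeing the stored sentinel.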
(In reply to Andreas Schwab from comment #9)
> nip should of course be a valid heap address or -1.

Sorry, I removed all my edits and tried for a long time; it finally reappeared.

    (gdb) p __pointer_chk_guard
    $1 = 8256925845475092582
    (gdb) p * __pointer_chk_guard
    Cannot access memory at address 0x72967e4f3448e866
    (gdb) p/x __pointer_chk_guard
    $2 = 0x72967e4f3448e866
    (gdb) bt
    #0  0x72967e4f3448e866 in ?? ()
    #1  0x0000ffff980683a4 in __getpwnam_r (name=0xffff984ca940 <g_logOwner> "snasuser", resbuf=0xffff975be868, buffer=0xffff975be898 "", buflen=96, result=0x0) at ../nss/getXXbyYY_r.c:315
    #2  0x0000ffff983077d4 in __LOG_LogSetOwner (fd=7) at base_log.c:1778
    #3  0x0000ffff98308cc0 in __LOG_Common_LogSelfTask (pArgv=0xffff991a1df8 <g_stLogUserAuditMgt>) at base_log.c:2588
    #4  0x0000ffff992ee8bc in start_thread (arg=0xfffffcac7aff) at pthread_create.c:486
    #5  0x0000ffff980989dc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
    (gdb) info local
    startp_initialized = true
    startp = 0x7296d4e5ca99a906
    start_fct = 0x729681b0a29137c6
    nip = 0x72967e4f3448e866
    do_merge = 0
    mergegrp = <optimized out>
    mergebuf = 0xffff9841a658 "V100R001C01"
    endptr = 0x0
    fct = {l = 0x72967e4f3448e866, ptr = 0x72967e4f3448e866}
    no_more = 0
    err = <optimized out>
    status = NSS_STATUS_UNAVAIL
    nscd_status = <optimized out>
    res = <optimized out>
    (gdb) exit

nip is 0x72967e4f3448e866 after PTR_DEMANGLE, so I guess nip was zero before PTR_DEMANGLE?
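The arithmetic behind that guess can be checked with a small sketch. It assumes that pointer mangling on this AArch64 build is a plain XOR with __pointer_chk_guard (some other architectures add a rotate, which this sketch does not model). `xor_mangle` and `xor_demangle` are hypothetical helper names, not glibc macros; the constants are copied from the gdb session above.

```c
#include <stdint.h>

/* Assumed scheme: mangling and demangling are the same XOR with the guard.  */
uint64_t xor_mangle (uint64_t ptr, uint64_t guard)
{
  return ptr ^ guard;
}

uint64_t xor_demangle (uint64_t mangled, uint64_t guard)
{
  return mangled ^ guard;
}

/* Values taken from the gdb output in this report.  */
static const uint64_t guard_value       = 0x72967e4f3448e866ULL;
static const uint64_t startp_mangled    = 0x7296d4e5ca99a906ULL;
static const uint64_t start_fct_mangled = 0x729681b0a29137c6ULL;
```

Under that assumption, demangling a stored value of 0 yields exactly the guard, which matches the observed nip and fct.l (0x72967e4f3448e866). The statics themselves demangle to 0x0000aaaafed14160 (startp) and 0x0000ffff96d9dfa0 (start_fct), which look like ordinary user-space addresses. That is consistent with the thread having read both statics as 0 before the initializing stores became visible to it, although it does not by itself rule out other causes.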
Most likely you have some memory corruption. Try valgrind.
(In reply to Andreas Schwab from comment #11)
> Most likely you have some memory corruption. Try valgrind.

Thanks, I will check it.
There is one more piece of information: I tried adding atomic_read_barrier () before nip is assigned, and the crash has not appeared in 2 weeks. Without that change, it appears 4 or 5 times in 2 weeks.

--------------------------------------
    libc_hidden_proto (DB_LOOKUP_FCT)
      else
        {
    +     atomic_read_barrier();
          fct.l = start_fct;
          nip = startp;
    #ifdef PTR_DEMANGLE
          PTR_DEMANGLE (fct.l);
          PTR_DEMANGLE (nip);
    #endif
          no_more = nip == (service_user *) -1l;
        }
(In reply to liaichun from comment #12)
> (In reply to Andreas Schwab from comment #11)
> > Most likely you have some memory corruption. Try valgrind.
>
> Thanks, I will check it.
> And there is also a message ,i tried add atomic_read_barrier() before nip
> assigned, it hasn't appeared in 2 weeks.If nothing changes, it can appear 4
> or 5 times in 2 weeks.
>
> --------------------------------------
> libc_hidden_proto (DB_LOOKUP_FCT)
> else
> {
> + atomic_read_barrier();
> fct.l = start_fct;
> nip = startp;
> #ifdef PTR_DEMANGLE
> PTR_DEMANGLE (fct.l);
> PTR_DEMANGLE (nip);
> #endif
> no_more = nip == (service_user *) -1l;
> }

================================================================
I tried valgrind, but it found nothing, and the same core dump occurred several more times.
If I add atomic_read_barrier () before nip is assigned, it never happens again.
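The pairing that the added read barrier provides can be sketched with standard C11 atomics. This is an illustration only: it uses release/acquire operations from <stdatomic.h> rather than glibc's internal atomic_write_barrier and atomic_read_barrier macros, and `publish`, `consume`, `start_fn`, `initialized`, and `twice_sample` are hypothetical names. It also assumes a single publishing thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*lookup_fn) (int);

static lookup_fn start_fn;        /* plain data, published via the flag */
static atomic_bool initialized;   /* zero-initialized, i.e. false */

int twice_sample (int x) { return 2 * x; }

/* Writer: store the data first, then set the flag with release ordering,
   so the data store cannot become visible after the flag.  */
void publish (lookup_fn fn)
{
  start_fn = fn;
  atomic_store_explicit (&initialized, true, memory_order_release);
}

/* Reader: load the flag with acquire ordering before touching the data,
   so a reader that sees the flag also sees the data stored before it.
   Returns NULL if nothing has been published yet.  */
lookup_fn consume (void)
{
  if (!atomic_load_explicit (&initialized, memory_order_acquire))
    return NULL;
  return start_fn;
}
```

A write barrier on the publishing side orders only the writer's stores. On a weakly ordered machine such as AArch64, a reader without a matching acquire (or read barrier) may observe the flag as true and still load the old value of the data, which would match the zero values seen in the core dump.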
Created attachment 12757
nss-make-sure-startp_initialized-do-first