This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug network/18163] New: SIGSEGV in getservbyname_r - read barrier missing


https://sourceware.org/bugzilla/show_bug.cgi?id=18163

            Bug ID: 18163
           Summary: SIGSEGV in getservbyname_r - read barrier missing
           Product: glibc
           Version: 2.21
            Status: NEW
          Severity: normal
          Priority: P2
         Component: network
          Assignee: unassigned at sourceware dot org
          Reporter: thomas.stuefe at gmail dot com

Dear all,

I'm looking into a crash on 64-bit PowerPC: a SIGSEGV inside
getservbyname_r(). The crash happened with glibc 2.11.3 on SUSE Linux.

I may have found the cause and if I am right it still persists in the latest
sources (I checked 2.21 as well as the git repository).

Similar crashes did happen for us on Linux ia64. The crashes are extremely rare
(once every month or so) and not reproducible. We run a large number of tests
every night on a range of platforms, and only powerpc64 and ia64 showed those
crashes.

I dug into the core file on powerpc and I think there is a race caused by a
missing read barrier.

The code is in nss/getXXbyYY_r.c.

I believe the value of "start_fct" can be read before "startp_initialized".

During initialization, the writes to "startp" and "start_fct" are followed
by a write barrier before "startp_initialized" is set (see getXXbyYY_r.c:246).

However, when reading those values, there is no read barrier which would force
"startp_initialized" to be read before "start_fct". The assembly generated by
gcc for this looks like this:

A)   0x00000fff9aa632cc <.getservbyname_r+236>:   ld      r31,-13240(r2)
B)   0x00000fff9aa632d0 <.getservbyname_r+240>:   lbz     r0,0(r31)
     0x00000fff9aa632d4 <.getservbyname_r+244>:   cmpwi   cr7,r0,0
     0x00000fff9aa632d8 <.getservbyname_r+248>:   beq-    cr7,0xfff9aa63450 <.getservbyname_r+624>
C)   0x00000fff9aa632dc <.getservbyname_r+252>:   ld      r11,16(r31)
D)   0x00000fff9aa632e0 <.getservbyname_r+256>:   ld      r9,8(r31)

In (A) we get the address of "startp_initialized" from the TOC. In (B) we load
the value of "startp_initialized" and, if it is 0, branch away. The branch is
hinted with "-", so the processor is told to expect that it is not taken, i.e.
that "startp_initialized" is already set.

In (C) and (D), "start_fct" and "startp" are read from memory.

I believe the loads in (C) and/or (D) could be performed before the load in
(B), so we could read stale values of "start_fct" and "startp", in my case
NULL. That would explain all the details in my core.
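
To illustrate the ordering requirement described above, here is a minimal
sketch in portable C11 atomics rather than glibc's internal barrier macros.
All names in it (lookup_fn, dummy_lookup, cached_fct, cache_ready,
publish_cache, get_cached) are hypothetical stand-ins and do not appear in
glibc. It assumes a single publishing thread. The flag is stored with release
semantics and loaded with acquire semantics, so a reader that sees the flag
set is also guaranteed to see the payload written before it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the cached lookup state; not glibc code. */
typedef int (*lookup_fn)(const char *name);

static int dummy_lookup(const char *name)
{
    return name != NULL ? 0 : -1;
}

static lookup_fn cached_fct;     /* plays the role of "start_fct" */
static atomic_bool cache_ready;  /* plays the role of "startp_initialized" */

/* Writer: store the payload first, then publish the flag.  The release
   store keeps the payload write from being reordered after the flag write.
   Assumption: only one thread ever calls this function.  */
static void publish_cache(void)
{
    cached_fct = dummy_lookup;
    atomic_store_explicit(&cache_ready, true, memory_order_release);
}

/* Reader: load the flag first, then the payload.  The acquire load keeps
   the payload read from being performed before the flag read.  */
static lookup_fn get_cached(void)
{
    if (!atomic_load_explicit(&cache_ready, memory_order_acquire))
        return NULL;             /* not published yet */

    lookup_fn fct = cached_fct;
    assert(fct != NULL);         /* holds because of the acquire/release pair */
    return fct;
}
```

If the acquire load were replaced by a plain load, a weakly ordered CPU would
be allowed to satisfy the read of cached_fct before the read of cache_ready,
which is the pattern suspected in the disassembly.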

The crashes are rare and seen only on architectures which have weak memory
ordering, which also fits the pattern.

Unfortunately, I have no time left to dig further into this problem, so I can
only suggest a "dry" patch which I cannot check:

--- nss/getXXbyYY_r.c   2015-03-25 10:38:04.513841300 +0100
+++ nss/getXXbyYY_r.c   2015-03-25 16:25:05.973676800 +0100
@@ -149,6 +149,7 @@
                           EXTRA_PARAMS)
 {
   static bool startp_initialized;
+  bool is_initialized;
   static service_user *startp;
   static lookup_function start_fct;
   service_user *nip;
@@ -200,7 +201,11 @@
     }
 #endif

-  if (! startp_initialized)
+  /* make sure to read startp_initialized before
+   * startp and start_fct */
+  is_initialized = startp_initialized;
+  atomic_read_barrier();
+  if (!is_initialized)
     {
       no_more = DB_LOOKUP_FCT (&nip, REENTRANT_NAME_STRING,
                               REENTRANT2_NAME_STRING, &fct.ptr);
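
For readers without a glibc tree at hand, the control flow of the patched
function can be modeled roughly as follows. This is only a sketch under
assumptions: lookup_service and files_lookup are hypothetical, the
lookup_function signature is invented for the example, and C11 fences stand in
for glibc's atomic_read_barrier() and atomic_write_barrier(), which may expand
differently on a given architecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*lookup_function)(const char *name);  /* hypothetical signature */

/* Hypothetical backend standing in for what DB_LOOKUP_FCT would resolve. */
static int files_lookup(const char *name)
{
    return (name != NULL && name[0] != '\0') ? 0 : -1;
}

static int lookup_service(const char *name)
{
    static atomic_bool startp_initialized;
    static _Atomic(lookup_function) start_fct;
    lookup_function fct;

    /* Read the flag first, as the patch does with "is_initialized".  */
    bool is_initialized =
        atomic_load_explicit(&startp_initialized, memory_order_relaxed);
    /* Stand-in for atomic_read_barrier(): later loads may not be
       performed before the flag load above.  */
    atomic_thread_fence(memory_order_acquire);

    if (!is_initialized) {
        /* First call: resolve the backend and cache it.  */
        fct = files_lookup;
        atomic_store_explicit(&start_fct, fct, memory_order_relaxed);
        /* Stand-in for the existing atomic_write_barrier(): the cached
           pointer is written before the flag.  */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&startp_initialized, true,
                              memory_order_relaxed);
    } else {
        /* Later calls: the flag was seen set, so the cached pointer
           must be visible as well.  */
        fct = atomic_load_explicit(&start_fct, memory_order_relaxed);
    }

    assert(fct != NULL);
    return fct(name);
}
```

Several threads may run the initialization branch at the same time; in this
sketch they all store the same pointer, so the duplicated work is harmless.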

Regards,
Thomas Stuefe
