This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug network/18163] New: SIGSEGV in getservbyname_r - read barrier missing
- From: "thomas.stuefe at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Wed, 25 Mar 2015 15:35:50 +0000
- Subject: [Bug network/18163] New: SIGSEGV in getservbyname_r - read barrier missing
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=18163
Bug ID: 18163
Summary: SIGSEGV in getservbyname_r - read barrier missing
Product: glibc
Version: 2.21
Status: NEW
Severity: normal
Priority: P2
Component: network
Assignee: unassigned at sourceware dot org
Reporter: thomas.stuefe at gmail dot com
Dear all,
I'm looking into a crash on 64-bit PowerPC. We crash with a SIGSEGV inside
getservbyname_r(). The crash happened with glibc 2.11.3 on SUSE Linux.
I may have found the cause and, if I am right, it still persists in the latest
sources (I checked 2.21 as well as the git repository).
Similar crashes did happen for us on Linux ia64. The crashes are extremely rare
(once every month or so) and not reproducible. We run a large number of tests
every night on a range of platforms, and only powerpc64 and ia64 showed those
crashes.
I dug into the core file on powerpc and I think a race can occur because of a
missing read barrier.
The code is inside nss/getXXbyYY_r.c
I believe the value of "start_fct" is read before "startp_initialized".
When initializing, writes to "start_fct" and "startp" are followed
by a write barrier before "startp_initialized" is set (see getXXbyYY_r.c:246).
However, when reading those values, there is no read barrier which would force
"startp_initialized" to be read before "start_fct". The assembly generated by
gcc for this looks like this:
A) 0x00000fff9aa632cc <.getservbyname_r+236>: ld    r31,-13240(r2)
B) 0x00000fff9aa632d0 <.getservbyname_r+240>: lbz   r0,0(r31)
   0x00000fff9aa632d4 <.getservbyname_r+244>: cmpwi cr7,r0,0
   0x00000fff9aa632d8 <.getservbyname_r+248>: beq-  cr7,0xfff9aa63450 <.getservbyname_r+624>
C) 0x00000fff9aa632dc <.getservbyname_r+252>: ld    r11,16(r31)
D) 0x00000fff9aa632e0 <.getservbyname_r+256>: ld    r9,8(r31)
In (A) we get the address of "startp_initialized" from the TOC; in (B) we load
the value of "startp_initialized" and, if it is 0, branch away. The branch is
hinted with "-", so the processor is told to expect it not to be taken.
In (C) and (D), "start_fct" and "startp" are read from memory.
I believe that the loads in (C) and/or (D) can be performed before the load in
(B), so we could read stale values of "start_fct" and "startp", in my case NULL.
That could explain all the details in my core.
The crashes are rare and seen only on architectures which have weak memory
ordering, which also fits the pattern.
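To illustrate the ordering I have in mind, here is a minimal sketch of the
publish/consume pattern. It uses C11 <stdatomic.h> rather than the glibc-internal
barrier macros, so it is an analogue and not the actual glibc code. The names
dummy_lookup, publish and get_start_fct are hypothetical, "startp" is left out
for brevity, and a single initializing thread is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*lookup_function) (const char *name);

/* Hypothetical stand-ins for the static variables in getXXbyYY_r.c.  */
static atomic_bool startp_initialized;
static lookup_function start_fct;

/* Hypothetical lookup function, used only as a non-NULL payload.  */
static int
dummy_lookup (const char *name)
{
  return name != NULL ? 0 : -1;
}

/* Writer: store the payload first, then publish the flag.  The release
   store plays the role of the write barrier before the flag is set.  */
static void
publish (void)
{
  start_fct = dummy_lookup;
  atomic_store_explicit (&startp_initialized, true, memory_order_release);
}

/* Reader: load the flag with acquire ordering, so the later load of
   start_fct cannot be performed ahead of the flag load.  */
static lookup_function
get_start_fct (void)
{
  if (!atomic_load_explicit (&startp_initialized, memory_order_acquire))
    return NULL;  /* Not initialized yet; a caller would take the slow path.  */

  /* With the acquire load above, seeing the flag set implies seeing the
     payload written before the release store.  */
  assert (start_fct != NULL);
  return start_fct;
}
```

If the acquire load were replaced by a plain load (which is what the current
reader side amounts to), a weakly ordered CPU would be allowed to satisfy the
load of start_fct early, and the assert could fire.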
Unfortunately, I have no time left to dig further into this problem, so I can
only suggest a "dry" patch which I cannot check:
--- nss/getXXbyYY_r.c 2015-03-25 10:38:04.513841300 +0100
+++ nss/getXXbyYY_r.c 2015-03-25 16:25:05.973676800 +0100
@@ -149,6 +149,7 @@
                     EXTRA_PARAMS)
 {
   static bool startp_initialized;
+  bool is_initialized;
   static service_user *startp;
   static lookup_function start_fct;
   service_user *nip;
@@ -200,7 +201,11 @@
     }
 #endif
 
-  if (! startp_initialized)
+  /* make sure to read startp_initialized before
+   * startp and start_fct */
+  is_initialized = startp_initialized;
+  atomic_read_barrier();
+  if (!is_initialized)
     {
       no_more = DB_LOOKUP_FCT (&nip, REENTRANT_NAME_STRING,
                                REENTRANT2_NAME_STRING, &fct.ptr);
Regards,
Thomas Stuefe