This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug time/16145] localtime_r etc holds lock via __tz_convert


https://sourceware.org/bugzilla/show_bug.cgi?id=16145

Dave Gotwisner <dgotwisn at netscout dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dgotwisn at netscout dot com

--- Comment #7 from Dave Gotwisner <dgotwisn at netscout dot com> ---
gcc (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Linux nfw-devperf-01.newwireless.com 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul
31 17:20:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Libc version libc-2.12.so

Running perf top to look at what's going on in my app shows 7% of CPU in
__tz_convert and 5% in the spin lock, with no locks taken anywhere in my code.
This means we are spending almost as much time in the spin lock (without ever
waiting on it) as in the rest of __tz_convert.

The code in question has 3 identical threads, each in a tight loop iterating a
std::multimap (C++11 "for (auto &it : mymap)").  The loop body is a
significant number of lines of code, only a few of which call into libc (or
any other library).  What goes into the libraries are array accessors and
iterator accessors on the std::multimap (which takes no locks) and the
following code block:

    struct tm t;
    localtime_r(&nowTime.tv_sec, &t);

nowTime is a struct timeval (previously filled in by gettimeofday() calls in
another program run days before).

This is called once per record; my small test sample has 157,669,045 records.

For every record, we also call fwrite_unlocked(), which according to the
manual page (and its name) does not take a lock.
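
For reference, the output path looks roughly like this (the helper name and
record contents are invented for illustration; fwrite_unlocked is the GNU
variant of fwrite that skips the stream's internal lock, so each FILE must
only be written from one thread):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

/* Hypothetical per-record writer: same work as fwrite, but without
   taking the FILE lock.  */
static void
emit_record (FILE *out, const char *line)
{
  fwrite_unlocked (line, 1, strlen (line), out);
}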

In looking inside localtime_r via
http://code.metager.de/source/xref/gnu/glibc/time/localtime.c, the code
(quoted below for the localtime variant; localtime_r is identical except that
it fills the caller's buffer instead of the global _tmbuf) is as follows:

struct tm *
localtime (const time_t *t)
{
  return __tz_convert (t, 1, &_tmbuf);
}
libc_hidden_def (localtime)
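
For comparison, the sibling definition in the same file (shown here in ANSI
style) is:

struct tm *
__localtime_r (const time_t *t, struct tm *tp)
{
  return __tz_convert (t, 1, tp);
}
weak_alias (__localtime_r, localtime_r)

Both funnel into __tz_convert; the only difference is whose buffer gets
filled.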

Drilling down into __tz_convert (website
http://osxr.org:8080/glibc/source/time/tzset.c?v=glibc-2.18), we have the
following block of code:


   __libc_lock_lock (tzset_lock);

   /* Update internal database according to current TZ setting.
      POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
      This is a good idea since this allows at least a bit more parallelism.  */
   tzset_internal (tp == &_tmbuf && use_localtime, 1);

followed by much more code that accesses tp and calls various other __tz_*
routines.

The correct solution should be to take the lock only if tp == &_tmbuf (or the
equivalent for gmtime vs. gmtime_r).

The lock prevents the parallelism mentioned in the comment.

To test this, create a routine that loops for a long time calling localtime_r
with a given time_t value (it doesn't matter what the value is, as long as
it's valid), and run it in multiple threads.  Then watch the program with
"perf top -p <pid>", with enough permissions to see into the kernel.

A routine that uses a user-provided buffer and touches no shared resources
should never need a lock.  If "spin lock" shows up in the profile, the bug
exists; if it doesn't show up, the bug is fixed.  A sketch of such a test
follows.
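
A minimal sketch of that test (thread and iteration counts are arbitrary;
build with -pthread and watch with "perf top -p <pid>"):

#include <pthread.h>
#include <time.h>

#define NTHREADS 3
#define ITERATIONS 100000000UL

/* Each thread converts the same fixed, valid time_t over and over;
   localtime_r writes into a per-thread buffer, so in principle no
   shared state is involved.  */
static void *
hammer (void *arg)
{
  time_t when = 1400000000;     /* any valid time_t will do */
  struct tm result;
  for (unsigned long i = 0; i < ITERATIONS; i++)
    localtime_r (&when, &result);
  return NULL;
}

int
main (void)
{
  pthread_t tid[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&tid[i], NULL, hammer, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (tid[i], NULL);
  return 0;
}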



Looking at 2.24, we still have a similar problem.  We aren't doing any
timezone manipulation, so presumably __use_tzfile is 0.  The new critical
code block is:

  __libc_lock_lock (tzset_lock);

  /* Update internal database according to current TZ setting.
     POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
     This is a good idea since this allows at least a bit more parallelism.  */
  tzset_internal (tp == &_tmbuf && use_localtime, 1);

  if (__use_tzfile)
    __tzfile_compute (*timer, use_localtime, &leap_correction,
                      &leap_extra_secs, tp);
  else
    {
      if (! __offtime (timer, 0, tp))
        tp = NULL;
      else
        __tz_compute (*timer, tp, use_localtime);
      leap_correction = 0L;
      leap_extra_secs = 0;
    }

  __libc_lock_unlock (tzset_lock);


So, per the comment, tzset_internal shouldn't need to be locked.  __offtime()
should be safe: the only external variables/routines it accesses are
__set_errno() on error (not hit in my case) and __mon_yday[__isleap (y)], all
of which should be safe.  __tz_compute(), which runs when __offtime()
succeeds, might need a lock, but only around its access to tz_rules (I'm not
sure whether it does).  So it would be better to refactor the code to not
hold the lock across the whole conversion: lock only around __tz_compute(),
and possibly inside tzset_internal when its parameters indicate it.  That
way, the majority of the callers of localtime_r, gmtime, and gmtime_r are
unaffected.  Something like the sketch below.
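
A sketch of that refactoring, under the (unverified) assumption that
__offtime needs no lock while tzset_internal, __tzfile_compute, and
__tz_compute still do:

  /* Sketch: hold tzset_lock only around the pieces that touch shared
     TZ state, instead of across the whole conversion.  */
  __libc_lock_lock (tzset_lock);
  tzset_internal (tp == &_tmbuf && use_localtime, 1);

  if (__use_tzfile)
    {
      /* Keep the lock: tzset_internal can swap the parsed tzfile data
         out from under a concurrent __tzfile_compute.  */
      __tzfile_compute (*timer, use_localtime, &leap_correction,
                        &leap_extra_secs, tp);
      __libc_lock_unlock (tzset_lock);
    }
  else
    {
      __libc_lock_unlock (tzset_lock);

      if (! __offtime (timer, 0, tp))   /* reads only constant tables */
        tp = NULL;
      else
        {
          /* tz_rules is shared, so re-take the lock just for this.  */
          __libc_lock_lock (tzset_lock);
          __tz_compute (*timer, tp, use_localtime);
          __libc_lock_unlock (tzset_lock);
        }
      leap_correction = 0L;
      leap_extra_secs = 0;
    }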

With the 2.12 version of libc we use, almost half of the time inside
__tz_convert is a result of the spin lock (5% for the lock vs. 7% for the
rest of __tz_convert).

Also note that my numbers (the 5% vs. 7%) are from a single thread, so that
is the spin-lock overhead even when the lock never has to be waited on.
Multi-threaded apps with contention will be far worse: with 3 threads I see
11% in the spin lock and 7% in __tz_convert, versus the 5%/7% single-threaded.
We also spend 4% in __offtime.

For the single-threaded version, I have the following profile:

    44%    my code (which does a lot)
     8%    __strcmp_sse2 (from tzset_internal)
     5%    some STL stuff
     4%    __offtime
     2.5%  __tz_convert
     2%    __mempcpy (probably related to the timezone parser)
     2%    __tzfile_compute
     1%    __tzstring

plus a bunch of lesser stuff (some related to localtime and its children).
So, at a minimum from these numbers, 24.5% of my cycles are being consumed by
these library calls, while the rest of my code does a significant amount of
string formatting and I/O, the equivalent of an snprintf with about 15
numerical fields (int and double) plus a date/time output.  Given the number
of samples that will be processed in production (10s of billions, not 150
million), a 10% improvement in localtime_r is important; there aren't a whole
lot of wasted cycles left for the rest of it.  The multi-threaded version
will also be processing 10s of billions of records, which is why removing the
lock is critical.

