This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Malloc improvements


Anton Blanchard <anton@au1.ibm.com> writes:
> Looks good. Is the plan to ship around the *.out or *.wl files?

The trace files are one record per call, with path information that
tells us which code in glibc was used to service the event.  This has
great value for debugging, profiling, and ensuring test coverage.
However, the trace files are arch-specific (the downside of "fast and
binary"), so their usefulness is somewhat local.  I uploaded
a dumper program to let you convert them into the older (portable) text
format if they need to be saved or shared.
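
To give a rough idea, each record is something like this (field names
and sizes made up for illustration; the real layout differs):

  #include <stdio.h>
  #include <stdint.h>

  struct trace_record       /* fixed size, written in native byte order  */
  {
    uint32_t type;          /* malloc, free, realloc, ...  */
    uint32_t path;          /* which glibc code path serviced the call  */
    uint64_t size;          /* requested size, if any  */
    uint64_t ptr;           /* pointer involved - this is why the file
                               is arch-specific  */
  };

  /* The dumper is basically just this: read fixed records, print
     portable text, one line per call.  */
  int
  main (int argc, char **argv)
  {
    struct trace_record r;
    FILE *f = argc > 1 ? fopen (argv[1], "rb") : NULL;
    if (f == NULL)
      return 1;
    while (fread (&r, sizeof r, 1, f) == 1)
      printf ("%u %u %llu %#llx\n", r.type, r.path,
              (unsigned long long) r.size, (unsigned long long) r.ptr);
    fclose (f);
    return 0;
  }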

The workload file is specifically for the simulator.  It's portable and
smaller, but encoded such that it can be mmap'd and each running thread
has a separate region of the map to process.
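
In other words, something like this (the header layout here is made
up, and process_one_event stands in for the real decoder):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  struct wl_header              /* hypothetical on-disk header  */
  {
    uint32_t nthreads;
    struct { uint64_t offset, length; } region[];
  };

  /* Hypothetical decoder for one variable-size record; returns the
     position of the next record.  */
  extern const unsigned char *process_one_event (const unsigned char *p);

  static unsigned char *map;

  int
  load_workload (const char *file)
  {
    struct stat st;
    int fd = open (file, O_RDONLY);
    if (fd < 0 || fstat (fd, &st) < 0)
      return -1;
    map = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close (fd);
    return map == MAP_FAILED ? -1 : 0;
  }

  /* Each simulator thread walks only its own slice of the map, so no
     thread ever touches another thread's region.  */
  void
  run_thread (uint32_t tid)
  {
    const struct wl_header *h = (const void *) map;
    const unsigned char *p = map + h->region[tid].offset;
    const unsigned char *end = p + h->region[tid].length;
    while (p < end)
      p = process_one_event (p);
  }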

My plan, aside from having a tool for debugging, was to archive the
workload files somewhere and use them in a regression test framework so
that "performance" changes to malloc can be tested against
representatives of all the major memory-intensive applications we care
about - omnetpp, libreoffice, 389ds, postgresql, emacs, etc - without
having to actually run (or even install) those applications.  This is
especially handy for apps that require user interaction or multiple
hosts, or even a live environment.

It also allows us to capture a workload of a proprietary application
that has performance issues, without needing to obtain said application
- the person complaining can submit a workload from their trace, showing
that the bug can be reproduced against the workload.  It makes bug
reports more complete and stand-alone.

> There isn't much difference after compression, but it took ages to
> compress the *.out file. Not surprisingly the *.wl file compressed much
> faster.

Probably just due to size.  The information content of the two files is
the same, so I would expect the compressed versions to end up a similar
size.  The trace file is a fixed-record format with lots of zeros,
whereas the workload is a variable-size format.

> I was thinking about single threaded traces, perhaps we could avoid all
> the locking in that case. My tests show avoiding the locking is about 4x
> faster on the omnetpp trace on POWER8.

Hmmm... a single-threaded app shouldn't generate any inter-thread locks
in the simulator, since the locks only happen to synchronize pointers
which move from one thread to another.  Also, these locks would have to
happen in the original app too, so preserving them helps simulate the
"flavor" of the workload (sequence of events, overlapping calls when
multithreaded, paging, etc), but it's possible I have redundant locks
being generated.  Remember, the goal here isn't to simulate as fast as
possible; it's to capture a representation of the original application.
The original application did much more CPU work and locking, which
changed the timing of the calls (for multi-threaded apps, at least),
and while capturing the essence of multi-threaded timing is, well,
let's say impossible - leaving in some code to "represent" the timing
isn't that bad.
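
For the curious, the handoff is conceptually this (names made up; the
simulator's actual structures differ):

  #include <pthread.h>

  /* One slot per pointer that crosses threads; initialized with
     PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER.  */
  struct handoff
  {
    pthread_mutex_t lock;
    pthread_cond_t filled;
    void *ptr;
    int ready;
  };

  /* The allocating thread publishes its result...  */
  void
  publish (struct handoff *h, void *p)
  {
    pthread_mutex_lock (&h->lock);
    h->ptr = p;
    h->ready = 1;
    pthread_cond_signal (&h->filled);
    pthread_mutex_unlock (&h->lock);
  }

  /* ...and the thread replaying the matching free waits for it.
     A single-threaded trace never reaches either function.  */
  void *
  consume (struct handoff *h)
  {
    pthread_mutex_lock (&h->lock);
    while (!h->ready)
      pthread_cond_wait (&h->filled, &h->lock);
    void *p = h->ptr;
    pthread_mutex_unlock (&h->lock);
    return p;
  }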

Having said that, anything we can do to reduce the noise and make the
glibc calls a larger percentage of the runtime would be good, as long
as the flavor of the app is preserved.

> As well as the locking, the memory initialisation loops were showing
> up in profiles. Is there a reason for encoding the offset in
> free_wipe()? If not we can just use memset() which is much faster.

Oh, that's just for my debugging.  I was filling memory with known
constants so that my core dumps would show what was happening.
Debugging a crash
caused by wrong-sized assumptions about chunks is a lot easier when the
words in the chunk are numbered :-)
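
The debug fill is essentially this (names made up):

  #include <stddef.h>
  #include <stdint.h>

  static void
  free_wipe_debug (void *ptr, size_t size)
  {
    uintptr_t *w = ptr;
    size_t n = size / sizeof (uintptr_t);
    /* Stamp each word with its own index, so a core dump tells you
       exactly which word of the chunk a stray write landed on.  */
    for (size_t i = 0; i < n; i++)
      w[i] = i;
  }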

The "production" version has wmem touching one byte per cache line in
wmem, not filling the whole thing or filling on free.  Just wanted to
try to replicate some of the kernel paging activity if I could.
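
I.e. roughly this (64 is an assumption; glibc can tell you the real
line size via sysconf):

  #include <stddef.h>

  #define CACHE_LINE 64   /* or sysconf (_SC_LEVEL1_DCACHE_LINESIZE)  */

  static void
  wmem_touch (void *ptr, size_t size)
  {
    /* Dirty one byte per cache line - enough to force the kernel to
       page the memory in, without the cost of a full memset.  */
    volatile char *p = ptr;
    for (size_t i = 0; i < size; i += CACHE_LINE)
      p[i] = 0;
  }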

I haven't started adding command line options to the simulator yet, but
a patch to do so and use that to switch wmem/free_wipe between debug and
production mode is pre-approved if you feel like adding it.  I've been
in a bit of a rush to fix the bugs it's been finding, so I just hack the
sources and recompile while debugging ;-)

