19427 – Intern the strings used in Libabigail

Bug 19427 - Intern the strings used in Libabigail

Summary: Intern the strings used in Libabigail

Status:	RESOLVED FIXED

Alias:	None

Product:	libabigail
Classification:	Unclassified
Component:	default (show other bugs)
Version:	unspecified

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Dodji Seketeli

URL:
Keywords:

Depends on:
Blocks:	19355
	Show dependency tree / graph

Reported:	2016-01-05 10:25 UTC by Dodji Seketeli
Modified:	2016-02-24 23:01 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dodji Seketeli 2016-01-05 10:25:53 UTC

From the performance work related to bug #19355, it appears that we could significantly improve the performance of emitting the XML ABI representation of a binary by interning the strings (https://en.wikipedia.org/wiki/String_interning) used to represent the names and qualified names of types throughout the internal representation of the Library.

So the idea is to introduce a new abigail::intern_string type, make it inter-operable with the std::string used elsewhere in the library, provide a string intern pool that is accessible from the current instance of abigail::environment, and use that pool to create the interned strings for a given run of the library that uses a given environment.

Further profiling would tell us if other spots in the library would benefit from using string interning.

Comment 1 Dodji Seketeli 2016-02-09 15:58:17 UTC

So I started to work on this and I do have a working branch (named 'str-intern') with the necessary changes undocumented (yet).  You can browse it at https://sourceware.org/git/gitweb.cgi?p=libabigail.git;a=shortlog;h=refs/heads/dodji/str-intern.

I am going to post time and memory consumption comparison using the code base that is built with optimization (-O2).

So here is the resource usage of abidw --abidiff on r300_dri.so, for the master branch:

real => 5:03.31
user => 300.24
sys => 2.86
max mem => 4959036KB

And the resource usage for the str-intern branch:

real => 4:56.98
user => 294.12
sys => 2.65
max mem => 4617328KB

So, as you can see, it slightly improves the speed of this test (by 6 seconds), and significantly improves memory usage (saving more than 300 mega bytes of memory).

The problem is that, of smaller tests and tests that don't involve emitting abixml, things are a little bit slower, actually.  In other words, abidiff, for instance, becomes slightly slower.  The memory consumption savings are still there though.

That is, the cost of looking up strings in a hash table to ensure that each string exists in only one copy in the environment (this is string interning) makes the loading of abi corpora slower.  But then, comparing *strings* later becomes faster as comparing two strings amounts to just comparing two pointers.  But we need to compare a lot of strings to make up for the cost of interning them in the first place.  And the place where we compare strings the most at the moment is when we emit abixml (i.e, in abidw).

During decls comparisons it turns out we don't compare strings that much because we compare their types first.  And thanks to type canonicalization, comparing two types is very fast.  And as the majority of comparisons yield a negative result, we don't even get to compare the names of the decls.

So I am still not sure if I am going to incorporate this optimization in the end.  I *am* inclined to merge it, because it makes the library consume less memory, and it speeds up abixml writing, especially for big libraries.  In other words, it makes libabigail scale more.  But then it slows it slightly on small workloads (which are quite fast anyway).

I'll give this a little bit more thought.

But in the mean time, if you have some thoughts, please share them :-)

Comment 2 Dodji Seketeli 2016-02-24 15:36:32 UTC

I have finally merged the patch that performs this optimization.

It's commit https://sourceware.org/git/gitweb.cgi?p=libabigail.git;a=commit;h=cf8eba68c363c1d4159e8a9e4920b14d0d54bb9c in the master branch.

This should be available in the 1.0.rc3 release.

Comment 3 Michi Henning 2016-02-24 21:57:08 UTC

Not that this is any of my business :-)

The 2% speed difference is insignificant. No-one will care.

The memory savings are nice, but not *that* huge (around 7%). I'd be inclined to make this kind of change only if the impact on code complexity and maintainability is low.

I've worked with code bases that create their own string implementation in the past and, usually, it creates more problems than it solves. Often, the problem ends up being that the replacement cannot be easily exchanged with std::string and vice-versa, and then ends up costing more than it saves. So, whenever I see a string replacement class, I tend to look twice.

I don't know whether you can use boost in your code. If so, boost::flyweight might be a useful approach.

Comment 4 Dodji Seketeli 2016-02-24 23:01:22 UTC

> Not that this is any of my business :-)

It is, otherwise, this would not be an open project ;-)

> The 2% speed difference is insignificant. No-one will care.
>
> The memory savings are nice, but not *that* huge (around 7%). I'd be inclined
> to make this kind of change only if the impact on code complexity and
> maintainability is low.

Actually, someone brought to my attention a test case where we have 30%
speed increase ;-)

This is on the nss package, from mozilla.  That is why I merged it in.
In other words, it makes the library scale a lot more on packages with
deep aggregate type hierarchies where each aggregate type has a lot of data
members, and where sub-types reference each other.  If you also have a
lot of incomplete types in the mix, then you end up doing a *lot* of
type name comparisons.  And this is where interned strings help.

> I've worked with code bases that create their own string implementation in the
> past and, usually, it creates more problems than it solves. Often, the problem
> ends up being that the replacement cannot be easily exchanged with std::string
> and vice-versa, and then ends up costing more than it saves. So, whenever I see
> a string replacement class, I tend to look twice.

Right, I totally agree.

This is why I tried very hard to make abigail::interned_string to be
just a pointer to std::string, in essence.  It's not a new string.  The
idea is just that comparing two instances or interned_string amounts to
comparing the values of the pointers to std::string, as there is only
one copy of each distinct std::string in the entire environment now.

> I don't know whether you can use boost in your code.

As this thing should work on systems that are not updated that much, for
instance, el6, we don't use boost, no.