Bug 25951

Summary: support for parallel processing?
Product: dwz
Reporter: Samuel Thibault <samuel.thibault>
Component: default
Assignee: Nobody <nobody>
Status: NEW ---
Severity: enhancement
CC: dwz, jakub, mliska, sam, vries
Priority: P2
Version: unspecified
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed:
Attachments: Demonstrator patch
             Demonstrator source file using separate reaper/coordinator

Description Samuel Thibault 2020-05-08 12:52:43 UTC
Hello,

When applied to big packages (e.g. libreoffice), dwz takes a very long time, even though much of the work could be parallelized. Inter-ELF factorization would of course be difficult to parallelize, but runs without the -m option could probably be parallelized quite easily, and so could the first step of runs with -m, which deduplicates within each ELF file separately.

Samuel
Comment 1 Tom de Vries 2021-03-10 10:19:44 UTC
Created attachment 13297 [details]
Demonstrator patch

This demonstrator patch implements a simple form of multithreading, which only works without:
- multifile (-m)
- hardlink (-h)
- low-mem limit 0 (-l0)

If a file hits the low-mem limit during the parallel phase, it's rerun in low-mem mode after the parallel phase.
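
A minimal sketch of that two-phase scheme, not the patch itself: files are processed by a thread pool, and any file that hits the limit is deferred and rerun serially in low-mem mode. The names `process`, `LowMemLimitHit` and `run` are hypothetical, and `size` is only a stand-in for whatever quantity the low-mem limit is measured against.

```python
from concurrent.futures import ThreadPoolExecutor


class LowMemLimitHit(Exception):
    """Hypothetical signal that a file exceeded the low-mem limit."""


def process(path, size, limit, low_mem=False):
    """Stand-in for handling one file; not dwz's real logic."""
    if size > limit and not low_mem:
        raise LowMemLimitHit(path)
    return (path, "low-mem" if low_mem else "normal")


def run(files, limit, jobs=4):
    def attempt(item):
        path, size = item
        try:
            return process(path, size, limit)
        except LowMemLimitHit:
            return None                      # defer to the serial phase

    # Parallel phase: pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        first_pass = list(pool.map(attempt, files))

    # Serial phase: rerun only the deferred files, in low-mem mode.
    results = []
    for (path, size), res in zip(files, first_pass):
        if res is None:
            res = process(path, size, limit, low_mem=True)
        results.append(res)
    return results
```

Deferring the low-mem reruns until after the parallel phase keeps the memory-hungry files from overlapping with each other.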

It passes the test-suite.  There is only one thread-sanitizer warning left, for multiple assignment of dwz_oom to obstack_alloc_failed_handler.

I did a build of the libreoffice package on openSUSE with dwz disabled, harvested the resulting .debug files (in total 175 files, 685MB), and did a dwz run (without multifile) using those files.

With master:
...
maxmem: 714956
real: 17.77
user: 15.76
system: 0.50
...

With the patch on top of master:
...
maxmem: 1106516
real: 10.37
user: 20.59
system: 1.46
...

So the trade-off is as expected: lower wall-clock time, but higher peak memory.
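
To put numbers on it, the ratios can be worked out from the two measurements above. This is plain arithmetic on the reported figures; the variable names are just for illustration.

```python
# Figures copied from the two runs reported above.
master = {"maxmem": 714956, "real": 17.77, "user": 15.76, "system": 0.50}
patched = {"maxmem": 1106516, "real": 10.37, "user": 20.59, "system": 1.46}

# Wall-clock speedup, peak-memory growth, and total CPU-time growth.
speedup = master["real"] / patched["real"]
mem_ratio = patched["maxmem"] / master["maxmem"]
cpu_ratio = (patched["user"] + patched["system"]) / (
    master["user"] + master["system"]
)

print(f"wall-clock speedup: {speedup:.2f}x")
print(f"peak memory:        {mem_ratio:.2f}x")
print(f"total CPU time:     {cpu_ratio:.2f}x")
```

That is roughly a 1.7x wall-clock speedup for about 1.5x the peak memory, with total CPU time up by about a third.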

However, dwz has its low-mem mode to keep memory usage in check, so that it can still handle relatively large files on 32-bit systems. On those systems the trade-off may not be advantageous. We could address this by not enabling parallel processing on such systems.

OTOH, we could also spawn processes instead of threads.  That means the per-process peak memory does not increase.  It would also mean less messy code changes (not having to use __thread all over the place).

An initial version that wouldn't deal with multifile (like this demonstrator patch) wouldn't need many changes.  A version that would support multifile would need a switch to indicate the location of the dwz.debug_info etc. files.  So, something like:
...
$ dwz -m 3 1 2
 create temp dir /tmp/abcdef
 spawn dwz 1 --multifile-dir /tmp/abcdef
 spawn dwz 2 --multifile-dir /tmp/abcdef
 wait for 2 spawned processes to finish ...
 spawned dwz 1 - compressing
 spawned dwz 2 - compressing
 spawned dwz 1 - multifile write (using dir /tmp/abcdef)
 spawned dwz 2 - multifile write (using dir /tmp/abcdef)
 spawned dwz 1 - done
 spawned dwz 2 - done
 waiting done
 multifile optimize (using files in /tmp/abcdef)
 multifile read
 multifile finalize 1
 multifile finalize 2
...
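
The spawn-and-wait part of that outline could be sketched as follows. This is only an illustration under stated assumptions: it forks directly instead of spawning real dwz processes, `compress_and_write` is a hypothetical stand-in for a child's work, and the `.part` file naming is invented for the example.

```python
import os
import tempfile


def compress_and_write(name, multifile_dir):
    """Stand-in for a spawned dwz: 'compress' one file and write its
    multifile contribution into the shared directory."""
    with open(os.path.join(multifile_dir, f"{name}.part"), "w") as f:
        f.write(f"contribution of {name}\n")


def run_parallel(names):
    with tempfile.TemporaryDirectory() as multifile_dir:
        children = []
        for name in names:
            pid = os.fork()
            if pid == 0:                     # child: do the work, then exit
                status = 1
                try:
                    compress_and_write(name, multifile_dir)
                    status = 0
                finally:
                    os._exit(status)         # never return into the parent's code
            children.append(pid)

        # Wait for all spawned processes to finish.
        for pid in children:
            _, status = os.waitpid(pid, 0)
            if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
                raise RuntimeError(f"child {pid} failed")

        # Stand-in for "multifile optimize": read what the children wrote.
        parts = []
        for name in names:
            with open(os.path.join(multifile_dir, f"{name}.part")) as f:
                parts.append(f.read())
        return parts
```

Because each child is a separate process, its memory is released when it exits, and no thread-local storage is needed in the worker code.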
Comment 2 Tom de Vries 2021-03-23 20:22:21 UTC
Posted RFC: https://sourceware.org/pipermail/dwz/2021q1/001166.html
Comment 3 Tom de Vries 2021-03-26 11:47:25 UTC
(In reply to Tom de Vries from comment #2)
> Posted RFC: https://sourceware.org/pipermail/dwz/2021q1/001166.html

And committed at https://sourceware.org/git/?p=dwz.git;a=commit;h=7755593c86b701547ec276320533efc3e4c165f3 .

Note that this still does not apply when multifile is used.
Comment 4 Jakub Jelinek 2021-03-26 11:51:30 UTC
For multifile, perhaps each fork could fill in its own set of multifiles and then they'd be merged together before being processed.
But we need to ensure reproducibility, so the order in which the multifile chunks from different programs/shared libraries are merged back needs to be independent of the number of forks.
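
One way to get that property, sketched here as an assumption rather than as what dwz does: tag each chunk with the position of its input file on the command line and merge by that position instead of by arrival order. `worker` and `merged_multifile` are hypothetical names, and the shuffle only mimics forks finishing in an unpredictable order.

```python
import random


def worker(job_files):
    """Stand-in for one fork: produce one chunk per assigned file."""
    return [(index, f"chunk({name})") for index, name in job_files]


def merged_multifile(names, jobs, seed=0):
    indexed = list(enumerate(names))
    # Distribute files over the forks round-robin (any split would do).
    per_job = [indexed[j::jobs] for j in range(jobs)]
    outputs = [worker(job_files) for job_files in per_job]
    # Forks finish in an unpredictable order; shuffle to mimic that.
    random.Random(seed).shuffle(outputs)
    chunks = [chunk for out in outputs for chunk in out]
    # Key point: order by command-line position, not by arrival.
    chunks.sort(key=lambda c: c[0])
    return [chunk for _, chunk in chunks]
```

With this ordering, `merged_multifile(names, 1)` and `merged_multifile(names, 8)` return the same list for the same `names`.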
Comment 5 Tom de Vries 2021-03-26 16:42:54 UTC
(In reply to Jakub Jelinek from comment #4)
> For multifile, perhaps each fork could fill in its own set of multifiles and
> then they'd be merged together before being processed.
> But we need to ensure reproducibility, so the order in which the multifile
> chunks from different programs/shared libraries are merged back needs to be
> independent of the number of forks.

I've posted a first parallel+multifile implementation that does not yet have reproducibility (though it does have reproducible compression AFAIU): https://sourceware.org/pipermail/dwz/2021q1/001197.html .
Comment 6 Tom de Vries 2021-03-31 07:18:45 UTC
(In reply to Tom de Vries from comment #3)
> (In reply to Tom de Vries from comment #2)
> > Posted RFC: https://sourceware.org/pipermail/dwz/2021q1/001166.html
> 
> And committed at
> https://sourceware.org/git/?p=dwz.git;a=commit;h=7755593c86b701547ec276320533efc3e4c165f3 .
> 
> Note that this still does not apply when multifile is used.

And committed: https://sourceware.org/git/?p=dwz.git;a=commit;h=64ea1adcda52d22f00f17e219bc8e023b62b9a03 .

Now -j works for multifile as well, provided -e and -p are used.
Comment 7 Tom de Vries 2021-04-12 08:22:20 UTC
Created attachment 13362 [details]
Demonstrator source file using separate reaper/coordinator

(In reply to Tom de Vries from comment #6)
> Now -j works for multifile as well, provided -e and -p are used.

For the last step, to make multifile work with -j without -e/-p, the communication scheme needs to be more elaborate.

The parent needs to both:
- reap the children
- communicate with the children about the multifile

It cannot do both tasks in a blocking fashion. It could do them in a non-blocking fashion, but then it has to busy-wait, which wastes CPU.

The solution I came up with is to have the parent spawn a separate process, the coordinator.

Then the job of the parent is to reap children.

The job of the coordinator is to communicate with the children about the multifile: each child requests permission to contribute to the multifile, stating its endianness/pointer-size type.  The coordinator replies whether and when that's OK.

When the parent reaps a child, it notifies the coordinator to ensure that the coordinator is not stuck waiting for a request from that child.
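
A rough sketch of this scheme, under several assumptions that are not taken from the attachment: messages travel over pipes in a made-up fixed-size format, the coordinator grants permission in file-index order and to one child at a time, and the first granted child fixes the multifile's type. All names (`MSG`, `child`, `coordinator`, `spawn`, `run`) are hypothetical.

```python
import os
import struct
import tempfile

# Hypothetical message: kind (b"R" = request, b"X" = child exited), child
# index, and an integer standing in for the endian/pointer-size type.
# Writes this small are atomic on a pipe (they are below PIPE_BUF).
MSG = struct.Struct("=cii")


def child(idx, type_id, req_w, reply_r, multifile):
    """Ask the coordinator for permission, then contribute if allowed."""
    os.write(req_w, MSG.pack(b"R", idx, type_id))
    if os.read(reply_r, 1) == b"Y":          # blocks until the coordinator answers
        with open(multifile, "a") as f:
            f.write(f"{idx}\n")


def coordinator(n, req_r, reply_ws):
    """Grant permission in index order, one child at a time (assumed policy)."""
    pending, exited = {}, set()
    multifile_type = None
    turn, answered = 0, False
    while turn < n:
        kind, idx, type_id = MSG.unpack(os.read(req_r, MSG.size))
        if kind == b"R":
            pending[idx] = type_id
        else:
            exited.add(idx)
        while turn < n:
            if turn in exited:               # finished or died: move on
                turn, answered = turn + 1, False
            elif turn in pending and not answered:
                if multifile_type is None:   # first granted child fixes the type
                    multifile_type = pending[turn]
                ok = pending[turn] == multifile_type
                os.write(reply_ws[turn], b"Y" if ok else b"N")
                answered = True
            else:
                break                        # block on the next message


def spawn(fn, *args):
    """Fork and run fn(*args) in the child; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fn(*args)
            status = 0
        finally:
            os._exit(status)
    return pid


def run(types):
    n = len(types)
    req_r, req_w = os.pipe()                 # children and parent -> coordinator
    replies = [os.pipe() for _ in range(n)]  # coordinator -> each child
    tmp_fd, multifile = tempfile.mkstemp()
    os.close(tmp_fd)
    coord = spawn(coordinator, n, req_r, [w for _, w in replies])
    pids = {}
    for i, t in enumerate(types):
        pids[spawn(child, i, t, req_w, replies[i][0], multifile)] = i
    remaining = set(pids) | {coord}
    while remaining:                         # the parent only reaps
        pid, _status = os.wait()
        remaining.discard(pid)
        if pid in pids:                      # tell the coordinator who is gone
            os.write(req_w, MSG.pack(b"X", pids[pid], 0))
    for fd in [req_r, req_w] + [end for pair in replies for end in pair]:
        os.close(fd)
    with open(multifile) as f:
        order = [int(line) for line in f]
    os.unlink(multifile)
    return order
```

For example, `run([0, 0, 1, 0])` lets children 0, 1 and 3 contribute in that order and denies child 2, whose type differs. Both the parent (in `os.wait`) and the coordinator (in `os.read`) block, so neither busy-waits, and the exit notification keeps the coordinator from waiting on a child that is already gone.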