Bug 27557 - Use threading at sub-file level
Summary: Use threading at sub-file level
Status: NEW
Alias: None
Product: dwz
Classification: Unclassified
Component: default (show other bugs)
Version: unspecified
: P2 enhancement
Target Milestone: ---
Assignee: Nobody
URL:
Keywords:
Depends on: 27568
Blocks:
  Show dependency treegraph
 
Reported: 2021-03-10 13:36 UTC by Tom de Vries
Modified: 2024-02-03 13:30 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Demonstrator patch (2.62 KB, patch)
2021-03-10 14:31 UTC, Tom de Vries
Details | Diff
Updated demonstrator patch (3.57 KB, patch)
2021-03-10 15:11 UTC, Tom de Vries
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tom de Vries 2021-03-10 13:36:25 UTC
Using parallelism we can speed up dwz.

We can introduce parallelism at file-level.  PR25951 is about that.  The problem there is that doing so also increases memory footprint.

We can also use parallelism at sub-file level: we still operate on one file at a time (so we don't increase memory footprint), but try to use threads to speed things up.
Comment 1 Tom de Vries 2021-03-10 14:31:12 UTC
Created attachment 13300 [details]
Demonstrator patch

Uses 4 threads to do checksum_die in parallel. Runs into trouble with loc_htab/loclists_htab, so uses mutexes to guard accesses to those.

Passes testsuite with -fsanitize=threads.

Results on cc1: same memory usage, 20% more usertime, 8.7% less realtime.
...
real:
series:  5590 5620 5670 5710 5680 5620 5660 5650 6040 5680
mean:  5692.00 (100%)
stddev:  127.26
series:  5130 5180 5160 5140 5230 5170 5140 5600 5110 5100
mean:  5196.00 (91.29%)
stddev:  146.76

user:
series:  5300 5320 5330 5420 5370 5320 5310 5330 5330 5370
mean:  5340.00 (100%)
stddev:  36.21
series:  6460 6420 6440 6450 6530 6430 6420 6460 6430 6360
mean:  6440.00 (120.60%)
stddev:  42.69

sys:
series:  280 290 330 280 300 290 340 310 350 300
mean:  307.00 (100%)
stddev:  24.97
series:  280 350 310 290 310 330 320 340 280 340
mean:  315.00 (102.61%)
stddev:  25.50

mem:
series:  1260412 1260472 1260468 1260352 1260532 1260528 1260592 1260416 1260416 1260352
mean:  1260454.00 (100%)
stddev:  79.29
series:  1260148 1260332 1260144 1260156 1260016 1260012 1260508 1260152 1260120 1260256
mean:  1260184.40 (99.98%)
stddev:  148.57
...

Known problem: When marking inter-CU refs in read_exprloc, we need to ensure that there's no other thread accessing that DIE.  Should not be typical input though.
Comment 2 Tom de Vries 2021-03-10 15:11:17 UTC
Created attachment 13301 [details]
Updated demonstrator patch

Now uses 8 threads, also does checksum_ref_die threaded.

Slightly better results: realtime now 14.7% less:
...
real:
series:  6130 6150 6160 6140 6150 6200 6230 6170 6220 6240
mean:  6179.00 (100%)
stddev:  40.12
series:  5220 5200 5220 5330 5220 5190 5430 5210 5200 5510
mean:  5273.00 (85.34%)
stddev:  112.55

user:
series:  5720 5790 5830 5730 5790 5850 5880 5820 5820 5840
mean:  5807.00 (100%)
stddev:  50.78
series:  7200 7170 7160 7240 7210 7090 7120 7160 7140 7070
mean:  7156.00 (123.23%)
stddev:  53.17

sys:
series:  400 340 320 390 350 340 340 340 390 390
mean:  360.00 (100%)
stddev:  29.06
series:  350 350 370 340 300 330 400 330 300 370
mean:  344.00 (95.56%)
stddev:  31.34

mem:
series:  1260484 1260480 1260420 1260412 1260416 1260540 1260420 1260548 1260656 1260420
mean:  1260479.60 (100%)
stddev:  80.78
series:  1260928 1260940 1260732 1260876 1260696 1260740 1260808 1260924 1260760 1260996
mean:  1260840.00 (100.03%)
stddev:  105.54
...

Additional known problem: Inter-CU references during checksum_ref_die.  Again, should be rare.
Comment 3 Tom de Vries 2021-03-12 12:35:38 UTC
An interesting question is whether find_dups can be parallelized.

The common resource that is used, is the dup_htab. For an actual insertion, the htab as a whole needs to be locked.

For comparing, comparing DIEs with different checksums can be done in parallel.  
So, we could fold the ref_checksum to a byte, and use that as in index into an array of 256 locks.  The problem however is the optimistic marking of DIEs as duplicates when following references.  Marking those optimistically would also require acquiring the lock for the corresponding checksum, which opens the door to deadlock problems. We could mitigate for that by detecting cycles upfront, and avoiding parallel processing of those, but that would again reduce effectivity.
Comment 4 Tom de Vries 2021-03-12 12:39:43 UTC
In order to assess effectiveness better, we need --devel-progress to emit realtime, in addition to user and sys.

Then we can determine how much the reduction in realtime is of the .debug_info reading phase, rather than of the overall realtime as reported by the time command.