Using parallelism we can speed up dwz. Parallelism can be introduced at the file level, which is what PR25951 is about; the problem there is that it also increases the memory footprint. Alternatively, we can use parallelism at the sub-file level: we still operate on one file at a time (so the memory footprint does not grow), but use threads to speed up processing of that file.
Created attachment 13300 [details]
Demonstrator patch

Uses 4 threads to run checksum_die in parallel. It runs into trouble with loc_htab/loclists_htab, so it uses mutexes to guard accesses to those. Passes the testsuite with -fsanitize=thread.

Results on cc1: same memory usage, 20% more user time, 8.7% less real time:
...
real:
  series: 5590 5620 5670 5710 5680 5620 5660 5650 6040 5680
  mean: 5692.00 (100%)  stddev: 127.26
  series: 5130 5180 5160 5140 5230 5170 5140 5600 5110 5100
  mean: 5196.00 (91.29%)  stddev: 146.76
user:
  series: 5300 5320 5330 5420 5370 5320 5310 5330 5330 5370
  mean: 5340.00 (100%)  stddev: 36.21
  series: 6460 6420 6440 6450 6530 6430 6420 6460 6430 6360
  mean: 6440.00 (120.60%)  stddev: 42.69
sys:
  series: 280 290 330 280 300 290 340 310 350 300
  mean: 307.00 (100%)  stddev: 24.97
  series: 280 350 310 290 310 330 320 340 280 340
  mean: 315.00 (102.61%)  stddev: 25.50
mem:
  series: 1260412 1260472 1260468 1260352 1260532 1260528 1260592 1260416 1260416 1260352
  mean: 1260454.00 (100%)  stddev: 79.29
  series: 1260148 1260332 1260144 1260156 1260016 1260012 1260508 1260152 1260120 1260256
  mean: 1260184.40 (99.98%)  stddev: 148.57
...

Known problem: when marking inter-CU refs in read_exprloc, we need to ensure that no other thread is accessing that DIE. This should not occur in typical input, though.
Created attachment 13301 [details]
Updated demonstrator patch

Now uses 8 threads, and also runs checksum_ref_die threaded. Slightly better results: real time now 14.7% less:
...
real:
  series: 6130 6150 6160 6140 6150 6200 6230 6170 6220 6240
  mean: 6179.00 (100%)  stddev: 40.12
  series: 5220 5200 5220 5330 5220 5190 5430 5210 5200 5510
  mean: 5273.00 (85.34%)  stddev: 112.55
user:
  series: 5720 5790 5830 5730 5790 5850 5880 5820 5820 5840
  mean: 5807.00 (100%)  stddev: 50.78
  series: 7200 7170 7160 7240 7210 7090 7120 7160 7140 7070
  mean: 7156.00 (123.23%)  stddev: 53.17
sys:
  series: 400 340 320 390 350 340 340 340 390 390
  mean: 360.00 (100%)  stddev: 29.06
  series: 350 350 370 340 300 330 400 330 300 370
  mean: 344.00 (95.56%)  stddev: 31.34
mem:
  series: 1260484 1260480 1260420 1260412 1260416 1260540 1260420 1260548 1260656 1260420
  mean: 1260479.60 (100%)  stddev: 80.78
  series: 1260928 1260940 1260732 1260876 1260696 1260740 1260808 1260924 1260760 1260996
  mean: 1260840.00 (100.03%)  stddev: 105.54
...

Additional known problem: inter-CU references during checksum_ref_die. Again, these should be rare.
An interesting question is whether find_dups can be parallelized. The shared resource there is dup_htab. For an actual insertion, the htab as a whole needs to be locked, but DIEs with different checksums can be compared in parallel. So we could fold the ref checksum to a byte and use that as an index into an array of 256 locks. The problem, however, is the optimistic marking of DIEs as duplicates when following references: marking those optimistically would also require acquiring the lock for the corresponding checksum, which opens the door to deadlocks. We could mitigate that by detecting cycles up front and excluding them from parallel processing, but that would again reduce effectiveness.
In order to assess effectiveness better, we need --devel-progress to emit real time in addition to user and sys time. Then we can determine how large the reduction in real time is relative to the .debug_info reading phase, rather than relative to the overall real time as reported by the time command.