Bug 25276 - create_import_tree performance improvements
Summary: create_import_tree performance improvements
Status: RESOLVED FIXED
Alias: None
Product: dwz
Classification: Unclassified
Component: default
Version: unspecified
Importance: P2 enhancement
Target Milestone: ---
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-13 15:55 UTC by Tom de Vries
Modified: 2019-12-17 23:30 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description Tom de Vries 2019-12-13 15:55:30 UTC
The clang-10 test-case, as attached in PR25024, contains ~111 million DIEs.

This is larger than the default max die-limit of 50 million, but we can get past that limit using -Lnone.

Using -Lnone, as well as -lnone to avoid low-mem mode, we get an execution time of:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone
maxmem: 14141844
real: 148.32
user: 131.79
system: 7.07
...

When additionally compiling with -DDEVEL, we get:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone
maxmem: 14069588
real: 148.35
user: 132.95
system: 7.31
...

When compiling with the odr patches applied (but running with --no-odr), we see slightly more memory used:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr
maxmem: 14605608
real: 528.92
user: 142.62
system: 16.78
...
but real time grows by a factor of 3.5, presumably because we're running into swap.  User time only grows by 7% though.

Finally, enabling odr in its basic mode increases user time by a factor of 3.3, 336 seconds extra in total:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=basic
maxmem: 14513848
real: 533.34
user: 478.39
system: 12.06
...

Using a patch that adds --devel-progress, we can see where the time is spent.

Without odr:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr --devel-progress
read_debug_info .debug_info
user: 58.74
sys : 1.84
partition_dups
user: 1.17
sys : 0.01
partition_dups qsort
user: 1.35
sys : 0.00
partition_dups after qsort
user: 1.37
sys : 0.35
create_import_tree phase 1
user: 0.36
sys : 0.02
create_import_tree phase 2
user: 1.33
sys : 0.00
create_import_tree phase 3
user: 20.15
sys : 0.00
create_import_tree phase 4 (create partial units)
user: 0.09
sys : 0.00
read_debug_info .debug_types
user: 1.29
sys : 0.00
compute_abbrevs
user: 36.34
sys : 0.24
write_abbrev
user: 0.09
sys : 0.00
write_info
user: 11.87
sys : 0.72
write_loc
user: 0.74
sys : 0.68
write_types
user: 0.00
sys : 0.00
write_gdb_index
user: 0.25
sys : 0.32
write_dso
user: 0.19
sys : 2.00
maxmem: 15172216
real: 174.49
user: 135.33
system: 6.41
...

And with odr:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=basic --devel-progress   
read_debug_info .debug_info
user: 43.07
sys : 1.60
partition_dups
user: 1.88
sys : 0.01
partition_dups split_dups
user: 93.43
sys : 0.04
partition_dups qsort
user: 4.10
sys : 0.00
partition_dups after qsort
user: 3.71
sys : 0.19
create_import_tree phase 1
user: 0.89
sys : 0.01
create_import_tree phase 2
user: 48.29
sys : 0.00
create_import_tree phase 3
user: 222.30
sys : 0.00
create_import_tree phase 4 (create partial units)
user: 0.32
sys : 0.03
read_debug_info .debug_types
user: 1.35
sys : 0.00
compute_abbrevs
user: 30.36
sys : 0.09
write_abbrev
user: 0.08
sys : 0.00
write_info
user: 5.70
sys : 0.11
write_loc
user: 0.80
sys : 0.65
write_types
user: 0.00
sys : 0.00
write_gdb_index
user: 0.14
sys : 0.17
write_dso
user: 0.00
sys : 1.27
maxmem: 15491032
real: 465.45
user: 456.42
system: 4.29
...

Focusing on the largest contributors, without odr:
...
read_debug_info .debug_info
user: 58.74
create_import_tree phase 3
user: 20.15
compute_abbrevs
user: 36.34
write_info
user: 11.87
...
and with odr:
...
read_debug_info .debug_info
user: 43.07
sys : 1.60
partition_dups split_dups
user: 93.43
sys : 0.04
create_import_tree phase 2
user: 48.29
sys : 0.00
create_import_tree phase 3
user: 222.30
sys : 0.00
compute_abbrevs
user: 30.36
sys : 0.09
write_info
user: 5.70
sys : 0.11
...

It's clear something is slow in odr: the split_dups bit takes 93s.

But create_import_tree phase 3 increases from 20s to 222s.

Looking at the perf data, we spend 51% of the time in dwz, and of that, 43% is
spent here, in the subset/superset edge comparison loop:
...
Percent│                    e4 = ipu2->incoming;
  0,01 │        mov    %r14,%rax
       │                    while ((maybe_subset || maybe_superset) && e3 && e4)
  0,00 │1395:   mov    %esi,%edi
  0,01 │        or     %cl,%dil
  0,01 │      ↓ je     13d8
  0,02 │139c:   test   %rdx,%rdx
  0,00 │      ↓ je     1730
  0,07 │        test   %rax,%rax
  0,00 │      ↓ je     1730
       │                        if (e3->icu == e4->icu)
  0,53 │        mov    (%rdx),%r8
 35,28 │        mov    (%rax),%rdi
  0,62 │        cmp    %rdi,%r8
  0,01 │      ↓ je     1718
       │                        if (e3->icu->idx < e4->icu->idx)
  7,37 │        mov    0x28(%rdi),%edi
  0,55 │        cmp    %edi,0x28(%r8)
  0,04 │      ↓ jb     1708
       │                        maybe_superset = false;
...
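
Roughly, that loop looks like this (a simplified reconstruction from the perf
annotation above, not the exact dwz source; the minimal struct definitions are
only there to make the sketch self-contained).  The 35% sample sits on the
load of e4->icu, which suggests the cost is mostly pointer-chasing through the
edge chains:
...
/* Hypothetical, minimal versions of the structures involved; the real dwz
   structs have more fields.  */
struct import_cu { unsigned int idx; };
struct import_edge
{
  struct import_cu *icu;
  struct import_edge *next;
};

/* e3 walks ipu's incoming edges, e4 walks ipu2's incoming edges; both chains
   are sorted by icu->idx.  The first referrer present in only one of the two
   chains refutes the corresponding hypothesis.  */
while ((maybe_subset || maybe_superset) && e3 && e4)
  {
    if (e3->icu == e4->icu)
      {
        /* Common referrer, consistent with both hypotheses.  */
        intersection++;
        e3 = e3->next;
        e4 = e4->next;
      }
    else if (e3->icu->idx < e4->icu->idx)
      {
        /* ipu has a referrer that ipu2 lacks: not a subset.  */
        maybe_subset = false;
        e3 = e3->next;
      }
    else
      {
        /* ipu2 has a referrer that ipu lacks: not a superset.  */
        maybe_superset = false;
        e4 = e4->next;
      }
  }
...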

This is a draft of a short-cut that might prevent going into the hot loop:
...
@@ -7683,6 +7684,26 @@ create_import_tree (void)
              maybe_subset = (e1 == ipu->incoming
                              && ipu->incoming_count <= ipu2->incoming_count);
              maybe_superset = ipu->incoming_count >= ipu2->incoming_count;
+             if (maybe_superset)
+               {
+                 /* If the referrer nodes of ipu are a superset of the
+                    referrer nodes of ipu2, then ipu's last referrer node
+                    should have index larger or equal to the last referrer
+                    node of ipu2.  */
+                 maybe_superset
+                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
+                      >= ipu2->incoming[ipu2->incoming_count - 1].icu->idx);
+               }
+             if (maybe_subset)
+               {
+                 /* If the referrer nodes of ipu are a subset of the
+                    referrer nodes of ipu2, then ipu's last referrer node
+                    should have index smaller or equal to the last referrer
+                    node of ipu2.  */
+                 maybe_subset
+                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
+                      <= ipu2->incoming[ipu2->incoming_count -1].icu->idx);
+               }
              e3 = e1;
              e4 = ipu2->incoming;
              intersection = 0;
...
but it doesn't work currently because we cannot access
ipu->incoming[ipu->incoming_count - 1] at this point; instead, we would need
to walk the next chain.  Keeping track of the tail of the edge chain would fix
this, and might yield a speed improvement.
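
As an illustration of that last idea, assuming a hypothetical incoming_tail
field that is kept pointing at the last edge whenever an incoming chain is
extended (the name is made up, and this is not necessarily what gets
committed), the shortcut would reduce to:
...
if (maybe_superset)
  /* A superset must reach at least as far as ipu2's largest referrer
     index.  */
  maybe_superset = (ipu->incoming_tail->icu->idx
                    >= ipu2->incoming_tail->icu->idx);
if (maybe_subset)
  /* A subset cannot contain a referrer index larger than ipu2's largest.  */
  maybe_subset = (ipu->incoming_tail->icu->idx
                  <= ipu2->incoming_tail->icu->idx);
...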

Furthermore, we spent 28% here:
...
  7,02 │1481:   movb   $0x0,0x2c(%rax)
       │            for (icu = ipu->next; icu; icu = icu->next)
 28,04 │        mov    0x18(%rax),%rax
  0,00 │        test   %rax,%rax
  0,87 │      ↑ jne    1481
...
which is this loop here:
...
      for (icu = ipu->next; icu; icu = icu->next)
        icu->seen = false;
...

It might be possible to improve here by keeping track of this data in another
way, say an array of bool or a bitvector, which could be memset to zero. It
would also be interesting to know how sparse this array is, which could also
be exploited.
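
As a rough sketch of the array-of-bool variant, assuming the import CUs are
densely numbered 0 .. npus-1 (npus and the helper names here are illustrative,
not taken from the dwz source):
...
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: one flat flag per import CU, indexed by icu->idx, instead of
   a 'seen' member stored in each node.  */
static bool *seen;

static void
init_seen (unsigned int npus)
{
  seen = (bool *) calloc (npus, sizeof (bool));
}

/* Marking stays O(1): seen[icu->idx] = true.  The reset, which used to walk
   the whole chain:

     for (icu = ipu->next; icu; icu = icu->next)
       icu->seen = false;

   becomes a single sequential memset:  */
static void
clear_seen (unsigned int npus)
{
  memset (seen, 0, npus * sizeof (bool));
}
...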
Comment 1 Tom de Vries 2019-12-17 11:01:29 UTC
(In reply to Tom de Vries from comment #0)
> This is a draft of a short-cut that might prevent going into the hot loop:
> ...
> @@ -7683,6 +7684,26 @@ create_import_tree (void)
>               maybe_subset = (e1 == ipu->incoming
>                               && ipu->incoming_count <=
> ipu2->incoming_count);
>               maybe_superset = ipu->incoming_count >= ipu2->incoming_count;
> +             if (maybe_superset)
> +               {
> +                 /* If the referrer nodes of ipu are a superset of the
> +                    referrer nodes of ipu2, then ipu's last referrer node
> +                    should have index larger or equal to the last referrer
> +                    node of ipu2.  */
> +                 maybe_superset
> +                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
> +                      >= ipu2->incoming[ipu2->incoming_count - 1].icu->idx);
> +               }
> +             if (maybe_subset)
> +               {
> +                 /* If the referrer nodes of ipu are a subset of the
> +                    referrer nodes of ipu2, then ipu's last referrer node
> +                    should have index smaller or equal to the last referrer
> +                    node of ipu2.  */
> +                 maybe_subset
> +                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
> +                      <= ipu2->incoming[ipu2->incoming_count -1].icu->idx);
> +               }
>               e3 = e1;
>               e4 = ipu2->incoming;
>               intersection = 0;
> ...
> but it doesn't work currently because we cannot access
> ipu->incoming[ipu->incoming_count - 1] at this point; instead, we would
> need to walk the next chain.  Keeping track of the tail of the edge chain
> would fix this, and might yield a speed improvement.
> 

Committed here: https://sourceware.org/git/?p=dwz.git;a=commit;h=83addcc7c2eadeffb4fe27e433695d316101f6d1
Comment 2 Tom de Vries 2019-12-17 12:23:19 UTC
(In reply to Tom de Vries from comment #0)
> But create_import_tree phase 3 increases from 20s to 222s.
> 

With current master, we get with --no-odr:
...
create_import_tree phase 3
user: 5.42
sys : 0.00
...
and with --odr --odr-mode=basic:
...
create_import_tree phase 3
user: 104.19
sys : 0.00
...
Comment 3 Tom de Vries 2019-12-17 12:32:04 UTC
(In reply to Tom de Vries from comment #0)
> Looking at the perf data, we spend 51% of the time in dwz, 

> Furthermore, we spent 28% here:
> ...
>   7,02 │1481:   movb   $0x0,0x2c(%rax)
>        │            for (icu = ipu->next; icu; icu = icu->next)
>  28,04 │        mov    0x18(%rax),%rax
>   0,00 │        test   %rax,%rax
>   0,87 │      ↑ jne    1481
> ...
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 

With current master, we now spend 41% of the execution time in dwz, and of that 41%, we spend 59% in this loop:
...
  1,39 │2895:┌─→test   %rax,%rax
  0,01 │     │↑ je     1fcd
       │     │        icu->seen = false;
 11,10 │     │  movb   $0x0,0x34(%rax)
       │     │      for (icu = ipu->next; icu; icu = icu->next)
 47,87 │     │  mov    0x20(%rax),%rax
  0,36 │     └──jmp    2895
...
Comment 4 Tom de Vries 2019-12-17 13:08:13 UTC
(In reply to Tom de Vries from comment #0)
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 
> It might be possible to improve here by keeping track of this data in another
> way, say an array of bool or a bitvector, which could be memset to zero. It
> would also be interesting to know how sparse this array is, which could also
> be exploited.

I've measured the occupancy of such an array on the clang-10 example with --odr --odr-mode=basic, and got an average of 1.93%.
Comment 5 Tom de Vries 2019-12-17 13:25:26 UTC
(In reply to Tom de Vries from comment #0)
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 
> It might be possible to improve here by keeping track of this data in another
> way, say an array of bool or a bitvector, which could be memset to zero.

The array of bool + memset approach reduces create_import_tree phase 3 time from 104.19 to 23.00.
Comment 6 Tom de Vries 2019-12-17 14:01:05 UTC
(In reply to Tom de Vries from comment #5)
> (In reply to Tom de Vries from comment #0)
> > which is this loop here:
> > ...
> >       for (icu = ipu->next; icu; icu = icu->next)
> >         icu->seen = false;
> > ...
> > 
> > It might be possible to improve here by keeping track of this data in another
> > way, say an array of bool or a bitvector, which could be memset to zero.
> 
> The array of bool + memset approach reduces create_import_tree phase 3 time
> from 104.19 to 23.00.

Keeping track of min_seen and max_seen, and only memsetting the affected region gets us 21.77.
Comment 7 Tom de Vries 2019-12-17 22:24:01 UTC
(In reply to Tom de Vries from comment #6)
> (In reply to Tom de Vries from comment #5)
> > (In reply to Tom de Vries from comment #0)
> > > which is this loop here:
> > > ...
> > >       for (icu = ipu->next; icu; icu = icu->next)
> > >         icu->seen = false;
> > > ...
> > > 
> > > It might be possible to improve here by keeping track of this data in another
> > > way, say an array of bool or a bitvector, which could be memset to zero.
> > 
> > The array of bool + memset approach reduces create_import_tree phase 3 time
> > from 104.19 to 23.00.
> 
> Keeping track of min_seen and max_seen, and only memsetting the affected
> region gets us 21.77.

And replacing the bool array with a bitvector with base type unsigned int gets us to 11.66.
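
For reference, a sketch of what that final shape could look like, folding in
the comment #6 idea of only clearing the touched range (the names are
illustrative, not necessarily the committed change):
...
#include <stdbool.h>
#include <string.h>

#define BITS_PER_WORD (8 * sizeof (unsigned int))

/* Hypothetical bitvector variant: one bit per import CU, plus min/max
   tracking of the words actually touched, so the reset only clears that
   range.  seen_bits is calloc'ed up front and seen_min_word starts out as
   seen_nwords (i.e. "nothing seen").  */
static unsigned int *seen_bits;
static unsigned int seen_min_word, seen_max_word, seen_nwords;

static void
set_seen (unsigned int idx)
{
  unsigned int word = idx / BITS_PER_WORD;
  seen_bits[word] |= 1U << (idx % BITS_PER_WORD);
  if (word < seen_min_word)
    seen_min_word = word;
  if (word > seen_max_word)
    seen_max_word = word;
}

static bool
is_seen (unsigned int idx)
{
  return (seen_bits[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1;
}

static void
clear_seen (void)
{
  /* Only clear the words that were actually set.  */
  if (seen_min_word <= seen_max_word)
    memset (seen_bits + seen_min_word, 0,
            (seen_max_word - seen_min_word + 1) * sizeof (unsigned int));
  seen_min_word = seen_nwords;
  seen_max_word = 0;
}
...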
Comment 8 Tom de Vries 2019-12-17 23:30:02 UTC
(In reply to Tom de Vries from comment #0)
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 
> It might be possible to improve here by keeping track of this data in another
> way, say an array of bool or a bitvector, which could be memset to zero. 

Committed as https://sourceware.org/git/?p=dwz.git;a=commit;h=94e78911f39d3c256331d396def9dcedc80b9f65 .