Performance issue with systemd-coredump and container process linking 2000 shared libraries.

Mark Wielaard mark@klomp.org
Wed Jun 21 16:24:49 GMT 2023


Hi Romain,

On Tue, 2023-06-20 at 22:05 +0000, Romain GEISSLER wrote:
> 
> Our real use case happens on a Openshift 4.13 node, so the OS is Red Hat Core OS 9 (which I assume shares a lot of foundations with RHEL 9).
> 
> On our side Francois also told me this afternoon that he didn’t really reproduce the same thing with my reproducer posted here and the real systemd-coredump issue he witnessed live, and also noticed that with DEBUGINFOD_URLS unset/set to the empty string my reproducer has no problem anymore. What he witnessed on the real case (using perf/gdb) was that apparently lots of time was spend in elf_getdata_rawchunk and often in this kind of stack:
> 
> Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 16468500000                                                                                                                                 
> Overhead  Command         Shared Object             Symbol                                                                                                                                                 
>   98.24%  (sd-parse-elf)  libelf-0.188.so           [.] elf_getdata_rawchunk
>    0.48%  (sd-parse-elf)  libelf-0.188.so           [.] 0x00000000000048a3
>    0.27%  (sd-parse-elf)  libelf-0.188.so           [.] gelf_getphdr
>    0.11%  (sd-parse-elf)  libc.so.6                 [.] _int_malloc
>    0.10%  (sd-parse-elf)  libelf-0.188.so           [.] gelf_getnote
>    0.06%  (sd-parse-elf)  libc.so.6                 [.] __libc_calloc
>    0.05%  (sd-parse-elf)  [kernel.kallsyms]         [k] __softirqentry_text_start
>    0.05%  (sd-parse-elf)  libc.so.6                 [.] _int_free
> 
> 
> (gdb) bt
> #0  0x00007f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
> #1  0x00007f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
> #2  0x00007f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
> #3  0x00007f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
> #4  0x0000562c474f2d5e in submit_coredump ()
> #5  0x0000562c474f57d1 in process_socket.constprop ()
> #6  0x0000562c474efbf8 in main ()
> 
> My reproducer actually doesn’t fully re-implement what systemd implements (the parsing of the package metadata is clearly omitted), so I thought I had reproduced the same problem while apparently I didn’t, sorry for that. We will also have to double check if really just using 2000 dummy libraries is enough or if this also needs to have a more complex binary like we have in our real case.
> 
> Tomorrow on our side we will have to play a bit with a local build of systemd-coredump and try to run it manually to better understand what’s going wrong.
> 

Seeing those performance results I understand why you were suspecting
the linked list data structure used in elf_getdata_rawchunk.

Would you be able to test a rebuild libelf with the attached patch,
which replaces that datastructure with a binary search tree?

It didn't really show much speedup locally (in the noise, maybe 0.01
sec faster on ~0.25 sec run). But if there are more than 2000 calls to
elf_getdata_rawchunk it should make things faster.

Thanks,

Mark
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libelf-Replace-list-of-elf_getdata_rawchunk-results-.patch
Type: text/x-patch
Size: 7186 bytes
Desc: 
URL: <https://sourceware.org/pipermail/elfutils-devel/attachments/20230621/a2c6e49b/attachment.bin>


More information about the Elfutils-devel mailing list