Performance issue with systemd-coredump and container process linking 2000 shared libraries.

Romain Geissler romain.geissler@amadeus.com
Wed Jun 14 13:39:20 GMT 2023


Hi,

I am reaching out the elfutils mailing to discuss one issue we have seen in one
of our OpenShift production workload. Note that if you are a Red Hat employee,
this actually corresponds to Red Hat support case 03536980 which I have yet to
update with this informations.

Description of the issue:
While moving from OpenShift 4.12 to 4.13, we also move the system from a RHEL 8
base to a RHEL 9 one. Among the changes, it upgrades systemd to version 252.
Historically systemd-coredump use to mind only about buildids when analysing
core dumps, however now it also tries to parse the package metadata note section
(cf https://systemd.io/ELF_PACKAGE_METADATA/).

We use sytemd-coredump as our default core handler, and being an OpenShift node
we run there mainly containers. As part of our container, we have a particular
binary which links against close to 2000 shared libraries. And when this program
crashes, it takes several minutes for systemd-coredump to "analyze" the core,
which is spent in parsing the elf part of it.

I have been able to extract from the systemd code (cf
https://github.com/systemd/systemd/blob/22afd5cb2cf06418e5a84907b1ff782d29a33e47/src/shared/elf-util.c#L482 )
the logic of systemd into a simplified C program that links against systemd to
show how "slow" is systemd. I will attach this program a bit after, and also
explain a bit my findings and what "fix" I suggested internally we propose to
systemd folks.

First to reproduce, we can create a dummy binary that links against 2000 lib.
Credits for finding and writing this simple reproducer goes to Francois Rigault,
one of my colleague:

cat > lib.c << EOF
int TOTO = 42;
EOF

cat > generate-core.c << EOF
#include <stdio.h>
int main() {
        printf("started\n");
        int* i = 0;
        *i = 42;
}
EOF

N_LIBS=2000
LIBS=""
for i in {1..${N_LIBS}}; do
        [[ -f liba$i.so ]] || gcc -shared -fPIC -o liba$i.so lib.c -D TOTO=var$i
done
echo libs built

gcc -o generate-core generate-core.c -L. -Wl,-rpath,$(pwd) -la{1..${N_LIBS}}
echo running binary
time ./generate-core

# Here a core is generated, rename it to "the-core" in the current directory.

# Finally, remove all libaN.so files, to simulate the fact that systemd-coredump
# runs in a different mount filesystem as the container process which crashed
# (otherwise the following reproducer might be "fast"):
rm liba*.so



Once a core dump has been generated, I tried to mimic the code of systemd-coredump.
It looks like this:

cat > mimic-systemd-coredump.c << EOF
#include "elfutils/libdwfl.h"
#include <fcntl.h>
#include <assert.h>

static const Dwfl_Callbacks callbacks = {
    .find_elf = dwfl_build_id_find_elf,
    .section_address = dwfl_offline_section_address,
    .find_debuginfo = dwfl_standard_find_debuginfo,
};

void print_elf_error()
{
    printf("Elf error: %s\n", elf_errmsg(elf_errno()));
}

void print_dwfl_error()
{
    printf("Dwfl error: %s\n", dwfl_errmsg(dwfl_errno()));
}

// In Systemd, this tries to retrieve the Json package metadata note.
// returns a value > 0 when data is found
// returns 0 when no data is found, but no error happened
// returns a value < 0 when a strong error happen and we have to stop.
int parse_module_package_metadata_best_effort(Elf* elf, const char* name)
{
    size_t n_program_headers;
    size_t r = elf_getphdrnum(elf, &n_program_headers);

    if (r == 0)
    {
        printf("Found %lli program header for module %s\n", n_program_headers, name);
        // Don't implement the full systemd logic here.

        return 1;
    }
    else
    {
        // Best effort parsing: silently ignore the error.
        return 0;
    }
}

static int module_callback(Dwfl_Module *mod __attribute__ ((unused)),
        void **userdata __attribute__ ((unused)),
        const char *name __attribute__ ((unused)),
        Dwarf_Addr start, void *arg)
{
    Elf* elf = NULL;

    GElf_Addr bias;
    // Comment the next line to "improve" performances.
    elf = dwfl_module_getelf(mod, &bias);

    if (elf)
    {
        printf("dwfl_module_getelf has found the elf program %s\n", name);
        int ret = parse_module_package_metadata_best_effort(elf, name);

        if (ret > 0)
            return DWARF_CB_OK;
        else if (ret < 0)
            return DWARF_CB_ABORT;
        else
            elf = arg;
    }
    else
    {
        printf("dwfl_module_getelf returned NULL for module %s\n", name);
        //print_dwfl_error();
        elf = arg;
    }

    size_t n_program_headers;
    assert(elf_getphdrnum(elf, &n_program_headers) >= 0);
    printf("iterate: %s: %lli\n", name, n_program_headers);

    for (size_t i = 0; i < n_program_headers; ++i) {
        GElf_Phdr mem, *program_header;
        Elf_Data *data;

        /* The core file stores the ELF files in the PT_LOAD segment. */
        program_header = gelf_getphdr(elf, i, &mem);
        if (!program_header || program_header->p_type != PT_LOAD)
            continue;

        /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
        if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
        {
            //printf("skip !!!\n");
            continue;
        }

        /* Now get a usable Elf reference, and parse the notes from it. */
        data = elf_getdata_rawchunk(elf,
                program_header->p_offset,
                program_header->p_filesz,
                ELF_T_NHDR);
        if (!data)
            continue;

        Elf *memelf = elf_memory(data->d_buf, data->d_size);
        if (!memelf)
        {
            elf_end(memelf);
            continue;
        }

        int ret = parse_module_package_metadata_best_effort(memelf, name);
        elf_end(memelf);

        if (ret > 0)
            return DWARF_CB_OK;
        else if (ret < 0)
            return DWARF_CB_ABORT;
    }

    return DWARF_CB_OK;
}

int main()
{
    elf_version(EV_CURRENT);

    int fd = open ("the-core", O_RDONLY);
    assert(fd != 0);

    Elf *elf = elf_begin(fd, ELF_C_READ_MMAP, NULL);
    assert(elf);

    int ret;
    GElf_Ehdr elf_header;
    assert(gelf_getehdr(elf, &elf_header) != NULL);
    printf("elf_header.e_type = %i\n", elf_header.e_type);
    assert(elf_header.e_type == ET_CORE);

    Dwfl *dwfl = dwfl_begin(&callbacks);

    ret = dwfl_core_file_report(dwfl, elf, NULL);
    printf("dwfl_core_file_report: %i\n", ret);
    assert(ret >= 0);

    ret = dwfl_report_end(dwfl, NULL, NULL);
    assert(ret == 0);

    ptrdiff_t offset = dwfl_getmodules (dwfl, &module_callback, elf, 0);
    if (offset < 0)
        assert(false);

    dwfl_end (dwfl);

    return 0;
}
EOF


And it can be compiled with:
gcc -O2 -o mimic-systemd-coredump  mimic-systemd-coredump.c -lelf -ldw


When running ./mimic-systemd-coredump we can see that the call to dwfl_module_getelf
always return NULL for libaN.so files, so we fallback on the systemd logic to try
to parse the package metadata section from the core directly. It takes minutes
to run the full program. If however I comment the call to dwfl_module_getelf, then
it runs in less than a second.

My understanding of the will of systemd developers is that they hoped that libdwfl would
return some "partial" Elf* reference when calling dwfl_module_getelf, from the elf
headers found in the core for each and every shared library (the first page of the
PT_LOAD mappings that the kernel always dumps even when the mapping is file backed).
However it seems that behind the hood it doesn't (is it linked to core_file_read_eagerly
which seems to always return false in this case ?), and instead it uses the
.find_elf = dwfl_build_id_find_elf callback which tries to find the file by buildid
on the filesystem. For some unknown reason to me, calling dwfl_module_getelf is very
slow (I wouldn't expect that looking on the filesytem by buildid is that slow actually).

So, is this behavior of dwfl_module_getelf expected ? If yes, do you agree that we shall
advise systemd-coredump developer to invert their logic, to first try to look for partial
elf header from the core's PT_LOAD section, then only fallback to the more reliable
dwfl_module_getelf if it didn't work ? In practice, we have tried the following patch
applied to systemd v253 and it seems ot "fix" the above mentionned case:


commit 9463e1f22cbaaf68de9b547da243e70f1b2f622a
Author: Romain Geissler <romain.geissler@amadeus.com>
Date:   Wed Jun 14 11:59:31 2023 +0000

    First try to fix core dump issue on v253.

diff --git a/src/shared/elf-util.c b/src/shared/elf-util.c
index 8da16f528f..fadb3e81a3 100644
--- a/src/shared/elf-util.c
+++ b/src/shared/elf-util.c
@@ -503,26 +503,13 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name,
         if (r < 0)
                 return DWARF_CB_ABORT;
 
-        /* The .note.package metadata is more difficult. From the module, we need to get a reference
-         * to the ELF object first. We might be lucky and just get it from elfutils. */
-        elf = sym_dwfl_module_getelf(mod, &bias);
-        if (elf) {
-                r = parse_package_metadata(name, id_json, elf, NULL, c);
-                if (r < 0)
-                        return DWARF_CB_ABORT;
-                if (r > 0)
-                        return DWARF_CB_OK;
-        } else
-                elf = c->elf;
-
-        /* We did not get the ELF object, or it's just a reference to the core. That is likely
-         * because we didn't get direct access to the executable, and the version of elfutils does
-         * not yet support parsing it out of the core file directly.
-         * So fallback to manual extraction - get the PT_LOAD section from the core,
+        /* First try with the (maybe partial) ELF object we can get from the core itself. Indeed
+         * the core might run in a different namespace in which case the dwfl_module_getelf
+         * function to try to open the file by build id in the current mount namespace.
+         * So first use manual extraction - get the PT_LOAD section from the core,
          * and if it's the right one we can interpret it as an Elf object, and parse
          * its notes manually. */
-
-        r = sym_elf_getphdrnum(elf, &n_program_headers);
+        r = sym_elf_getphdrnum(c->elf, &n_program_headers);
         if (r < 0) {
                 log_warning("Could not parse number of program headers from core file: %s",
                             sym_elf_errmsg(-1)); /* -1 retrieves the most recent error */
@@ -536,12 +523,16 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name,
                 Elf_Data *data;
 
                 /* The core file stores the ELF files in the PT_LOAD segment. */
-                program_header = sym_gelf_getphdr(elf, i, &mem);
+                program_header = sym_gelf_getphdr(c->elf, i, &mem);
                 if (!program_header || program_header->p_type != PT_LOAD)
                         continue;
 
+                /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
+                if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
+                        continue;
+
                 /* Now get a usable Elf reference, and parse the notes from it. */
-                data = sym_elf_getdata_rawchunk(elf,
+                data = sym_elf_getdata_rawchunk(c->elf,
                                                 program_header->p_offset,
                                                 program_header->p_filesz,
                                                 ELF_T_NHDR);
@@ -555,7 +546,18 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name,
                 if (r < 0)
                         return DWARF_CB_ABORT;
                 if (r > 0)
-                        break;
+                        return DWARF_CB_OK;
+        }
+
+        /* Retrieving package_metadata only from the core fiel failed. Try to to get a reference
+         * to the ELF object instead. We might be lucky and just get it from elfutils or the filesystem. */
+        elf = sym_dwfl_module_getelf(mod, &bias);
+        if (elf) {
+                r = parse_package_metadata(name, id_json, elf, NULL, c);
+                if (r < 0)
+                        return DWARF_CB_ABORT;
+                if (r > 0)
+                        return DWARF_CB_OK;
         }
 
         return DWARF_CB_OK;


Some other side question: on the long run, wouldn't it make sense that elfutils tries to parse the
json package metadata section by itself, just like it does for the buildid, rather than implementing
this logic in systemd ?

Thanks for reading ;)

Cheers,
Romain


More information about the Elfutils-devel mailing list