Bug 29976 - webapi connection pool eats all file handles
Status: RESOLVED FIXED
Product: elfutils
Classification: Unclassified
Component: debuginfod
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2023-01-09 18:14 UTC by Ross Burton
Modified: 2023-01-11 15:34 UTC

Description Ross Burton 2023-01-09 18:14:52 UTC
If I start debuginfod without any concurrency limits:

[Mon Jan  9 17:40:14 2023] (2356243/2356243): libmicrohttpd error: Failed to create worker inter-thread communication channel: Too many open files

My machine has 256 cores, and stracing debuginfod shows that it fails to open more files after creating 510 epoll fds (twice):

epoll_create1(EPOLL_CLOEXEC)            = 1021
epoll_ctl(1021, EPOLL_CTL_ADD, 3, {events=EPOLLIN, data={u32=4027013664, u64=187651148175904}}) = 0
epoll_ctl(1021, EPOLL_CTL_ADD, 1020, {events=EPOLLIN, data={u32=2965961632, u64=281473647704992}}) = 0
mmap(NULL, 8454144, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xfff6b97b0000
mprotect(0xfff6b97c0000, 8388608, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone(child_stack=0xfff6b9fbea00, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2361982], tls=0xfff6b9fbf880, child_tidptr=0xfff6b9fbf210) = 2361982
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 1022
epoll_create1(EPOLL_CLOEXEC)            = 1023
epoll_ctl(1023, EPOLL_CTL_ADD, 3, {events=EPOLLIN, data={u32=4027014456, u64=187651148176696}}) = 0
epoll_ctl(1023, EPOLL_CTL_ADD, 1022, {events=EPOLLIN, data={u32=2965961632, u64=281473647704992}}) = 0
mmap(NULL, 8454144, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xfff6b8fa0000
mprotect(0xfff6b8fb0000, 8388608, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone(child_stack=0xfff6b97aea00, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2361983], tls=0xfff6b97af880, child_tidptr=0xfff6b97af210) = 2361983
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = -1 EMFILE (Too many open files)

ulimit -n is 1024. Do I really need more just to start debuginfod if I have 256 cores? As the number of web connections is 2x threads and each one appears to use two fds, maybe I do.

Should the connection pool have a hard limit when using the default? I doubt 512 incoming connections would be usual, and if that is needed then the user can specify -C.
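
The arithmetic behind that guess can be sketched in a few lines. This is only an illustration of the reasoning in the description, not debuginfod code: the function names are hypothetical, the "two fds per worker" figure comes from the eventfd2/epoll_create1 pair in the trace, and `reserved=4` is an assumption covering descriptors that are already open before the workers start (for example fds 0 to 3).

```python
def worker_fds(cores, threads_per_core=2, fds_per_thread=2):
    """File descriptors wanted by a pool of cores * threads_per_core workers."""
    return cores * threads_per_core * fds_per_thread


def max_workers(nofile_limit, fds_per_thread=2, reserved=4):
    """Workers that fit under a 'ulimit -n' style limit.

    reserved is an assumed count of descriptors already in use
    before any worker is created.
    """
    return max(0, (nofile_limit - reserved) // fds_per_thread)


print(worker_fds(256))    # 1024: the whole soft limit, before anything else is opened
print(max_workers(1024))  # 510
```

Under those assumptions a 256-core machine asks for 1024 descriptors for the pool alone, while a soft limit of 1024 leaves room for 510 workers, which matches the count reported above.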
Comment 1 Frank Ch. Eigler 2023-01-09 19:02:32 UTC
What sets "ulimit -n -> 1000" in your case?
Comment 2 Ross Burton 2023-01-09 19:56:05 UTC
Honestly, no idea.  Appears to be the default on Ubuntu.
Comment 3 Ross Burton 2023-01-09 19:59:56 UTC
Yes, kernel defaults: 1024 soft, 4096 hard.

I *can* change it to 4096 but there's still the point that:

1) debugging the failure case isn't trivial
2) cores*2 threads in the connection pool probably doesn't scale linearly
Comment 4 Frank Ch. Eigler 2023-01-09 20:05:02 UTC
I assume "debuginfod -C $num -c $num" still works for you, in this battle of distro/site defaults.
Comment 5 Ross Burton 2023-01-09 20:20:09 UTC
Yes.

My use case is a test that uses debuginfod, so it works everywhere and as it only has to service a few requests I'm just passing -C2 -c2.
Comment 6 Frank Ch. Eigler 2023-01-10 23:04:00 UTC
please check out commit 7399e3bd7eb72d045 on elfutils.git for a test patch
Comment 7 Ross Burton 2023-01-11 11:44:11 UTC
Looks good to me!
Comment 8 Frank Ch. Eigler 2023-01-11 15:34:33 UTC
Pushed to master as dcb40f9caa7ca30

Author: Frank Ch. Eigler <fche@redhat.com>
Date:   Tue Jan 10 17:59:35 2023 -0500

    debuginfod PR29975 & PR29976: decrease default concurrency
    
    ... based on rlimit (ulimit -n NUM)
    ... based on cpu-affinity (taskset -c A,B,C,D ...)
    
    Signed-off-by: Frank Ch. Eigler <fche@redhat.com>
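
A minimal sketch of the idea named in the commit message, bounding the default thread count by both the CPU affinity mask and the open-file limit. This is not the actual patch (debuginfod is written in C++): the function names, `fds_per_thread=2`, and `reserved=4` are illustrative assumptions. Only `os.sched_getaffinity`, `os.cpu_count`, and `resource.getrlimit` are real standard-library calls.

```python
import os
import resource


def clamp_concurrency(wanted, nofile_soft, fds_per_thread=2, reserved=4):
    """Bound a desired thread count by the file-descriptor budget.

    nofile_soft is the soft RLIMIT_NOFILE value, or None when unlimited.
    fds_per_thread and reserved are assumptions, not values from the patch.
    """
    if nofile_soft is None:  # no fd limit: keep the requested count
        return max(1, wanted)
    fd_bound = (nofile_soft - reserved) // fds_per_thread
    return max(1, min(wanted, fd_bound))


def detect_default_concurrency():
    """Pick a default from the CPUs this process may use and 'ulimit -n'."""
    try:
        # Honours restrictions such as 'taskset -c A,B,C,D'; Linux-only.
        cpus = len(os.sched_getaffinity(0))
    except AttributeError:
        cpus = os.cpu_count() or 1
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        soft = None
    return clamp_concurrency(cpus, soft)
```

With these assumed parameters, `clamp_concurrency(512, 1024)` yields 510 and `clamp_concurrency(4, 1024)` yields 4, so small machines are unaffected and only very wide machines with a low limit are capped.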