If I start debuginfod without any concurrency limits:

  [Mon Jan 9 17:40:14 2023] (2356243/2356243): libmicrohttpd error: Failed to create worker inter-thread communication channel: Too many open files

My machine has 256 cores, and stracing debuginfod shows that it fails to open more files after creating 510 epoll fds (twice):

  epoll_create1(EPOLL_CLOEXEC) = 1021
  epoll_ctl(1021, EPOLL_CTL_ADD, 3, {events=EPOLLIN, data={u32=4027013664, u64=187651148175904}}) = 0
  epoll_ctl(1021, EPOLL_CTL_ADD, 1020, {events=EPOLLIN, data={u32=2965961632, u64=281473647704992}}) = 0
  mmap(NULL, 8454144, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xfff6b97b0000
  mprotect(0xfff6b97c0000, 8388608, PROT_READ|PROT_WRITE) = 0
  rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
  clone(child_stack=0xfff6b9fbea00, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2361982], tls=0xfff6b9fbf880, child_tidptr=0xfff6b9fbf210) = 2361982
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK) = 1022
  epoll_create1(EPOLL_CLOEXEC) = 1023
  epoll_ctl(1023, EPOLL_CTL_ADD, 3, {events=EPOLLIN, data={u32=4027014456, u64=187651148176696}}) = 0
  epoll_ctl(1023, EPOLL_CTL_ADD, 1022, {events=EPOLLIN, data={u32=2965961632, u64=281473647704992}}) = 0
  mmap(NULL, 8454144, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xfff6b8fa0000
  mprotect(0xfff6b8fb0000, 8388608, PROT_READ|PROT_WRITE) = 0
  rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
  clone(child_stack=0xfff6b97aea00, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2361983], tls=0xfff6b97af880, child_tidptr=0xfff6b97af210) = 2361983
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK) = -1 EMFILE (Too many open files)

ulimit -n is 1024; do I really need more than that just to start debuginfod on a 256-core machine?
Since the web connection pool is 2x the thread count and each connection appears to use two fds, maybe I do. Should the connection pool have a hard limit when using the default? I doubt 512 simultaneous incoming connections would be usual, and if that many are really needed the user can specify -C.
What sets "ulimit -n" to 1024 in your case?
Honestly, no idea. It appears to be the default on Ubuntu.
Yes, kernel defaults: 1024 soft, 4096 hard. I *can* change it to 4096, but two points remain: 1) debugging the failure case isn't trivial, and 2) a connection pool of cores*2 threads probably doesn't scale linearly anyway.
I assume "debuginfod -C $num -c $num" still works for you, in this battle of distro/site defaults.
Yes. My use case is a test that uses debuginfod; since it only has to service a few requests, I'm just passing -C2 -c2 so it works everywhere.
Please check out commit 7399e3bd7eb72d045 on elfutils.git for a test patch.
Looks good to me!
Pushed to master as dcb40f9caa7ca30:

  Author: Frank Ch. Eigler <fche@redhat.com>
  Date:   Tue Jan 10 17:59:35 2023 -0500

      debuginfod PR29975 & PR29976: decrease default concurrency

      ... based on rlimit (ulimit -n NUM)
      ... based on cpu-affinity (taskset -c A,B,C,D ...)

      Signed-off-by: Frank Ch. Eigler <fche@redhat.com>