Bug 29696 - intermittent libmicrohttpd assertion failures related to socket fd closing
Summary: intermittent libmicrohttpd assertion failures related to socket fd closing
Status: RESOLVED FIXED
Alias: None
Product: elfutils
Classification: Unclassified
Component: debuginfod (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Ryan Goldberg
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-17 14:15 UTC by Frank Ch. Eigler
Modified: 2023-06-16 15:23 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
Project(s) to access:
ssh public key:


Attachments
Patch for 29696 (1.13 KB, patch)
2023-06-16 13:45 UTC, Ryan Goldberg
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Frank Ch. Eigler 2022-10-17 14:15:50 UTC
In a range of libmicrohttpd versions, up to and including libmicrohttpd-0.9.75-3.fc36.x86_64, debuginfod occasionally crashes with messages like:

https://builder.sourceware.org/testrun/920819ee86861130393e12933821c5b544afeee4?filename=tests%2Frun-debuginfod-federation-metrics.sh.log#line1669

Fatal error in GNU libmicrohttpd daemon.c:3831: Failed to remove FD from epoll set.

Even without MHD_USE_EPOLL, a nearly identical message can come from a different code path.
Comment 1 Ryan Goldberg 2023-06-16 13:45:04 UTC
Created attachment 14933 [details]
Patch for 29696

The debuginfod cache config was using fdopen and then calling both fclose on the file stream & close on the original file descriptor. Since the fd is not dup'ed, this led to a race condition where if that fd was reused (by microhttpd) we'd end up prematurely closing their socket leading to the above issue.
Comment 2 Mark Wielaard 2023-06-16 13:53:46 UTC
Very nice find. How did you catch this btw?
Are there any tools that help find such a "double closes"?
If not maybe we can teach valgrind --track-fds=yes about it, which already can track fd leaks, so it shouldn't be too hard to make it also detect double/bad closes.

The patch seems obviously correct to me. Nice to now log close () failures, which should help catch similar issues early.

Small nitpick. The "}else{" is a bit of a style break with the rest of the code, which would say:
...
  }
else
  {
...
Comment 3 Ryan Goldberg 2023-06-16 14:42:27 UTC
I noticed that the issue was happening in run-debuginfod-federation-metrics.sh so to reproduce I was playing with sending lots of requests to a federation of servers. Only had the issue occur on the downstream so it was a client issue. This made it pretty quick to replicate, so I could go through debuginfod_query_server and see how far down I can put an early exit before seeing the race condition. That narrowed it down to debuginfod_config_cache. From there noticed the double close and it was smooth sailing.

I'm not sure about tooling around the double close but is it possible to know that something is a double close if the fd may just be reused? Since in this case for instance the close won't fail, we're just closing someone else's open, good to go fd. fwiw I looked in elfutils at least and we didn't use fdopen with a double close again.
Comment 4 Frank Ch. Eigler 2023-06-16 15:23:57 UTC
commit 938a52c22ee915ff2cea813edd5da66bc8184885
Author: Ryan Goldberg <rgoldber@redhat.com>
Date:   Fri Jun 16 10:20:04 2023 -0400

    debuginfod: PR29696: Removed secondary fd close in cache config causing a race condition