Bug 28708 - run-debuginfod-webapi-concurrency.sh seems to be flaky
Summary: run-debuginfod-webapi-concurrency.sh seems to be flaky
Status: RESOLVED FIXED
Alias: None
Product: elfutils
Classification: Unclassified
Component: debuginfod
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-17 01:02 UTC by Evgeny Vereshchagin
Modified: 2022-04-24 10:23 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2021-12-17 00:00:00


Attachments
full log (273.66 KB, application/x-gzip)
2021-12-17 01:25 UTC, Evgeny Vereshchagin

Description Evgeny Vereshchagin 2021-12-17 01:02:32 UTC
elfutils is built on various architectures with https://packit.dev/ in https://github.com/evverx/elfutils, and since run-debuginfod-webapi-concurrency.sh was added the test has been failing more or less consistently on ppc64le and intermittently on the other architectures. The log can be found at https://copr-be.cloud.fedoraproject.org/results/packit/evverx-elfutils-53/fedora-rawhide-ppc64le/03059293-elfutils/builder-live.log.gz. The log will expire eventually, but as far as I can see the failure is also reported by buildbot from time to time.
Comment 1 Evgeny Vereshchagin 2021-12-17 01:25:29 UTC
Created attachment 13859
full log

Just in case, I've just attached the full log.
Comment 2 Frank Ch. Eigler 2021-12-17 01:53:33 UTC
Thanks for the report.  The logs indicate an unexplained glitch within libmicrohttpd (it rejects connections without giving a reason).  Maybe the builder is somehow strangely resource constrained?  We could make the test less assertive about 100% success of all those parallel curl jobs.
Comment 3 Evgeny Vereshchagin 2021-12-17 02:30:23 UTC
I think they are constrained in the sense that those machines are much slower than usual. On top of that the packages are built in a sandbox environment and that makes them even slower.
Comment 4 Mark Wielaard 2021-12-17 08:31:33 UTC
Note that packit doesn't use real hardware for various architectures but "container emulation" which causes various testcases to fail.

Although in this case it seems it is overloading the host. Maybe we can tune down the number of concurrent requests tested; see also:
https://sourceware.org/pipermail/elfutils-devel/2021q4/thread.html#4488
Let us know if you have a better lower/upper bound or a way to test the limits of the machine.

We do have somewhat better buildbot workers for various architectures here:
https://builder.wildebeest.org/buildbot/#/builders?tags=elfutils
Comment 5 Frank Ch. Eigler 2021-12-17 18:48:33 UTC
(In reply to Mark Wielaard from comment #4)
> Note that packit doesn't use real hardware for various architectures but
> "container emulation" which causes various testcases to fail.
>
> Although in this case it seems it is overloading the host. [...]

Is there some way of finding out the host's actual limits?  Can we detect that we're running in an unusually constricted environment and skip this test?  ulimit -u?
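
A minimal sketch of such a check, assuming the per-user process limit reported by `ulimit -u` is the relevant constraint. The helper name `have_process_headroom` and the budget of 250 are hypothetical, and the check is only approximate because the limit covers every process and thread of the user, not just this test.

```shell
# Hypothetical helper: succeed if the per-user process limit is at least
# "needed". Shells without "ulimit -u" are treated as unlimited.
have_process_headroom() {
    needed=$1
    limit=$(ulimit -u 2>/dev/null) || limit=unlimited
    case "$limit" in
        unlimited|"") return 0 ;;
    esac
    [ "$limit" -ge "$needed" ]
}

# Assumed budget: 100 curl clients plus roughly 100 server threads and slack.
if ! have_process_headroom 250; then
    echo "process limit too low for the concurrency test; skipping"
    # A real test script would "exit 77" here, which the automake
    # test harness reports as SKIP.
fi
```

This only covers the `ulimit -u` idea; limits imposed by a container manager would need a separate check.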
Comment 7 Evgeny Vereshchagin 2021-12-19 22:06:06 UTC
> Note that packit doesn't use real hardware for various architectures but
> "container emulation" which causes various testcases to fail.
> 

I think I ran into issues like that in https://github.com/evverx/elfutils/issues/32 and https://github.com/evverx/elfutils/issues/31. I ignore them for the most part. Though it would be great if they could be skipped there. Some of them seem to be easy to skip because they seem to trigger seccomp filters of some kind but I'm not sure about the rest.

> Although in this case it seems it is overloading the host. Maybe we can tune
> down the number of concurrent requests tested; see also:
> https://sourceware.org/pipermail/elfutils-devel/2021q4/thread.html#4488
> Let us know if you have a better lower/upper bound or a way to test the
> limits of the machine.
> 

Thanks for the link. I'll take a look.

> We do have somewhat better buildbot workers for various architectures here:
> https://builder.wildebeest.org/buildbot/#/builders?tags=elfutils


As far as I understand, the tests are run there on commits to the elfutils repository, but I'm not sure how to test "PRs" there. If it were possible to use those workers before commits are merged into the master branch, I probably wouldn't have started using Packit on GitHub.

> Is there some way of finding out the host's actual limits?  Can we detect that
> we're running in an unusually constricted environment and skip this test?
> ulimit -u?

I think I can run almost anything there but since I'm not familiar with the test I'm not sure what I should look for.
Comment 8 Frank Ch. Eigler 2021-12-19 22:47:17 UTC
This test creates up to 100 plus a few threads in debuginfod, and also 100 concurrent curl processes to talk to debuginfod.
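
The client side of that pattern can be sketched as follows. This is an illustration, not the code of run-debuginfod-webapi-concurrency.sh: `run_parallel` and `max_failures` are hypothetical names, and `true`/`false` stand in for the curl invocation so the sketch runs without a server.

```shell
# Run "count" copies of a command in the background and print how many failed.
run_parallel() {
    count=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failures=0
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    echo "$failures"
}

run_parallel 5 true     # prints 0
run_parallel 5 false    # prints 5

# A less assertive test could tolerate a few failed clients.
max_failures=1          # hypothetical tolerance, not a value from the real test
failed=$(run_parallel 5 true)
[ "$failed" -le "$max_failures" ] && echo "ok: $failed failures"
```

Counting failures instead of aborting on the first one is what would allow the test to be less strict about 100% success.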
Comment 9 Mark Wielaard 2021-12-20 17:17:00 UTC
(In reply to Evgeny Vereshchagin from comment #7)
> > Note that packit doesn't use real hardware for various architectures but
> > "container emulation" which causes various testcases to fail.
> > 
> I think I ran into issues like that in
> https://github.com/evverx/elfutils/issues/32 and
> https://github.com/evverx/elfutils/issues/31. I ignore them for the most
> part. Though it would be great if they could be skipped there. Some of them
> seem to be easy to skip because they seem to trigger seccomp filters of some
> kind but I'm not sure about the rest.

The easiest approach is to run containers with --security-opt seccomp=unconfined to make sure seccomp doesn't arbitrarily block syscalls (or worse, return EPERM instead of ENOSYS).

> > We do have somewhat better buildbot workers for various architectures here:
> > https://builder.wildebeest.org/buildbot/#/builders?tags=elfutils
>  
> As far as I understand the tests are run there on commits to the elfutils
> repository but I'm not sure how to test "PRs" there. If it was possible to
> use it before commits are merged into the master branch I wouldn't have
> started using Packit on GitHub probably.

There is a vacation and a nationwide lockdown coming up so I can see what I can do. I hope to connect the buildbot with patchworks so that you can easily test any submitted patch before committing.
Comment 10 Evgeny Vereshchagin 2021-12-20 23:21:28 UTC
(In reply to Mark Wielaard from comment #9)
> (In reply to Evgeny Vereshchagin from comment #7)
> > > Note that packit doesn't use real hardware for various architectures but
> > > "container emulation" which causes various testcases to fail.
> > > 
> > I think I ran into issues like that in
> > https://github.com/evverx/elfutils/issues/32 and
> > https://github.com/evverx/elfutils/issues/31. I ignore them for the most
> > part. Though it would be great if they could be skipped there. Some of them
> > seem to be easy to skip because they seem to trigger seccomp filters of some
> > kind but I'm not sure about the rest.
> 
> The easiest approach is to run containers with --security-opt
> seccomp=unconfined to make sure seccomp doesn't arbitrarily block syscalls
> (or worse, return EPERM instead of ENOSYS).
> 

Those containers are launched by Packit (or, more precisely, by mock), so I can't control how they are run. According to systemd-detect-virt, those are nspawn containers, and I'm 50% sure those failures are caused by a bug in either systemd-nspawn or libseccomp.

In the meantime, I added a couple of bash commands that show whether the test hit its "pids" limit set by either systemd on the host or systemd-nspawn (or both). pids.max is unfortunately set to "max" there, so it isn't obvious how many tasks can be run there at the same time.
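
For reference, a sketch of what such commands might look like. These are not the commands actually added to the build. It assumes a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup, and `show_pids_state` is a hypothetical helper name.

```shell
# Print the pids controller state for one cgroup v2 directory:
# pids.max (the limit, or "max"), pids.current, and pids.events
# (a "max N" line counting how often the limit was hit).
show_pids_state() {
    dir=$1
    for f in pids.max pids.current pids.events; do
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(tr '\n' ' ' < "$dir/$f" | sed 's/ *$//')"
        else
            printf '%s: unavailable\n' "$f"
        fi
    done
}

# On cgroup v2, /proc/self/cgroup holds a single "0::/path" line.
cgpath=$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null || true)
show_pids_state "/sys/fs/cgroup${cgpath}"
```

A nonzero count in pids.events after the test run would suggest that the limit was hit. A parent cgroup can also impose the limit, so the same function may need to be run on each ancestor directory.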
Comment 11 Frank Ch. Eigler 2022-04-03 00:43:01 UTC
OK, some findings from when a similar-sounding problem intermittently occurred on an s390x VM.

It seems that we were expecting too much of libmicrohttpd.

When it offers a thread-pool (which we trigger in debuginfod via the -Cnnn option), it splits a hypothetical concurrent-connection limit amongst all those threads.  When a new connection comes in, it seems to be just luck as to which thread gets woken up.  And if that thread has some active connections still (such as from previous transmission operations that were enqueued previously and still in progress), then the new connection may go over its private daemon->connection_limit and fail.  (At the same time, many threads may exist with much larger available connection limits, but they are not consulted.)

This is probably why Mark's experimental MHD_OPTION_CONNECTION_LIMIT set helped (1000ish->4000ish), because then dividing all those limits among the 100ish threads leaves 40 each to work from rather than 10.
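
The arithmetic, spelled out with the round numbers from this comment. Plain integer division is an assumption for illustration, and `per_thread_limit` is a hypothetical helper, not a libmicrohttpd function.

```shell
# Share of a global connection limit that each pool thread receives.
per_thread_limit() {
    echo $(( $1 / $2 ))
}

per_thread_limit 1000 100   # prints 10
per_thread_limit 4000 100   # prints 40
```

With only about 10 slots per thread, a thread that still holds a few connections in progress can run out of slots while other threads sit idle.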

Investigating some microhttpd modes/options that may trigger more favourable behaviour.  But if nothing appears, we may just need to turn down the tight expectations of this test case.
Comment 13 Mark Wielaard 2022-04-05 14:09:01 UTC
(In reply to Evgeny Vereshchagin from comment #12)
> FWIW with
> https://sourceware.org/git/?p=elfutils.git;a=commit;
> h=e646e363e72e06e0ed5574c929236d815ddcbbaf applied the test appears to be
> flaky on Packit on s390x:
> https://copr-be.cloud.fedoraproject.org/results/packit/evverx-elfutils-73/
> fedora-35-s390x/03942110-elfutils/builder-live.log.gz

So that log contains the feared:
error_count{libmicrohttpd="Server reached connection limit. Closing inbound connection.\n"} 35
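
A small sketch for pulling that counter out of metrics text, for example in a test that wants to report it. `connection_limit_errors` is a hypothetical helper name. The sample line is the one quoted above.

```shell
# Read metrics text on stdin and print the value of the libmicrohttpd
# connection-limit error counter, or 0 if no such line is present.
connection_limit_errors() {
    awk 'index($0, "error_count{libmicrohttpd=\"Server reached connection limit") == 1 { n = $NF }
         END { print n + 0 }'
}

sample='error_count{libmicrohttpd="Server reached connection limit. Closing inbound connection.\n"} 35'
printf '%s\n' "$sample" | connection_limit_errors   # prints 35

# Against a running server the input would come from its /metrics page,
# e.g. curl -s "http://127.0.0.1:$PORT/metrics" (PORT is an assumed variable).
```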

And sadly I have also been able to replicate that on another s390x setup even with all the latest patches.

The thing they seem to have in common is that they are both s390x and have only 2 cores.

If I lower the -C100 to -C32 in run-debuginfod-webapi-concurrency.sh it does seem to always pass. But with -C50 or higher it does occasionally fail (the higher the value, the more frequently it fails).

BTW. run-debuginfod-webapi-concurrency.sh seems stable on any other system I've thrown it at. So it isn't exactly clear what "such a system" is? Is it s390x specific?
Comment 14 Mark Wielaard 2022-04-24 10:23:56 UTC
commit 3bcf887340fd47d0d8a3671cc45abe2989d1fd6c
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Apr 24 12:16:58 2022 +0200

    debuginfod: Use MHD_USE_ITC in MHD_start_daemon flags
    
    This prevents the "Server reached connection limit. Closing inbound
    connection." issue we have been seeing in the
    run-debuginfod-webapi-concurrency.sh testcase. From the manual:
    
        If the connection limit is reached, MHD’s behavior depends a bit
        on other options. If MHD_USE_ITC was given, MHD will stop
        accepting connections on the listen socket. This will cause the
        operating system to queue connections (up to the listen() limit)
        above the connection limit. Those connections will be held until
        MHD is done processing at least one of the active connections. If
        MHD_USE_ITC is not set, then MHD will continue to accept() and
        immediately close() these connections.
    
    https://sourceware.org/bugzilla/show_bug.cgi?id=28708
    
    Signed-off-by: Mark Wielaard <mark@klomp.org>