Bug 31551 - Better fail2ban scripts for search/ai spider fighting
Summary: Better fail2ban scripts for search/ai spider fighting
Status: NEW
Alias: None
Product: sourceware
Classification: Unclassified
Component: Infrastructure
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: overseers mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-25 00:22 UTC by Mark Wielaard
Modified: 2024-03-25 00:24 UTC
CC List: 0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Description Mark Wielaard 2024-03-25 00:22:36 UTC
Search and AI spiders are a difficult problem. Since everything we do is
open and public, we actually want people to be able to easily find
anything our projects publish. But these spiders (especially the new AI
ones) are often very aggressive and ignore our robots.txt, causing
service overload.

We have some fail2ban scripts that help, and in the worst case we add
aggressive spider IP addresses to the httpd block.include list (by
hand). But this doesn't really scale. One solution is smarter fail2ban
scripts. Another is providing sitemaps (https://www.sitemaps.org/) so
spiders have a known list of resources to index, and we can more easily
block any that go outside those.
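
A sitemap could be generated from the static content already on disk.
The sketch below only illustrates that idea; the document root, base
URL and file extensions are assumptions for illustration, not the
actual sourceware layout.

#!/usr/bin/env python3
# Sketch: generate a minimal sitemap.xml from the files under a document
# root, so well-behaved spiders get an explicit list of URLs to index.
# DOCROOT and BASE_URL below are assumptions, not the real setup.
import os
from xml.sax.saxutils import escape

DOCROOT = "/var/www/html"              # assumed document root
BASE_URL = "https://sourceware.org"    # assumed base URL for entries

def main():
    print('<?xml version="1.0" encoding="UTF-8"?>')
    print('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
    for dirpath, _dirs, files in os.walk(DOCROOT):
        for name in files:
            # Only list a few static file types; adjust as needed.
            if not name.endswith((".html", ".txt")):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), DOCROOT)
            print("  <url><loc>%s/%s</loc></url>" % (BASE_URL, escape(rel)))
    print("</urlset>")

if __name__ == "__main__":
    main()

The output could be regenerated from cron and referenced from
robots.txt with a Sitemap: line.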

We should have some kind of automation tying fail2ban and robots.txt
together. Anything that aggressively hits URLs that robots.txt
disallows should get banned.
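
One possible shape for that automation, as a rough sketch: parse the
existing robots.txt with Python's urllib.robotparser, count per-IP
requests for disallowed paths in the httpd access log, and hand the
worst offenders to fail2ban or block.include. The file locations and
threshold below are assumptions, not the real configuration.

#!/usr/bin/env python3
# Sketch: scan an httpd access log for clients that repeatedly request
# URLs disallowed by robots.txt and print candidate IPs to ban.
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "/var/www/html/robots.txt"    # assumed location
ACCESS_LOG = "/var/log/httpd/access_log"   # assumed location
THRESHOLD = 100                            # assumed: disallowed hits before flagging

# Combined log format: IP ident user [time] "METHOD path HTTP/x" status size ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')

def main():
    rp = RobotFileParser()
    with open(ROBOTS_TXT) as f:
        rp.parse(f.read().splitlines())

    hits = Counter()
    with open(ACCESS_LOG) as log:
        for line in log:
            m = LOG_RE.match(line)
            if not m:
                continue
            ip, _method, path = m.groups()
            # "*" checks the generic rules; a smarter script would use
            # the client's User-Agent from the log line instead.
            if not rp.can_fetch("*", path):
                hits[ip] += 1

    for ip, count in hits.most_common():
        if count < THRESHOLD:
            break
        # These IPs could be fed to a fail2ban jail or appended to
        # the httpd block.include list.
        print(f"{ip} {count}")

if __name__ == "__main__":
    main()

In practice the same check could be expressed as a fail2ban filter so
bans expire automatically instead of accumulating in block.include.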