Abusive (AI) crawlers
Why do they get stuck crawling gitea?
Hi, it looks like I'm writing a rare blog post. Mostly so that this stuff can be found in search, whee. And today's topic is everyone's “favourite” - crawling bots that don't respect your bandwidth.
Primarily, I am talking about an enormous distributed scraping setup that's running on Huawei Cloud (AS136907) and Alibaba Cloud (AS45102). It's most recognizable by directly loading pages without any of their extra resources, under seemingly normal-looking user agents. Yet it does so from dozens, hundreds, maybe thousands of different IP addresses, each with its user agent picked from a list at random.
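That "pages but never their resources" pattern is actually easy to spot in access logs. A minimal sketch of the idea follows; the log format, file paths, and IP addresses here are all invented for illustration (real access logs have more fields, so adjust the matching for your setup):

```shell
# Hypothetical sample in a simplified "IP request user-agent" format;
# these IPs are documentation addresses, not real scraper IPs.
cat > /tmp/crawler-sample.log <<'EOF'
198.51.100.10 "GET /user/repo/commits?page=3 HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0)"
198.51.100.11 "GET /user/repo/blame/main/file.c HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64)"
203.0.113.7 "GET /user/repo HTTP/1.1" "Mozilla/5.0 (Macintosh)"
203.0.113.7 "GET /assets/css/index.css HTTP/1.1" "Mozilla/5.0 (Macintosh)"
EOF

# Print IPs that requested pages but never any static assets -
# a browser pulls CSS/JS, this scraper doesn't.
awk '{
  ip = $1
  if ($0 ~ /\.(css|js|png|ico)/) asset[ip] = 1
  else page[ip] = 1
}
END { for (ip in page) if (!(ip in asset)) print ip }' /tmp/crawler-sample.log | sort
# → 198.51.100.10
#   198.51.100.11
```

It's a heuristic, not a verdict: RSS readers and legitimate bots also skip assets, so treat the output as a list of candidates to eyeball, not a blocklist.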
Said scraping network also fucking sucks. It is clearly targeting git forges like GitLab or Gitea instances, and in fact I've mostly noticed people complaining about it specifically in context of these.
Yet despite the targeting, it absolutely sucks and scrapes every single fucking page. Everything that exists with an href gets caught in the net of the damn network.
It does not deduplicate links, and it doesn't bother using APIs. It will go through every filter and every page of anything vaguely dynamic; it will go through every representation of each page that can possibly exist.
And then it will try to scrape it again and again over some period of time.
Now, most of the crawlers that identify themselves and don't hop between IPs on every request do honor robots.txt, and thankfully went away entirely when I edited it to say "go away". But among them, ImageSift by Hive has a few bots that claim to be part of their scraping effort, and they even check robots.txt. And then proceed to ignore it and snatch some random URL, in a pattern eerily similar to the one described above, where they just go through every link on the page without caring to deduplicate anything, avoid dynamic filters, etc. The IPs of abusive scrapers claiming to be ImageSift (though I have some doubts) are as follows, and the list might be extended in the future. Please keep in mind the date of this article before you slap a block on them, thank you. Abusive IPs: 64.124.8.145, 74.80.208.139, 74.80.208.224.
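For reference, the "go away" robots.txt is about as short as config gets. Note that this version turns away everything well-behaved, including search engines; a softer variant lists individual crawler names as separate User-agent blocks instead of using the wildcard:

```
User-agent: *
Disallow: /
```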
Personally, I've solved this issue by DROPping a bunch of IP ranges from the relevant AS-es, because it turns out they mixed their corporate infra ranges with the IP addresses they provide to customers. Very cool, Huawei and Alibaba, very helpful. One thing of note here is that the country doesn't matter: the IP ranges clearly map to specific cloud providers rather than specific countries. At least in my case, that is.
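If you want to do the same, the shape of it looks roughly like this. This is a sketch, not a drop-in: the CIDR below is a documentation placeholder, not an actual Huawei or Alibaba prefix, so look up the prefixes currently announced by AS136907 and AS45102 yourself before blocking anything. It needs root and the ipset tool:

```shell
# Create a set of networks once, so there's one iptables rule total
# instead of one rule per range.
ipset create scrapers hash:net

# Placeholder range - replace with prefixes actually announced
# by the AS-es you want gone.
ipset add scrapers 198.51.100.0/24

# Silently drop everything from those ranges.
iptables -I INPUT -m set --match-set scrapers src -j DROP
```

The ipset indirection matters once you're blocking dozens of ranges: you can add and remove networks from the set without touching the firewall rules, and nftables named sets work the same way if that's what your distro uses.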
On a bigger scale, where slapping an iptables DROP is not viable, I've seen people use Xe's Anubis (from Techaro), namely on the GitLab instance run by GNOME. It's a pretty nice solution, honestly. Hopefully I won't get more bots that will require me to drop that in.
Things that do NOT help anyone when fighting AI abuse, however, are CAPTCHAs. While the above solution is somewhat computationally "wasteful", it works without much effort from the visitors of the site. CAPTCHAs, however, suck, and have been getting way worse recently as the companies running them start to use generative AI. Yes, I mean hCaptcha. You've probably seen one of them: awful molten blurry stuff, asking you to identify which fever dream of a GPU matches an equally feverish example, with nonsense patterns or words designed to throw off casual image-recognition bots. I struggle with them, and I am scared to consider how kids, the elderly, or anyone with an unlucky enough disability will handle these. They keep getting worse and less recognizable, yet there's zero way to bypass them in the cases where they are used. I'm left hoping that eventually the company will move away from this “solution”, or that people will force them to.
I don't think we are at the end of the open internet. But it's yet another drop in the monster that is the list of things you need to deal with to run a website at all. And in a way, running your own website away from some kind of managed hosting is already a rare thing, done by those with enough knowledge and time. But I am really hoping that people will still try to do it, even with hosting like Neocities, because it's much better for everyone when there are things beyond Discord or the Twitter-like of the year.