We’ve covered the AI scraper bots before. These just hit web pages over and over, at high speed, to scrape new training data for LLMs. They’re an absolute plague across the whole World Wide Web and…
Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What’s the point of that anyway?
This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.
Any crawler that doesn’t know what it’s doing and doesn’t respect robots.txt, but wants to crawl an entire domain, will end up following these sorts of links naturally. It has no sense that the requests are “complex”, just that it’s fetching a URL with a few more query parameters than the one it started from.
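The whole loop really is about that dumb. Here’s a minimal sketch of what I mean, not any real scraper’s code; requests and BeautifulSoup are just stand-ins for whatever HTTP and HTML libraries they actually use:

```python
# Naive breadth-first crawl: fetch a page, collect every same-domain
# link, queue them all, repeat. Nothing here distinguishes a "complex"
# wiki diff view from an ordinary article page.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def naive_crawl(start_url, max_pages=1000):
    seen = {start_url}
    queue = deque([start_url])
    domain = urlparse(start_url).netloc

    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            # A wiki diff link like ?title=Page&diff=123&oldid=122 is just
            # another same-domain URL to this loop, so it gets queued like
            # everything else -- no robots.txt check, no URL budget.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
```

From this loop’s point of view, a history page that links to hundreds of diff/oldid combinations is simply a page with hundreds of new URLs to queue.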
The article even alludes to how to take advantage of this with its “trap the bots in a maze of fake pages” suggestion. Even crawlers that know what they’re doing will sometimes struggle with infinite URL spaces.
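The trap side is just as simple in principle. A toy sketch of such a maze; the /maze/ route and the hash-derived page names are my own invention for illustration, not anything from the article:

```python
# A handler that answers every /maze/<token> URL with links to more
# freshly derived /maze/... URLs. A crawler that queues every link it
# sees and has no URL budget can wander here indefinitely.
import hashlib
from flask import Flask

app = Flask(__name__)

def fake_links(token, count=5):
    # Derive deterministic child tokens from the current one,
    # so the "maze" needs no storage at all.
    return [hashlib.sha1(f"{token}/{i}".encode()).hexdigest()[:12]
            for i in range(count)]

@app.route("/maze/<token>")
def maze(token):
    links = "".join(
        f'<li><a href="/maze/{child}">page {child}</a></li>'
        for child in fake_links(token)
    )
    return f"<html><body><ul>{links}</ul></body></html>"

if __name__ == "__main__":
    app.run()
```

Because each page only links to deterministically generated children, the server keeps no state, while the naive crawler just keeps descending into new URLs.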
> This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.
It’s so ridiculous - these people supposedly have access to a super-smart AI (the one that’s supposed to take all our jobs soon), yet the AI can’t even tell them which pages are worth scraping multiple times per second and which are not. Instead, they regularly kill their hosts like maladapted parasites. It’s probably not surprising, but it’s still absurd.
Edit: Of course, I strongly suspect the scrapers don’t use the AI in this context (I guess they only used it to write their code, based on old Stack Overflow posts). That doesn’t make it any less ridiculous, though.