I like it.
One of his Mastodon posts on this (chirp.zadzmo.org) is interesting:
"How much am I spending on this? About a Raspberry Pi, give or take, and I have Facebook, Amazon, and Anthropic slamming it hard at the same time."
Buy a domain. Rent a small VM. Set up a small, simple site with some random content and a link into a Nepenthes tar pit. Check in on it periodically to see which bots are crawling it.
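A minimal sketch of that bait-site idea in Python, assuming the Nepenthes tarpit is reverse-proxied behind the same host at a path like /pit/ (the path, port, and page content are all placeholders, not anything from his setup; in practice you'd serve a real static site and put the tarpit behind it):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = b"""<html><body>
    <p>Some unremarkable filler content.</p>
    <!-- a link no human is likely to click, but a crawler will follow -->
    <a href="/pit/start">archive</a>
    </body></html>"""

    class BaitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # log every visitor so we can see later who followed the link
            print(self.client_address[0], self.headers.get("User-Agent", "-"), self.path)
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(PAGE)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), BaitHandler).serve_forever()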
As he said, we could also use the logs from these tar pit sites to identify crawlers so other legitimate sites can block them.
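Mining those logs could be as simple as the rough sketch below, assuming a combined-format access log and the same hypothetical /pit/ prefix:

    import re
    from collections import Counter

    # combined log format: ip - - [time] "METHOD path proto" status bytes "referer" "agent"
    LOG_LINE = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
    )

    hits = Counter()
    with open("/var/log/nginx/access.log") as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m and m.group(2).startswith("/pit/"):   # only requests inside the tarpit
                hits[(m.group(1), m.group(3))] += 1    # key on (ip, user agent)

    # anything with hundreds of tarpit hits is almost certainly a crawler
    for (ip, agent), n in hits.most_common(20):
        print(f"{n:6d}  {ip}  {agent}")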
Imagine a few hundred of these sites running on various private VMs, wasting crawler time and filling their training data with garbage, while tens of thousands of sites full of legitimate content block those crawlers, both for server stability and to avoid feeding these beasts for free.
Update: He beat me to it; the crawler list is already published (chirp.zadzmo.org) (direct link (zadzmo.org)).
By happenstance, if anyone wants a list of what is absolutely a crawler as a JSON array, here it is - updated every 15 minutes:
https://zadzmo.org/code/nepenthes/crawlers.json
What does "absolutely a crawler" mean? More than 100 hits in the tarpit.