Idea: Scrape all the posts from a subreddit as they're being made, and "archive" them on a lemmy instance, making it very clear it's being rehosted, and linking back to the original. It would probably have to be a "closed" lemmy instance specifically for this purpose.
The tool would run for multiple subreddits, so Lemmy users could still keep up with and discuss content that would otherwise get left behind.
Thoughts? It's probably iffy copyright-wise, but I think I can square my conscience with it.
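For what it's worth, the "make it very clear it's being rehosted, and link back to the original" part could look roughly like the sketch below. It only builds the post payload and doesn't talk to any server. `ScrapedPost`, `build_archive_post`, and the example values are made up for illustration, and the field names (`name`, `url`, `body`, `community_id`) are an assumption modeled on Lemmy's create-post request, so check them against the API docs for your Lemmy version.

```python
from dataclasses import dataclass


@dataclass
class ScrapedPost:
    """Hypothetical container for one scraped Reddit post."""
    title: str
    author: str
    subreddit: str
    permalink: str  # full URL of the original Reddit post
    text: str = ""  # self-text, empty for link posts


def build_archive_post(post: ScrapedPost, community_id: int) -> dict:
    """Build a payload for the archive instance with a rehost notice on top.

    The dict keys are assumed, not confirmed, to match Lemmy's create-post fields.
    """
    notice = (
        f"**Rehosted from r/{post.subreddit}.** "
        f"Originally posted by u/{post.author}: {post.permalink}"
    )
    # Put the notice first so it's the first thing a reader sees.
    body = notice if not post.text else f"{notice}\n\n---\n\n{post.text}"
    return {
        "name": post.title,
        "url": post.permalink,  # link back to the original
        "body": body,
        "community_id": community_id,
    }
```

Something like `build_archive_post(ScrapedPost("Title", "someone", "example", "https://www.reddit.com/r/example/comments/abc/"), community_id=1)` would give a dict ready to send with whatever HTTP client and authentication the instance needs.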
Lemmy is based on a pull model, so if nobody on a different instance subscribes, then it doesn't show up in anybody else's feeds. If an admin doesn't want it in their "All" feed, they can block the instance.
Just make sure it's on its own instance with nothing else. Something like that is bound to be EXTREMELY noisy, and not all admins are gonna be happy about it. I assume that's what you meant by closed?
Just be aware that it might not work. Reddit implemented rate limits on page loads to combat the inevitable web scraping as they turn off the API. Test out how fast you can pull pages before putting in any real coding time.
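A quick way to run that test is a throwaway probe like the one below. It's a minimal sketch, not a scraper: `fetch_status` and `probe_rate` are hypothetical helper names, the User-Agent string is a placeholder, and it assumes throttling shows up as HTTP 429 (the standard "Too Many Requests" status), which may not be the only way a site pushes back.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, user_agent="archive-rate-test/0.1 (placeholder)"):
    """Fetch url and return the HTTP status code instead of raising on 4xx/5xx."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def probe_rate(fetch, url, attempts=20, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, pausing `delay` seconds between calls.

    Stops at the first 429. Returns (number_of_200_responses, was_throttled).
    """
    ok = 0
    for i in range(attempts):
        status = fetch(url)
        if status == 429:
            return ok, True
        if status == 200:
            ok += 1
        if i < attempts - 1:
            sleep(delay)
    return ok, False


# Example use (not run here; the URL is whatever page you plan to scrape):
# print(probe_rate(fetch_status, "https://example.invalid/some/page", delay=2.0))
```

Start with a generous delay and shrink it until the second value comes back `True`. That gives a rough ceiling on pages per minute to weigh against how many subreddits the tool would need to follow.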
"Reddit implemented rate limits on page loads to combat the inevitable web scraping"
This whole time I was wondering how the API changes made any sense when anyone disgruntled about it could just turn to scraping, putting drastically more load on Reddit's infrastructure. It makes me feel a bit better that they aren't that clueless.