Reddit announced on Tuesday that it’s updating its Robots Exclusion Protocol (robots.txt file), which tells automated web bots whether or not they are permitted to crawl a site.
Historically, a robots.txt file was used to permit search engines like Google to scrape a site and then direct people to the content. However, with the rise of AI, websites are being scraped to train models without acknowledging the actual source of the content.
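To illustrate how a robots.txt file communicates crawl permission, here is a minimal sketch using Python's standard-library parser. The user agent and the rules below are hypothetical, not Reddit's actual robots.txt:

```python
# Sketch: checking crawl permission against a robots.txt policy with
# Python's standard-library parser. The rules and bot name below are
# hypothetical examples, not Reddit's real configuration.
from urllib.robotparser import RobotFileParser

# A blanket policy: every user agent is disallowed from every path.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
allowed = parser.can_fetch("ExampleAIBot", "https://www.reddit.com/r/news/")
print(allowed)  # False: a compliant bot must not fetch this URL
```

The catch, as the article notes, is that compliance is voluntary: robots.txt only restrains crawlers that choose to consult it.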
Along with the updated robots.txt file, Reddit will continue rate-limiting and blocking unknown bots and crawlers from accessing its platform. The company told TechCrunch that bots and crawlers will be rate-limited or blocked if they don't abide by Reddit's Public Content Policy and don't have an agreement with the platform.
Reddit says the update shouldn't affect the majority of users or good-faith actors, like researchers and organizations such as the Internet Archive. Instead, the update is designed to deter AI companies from training their large language models on Reddit content. That said, AI crawlers could simply ignore Reddit's robots.txt file.
The announcement comes just days after a Wired investigation found that AI-powered search startup Perplexity has been stealing and scraping content. Wired found that Perplexity appears to ignore requests not to scrape its website, even though Wired blocked the startup in its robots.txt file. Perplexity CEO Aravind Srinivas responded to the claims, saying that the robots.txt file isn't a legal framework.
Reddit's upcoming changes won't affect companies with which it has an agreement. For example, Reddit has a $60 million deal with Google that allows the search giant to train its AI models on the social platform's content. With these changes, Reddit is signaling to other companies that want to use its data for AI training that they'll have to pay.
"Anyone accessing Reddit content must abide by our policies, including those in place to protect redditors," Reddit said in its blog post. "We are selective about who we work with and trust with large-scale access to Reddit content."
The announcement doesn't come as a surprise, as Reddit released a new policy a few weeks ago designed to guide how its data is accessed and used by commercial entities and other partners.