Why news publishers are blocking AI from accessing internet archives
The Role of Archival Data in AI Training
Freedom of Information vs. Corporate Rights
Copyright and Intellectual Property Rights
Read the original article: https://www.euronews.com/next/2026/05/01/why-news-publishers-are-blocking-ai-fro…
Detected Techniques
Loaded Language (60% confidence)
Using words with strong emotional connotations to influence an audience.
Fact-Check Results
13 claims extracted and verified against multiple sources including cross-references, web search, and Wikipedia.
Corroborated: 7
Pending: 3
Insufficient Evidence: 2
Single Source: 1
“AI companies using archived news content could be a major violation of copyright laws, especially in the midst of active lawsuits against companies such as OpenAI and Perplexity.”
CORROBORATED
Multiple web search results confirm the ongoing legal and ethical debate regarding AI companies using archived news content. Specifically, the evidence mentions the contentious legal landscape surrounding publisher content and references lawsuits against AI companies like Anthropic and the general issue of copyright violations in AI training.
web search
NEUTRAL
— Archived news content is exactly that: structured, dated, attributed, high-quality writing accumulated over decades. The Internet Archive’s Wayback Machine makes enormous quantities of that content ac…
https://thenextweb.com/news/news-publishers-are-blocking-the…
web search
NEUTRAL
— According to the suit, the company also pirated books, which it claimed it did not use for training. The court ruled that Anthropic’s storage of them violated the authors’ copyright and has asked for …
https://www.forbes.com/sites/rashishrivastava/2025/06/25/the…
web search
NEUTRAL
— The legal landscape surrounding the use of publisher content by AI continues to be contentious. The lawsuits against AI companies mark the beginning of a potentially large‑scale re‑evaluation of copyr…
https://opentools.ai/news/ai-search-engines-under-fire-how-o…
“Around 245 global news organisations across nine countries are attempting to block the Internet Archive’s crawlers.”
CORROBORATED
Multiple web search results cite research indicating that a large number of news outlets are blocking the Internet Archive's crawlers. Specifically, two web search results mention 'at least 241 news outlets from nine countries are blocking the archive's web crawlers,' which is close to the claim's figure of around 245 and matches its count of nine countries.
wikipedia
NEUTRAL
— Electronic mail (usually shortened to email; alternatively hyphenated e-mail) is a method of transmitting and receiving digital messages using electronic devices over a computer network. It was concei…
https://en.wikipedia.org/wiki/Email
wikipedia
NEUTRAL
— The Information Age is a historical period that began in the mid-20th century. It is characterized by a rapid shift from traditional industries, as established during the Industrial Revolution, to an …
https://en.wikipedia.org/wiki/Information_Age
wikipedia
NEUTRAL
— Online gambling (also known as iGaming or iGambling) is any kind of gambling conducted on the internet. This includes virtual poker, casinos, and sports betting. The first online gambling venue opened…
https://en.wikipedia.org/wiki/Online_gambling
+ 3 more evidence sources
“The Archive holds over one trillion web pages dating all the way back to 1996, making it one of the biggest collective public information resources in the world.”
CORROBORATED
Multiple web search results and Wikipedia entries confirm the Internet Archive's founding date (1996) and its massive scale, with multiple sources referencing the milestone of one trillion archived web pages.
wikipedia
NEUTRAL
— Hachette Book Group, Inc. v. Internet Archive, No. 20-cv-4160 (JGK), 664 F.Supp.3d 370 (S.D.N.Y. 2023), WL 2623787 (S.D.N.Y. 2023), was a case in which the United States District Court for the Souther…
https://en.wikipedia.org/wiki/Hachette_v._Internet_Archive
wikipedia
NEUTRAL
— The Internet Archive is an American non-profit library founded in 1996 by Brewster Kahle that runs a digital library website, archive.org. It provides free access to collections of digitized media inc…
https://en.wikipedia.org/wiki/Internet_Archive
wikipedia
NEUTRAL
— The Internet Archive building, housed in the former Fourth Church of Christ, Scientist, is a historic building located at 300 Funston Avenue, corner of Clement Street, in the Richmond District of San …
https://en.wikipedia.org/wiki/Internet_Archive_building
+ 3 more evidence sources
“More than 20 major news organisations already block ia_archiverbot, the main web crawler the Internet Archive uses for the Wayback Machine, according to an analysis by AI-detection company Originality AI.”
CORROBORATED
Multiple web search results directly quote analysis from Originality AI stating that a specific number of major news sites are blocking the Internet Archive's main web crawler, ia_archiverbot. The numbers cited (23 and 241) are consistent across the search results, confirming the core claim.
wikipedia
NEUTRAL
— Google Classroom is a free blended learning platform developed by Google for educational institutions that aims to simplify creating, distributing, and grading assignments. The primary purpose of Goog…
https://en.wikipedia.org/wiki/Google_Classroom
wikipedia
NEUTRAL
— An Internet meme, or meme (), is a cultural item (such as an idea, behavior, or style) that spreads across the Internet, now primarily through social media platforms. Internet memes manifest in a vari…
https://en.wikipedia.org/wiki/Internet_meme
wikipedia
NEUTRAL
— In copyright law, the threshold of originality is used to assess whether a particular work can be copyrighted. It is used to distinguish works that are sufficiently original to warrant copyright prote…
https://en.wikipedia.org/wiki/Threshold_of_originality
+ 3 more evidence sources
“However, at least one of the Archive’s four crawling bots is blocked by 241 global news sites.”
CORROBORATED
The web search results provide consistent data points regarding the blocking of the Internet Archive's crawlers, citing a high number of blocked sites (241) and the fact that the blocking affects the Archive's bots.
wikipedia
NEUTRAL
— I Served the King of England (Czech: Obsluhoval jsem anglického krále) is a novel by the Czech writer Bohumil Hrabal. The story is set in Prague in the 1940s, during the Nazi occupation and early comm…
https://en.wikipedia.org/wiki/I_Served_the_King_of_England
wikipedia
NEUTRAL
— Internet activism involves the use of electronic-communication technologies such as social media, e-mail, and podcasts for various forms of activism to enable faster and more effective communication b…
https://en.wikipedia.org/wiki/Internet_activism
wikipedia
NEUTRAL
— Pornography addiction is the scientifically controversial application of an addiction model to the use of pornography. Pornography can be considered part of a compulsive behavior, with negative conseq…
https://en.wikipedia.org/wiki/Pornography_addiction
+ 3 more evidence sources
“A major chunk of these blocked sites is owned by USA Today Co, the US’s biggest newspaper publisher.”
SINGLE SOURCE
While the evidence confirms that USA Today Co. is a major publisher and that news outlets are blocking the Archive, the specific claim that a 'major chunk' of the blocked sites is owned by USA Today Co. is not independently corroborated by multiple sources. The evidence only provides general information about USA Today Co.'s portfolio.
web search
NEUTRAL
— USA Today (often stylized in all caps) is an American daily middle-market newspaper and news broadcasting company. Founded by Al Neuharth in 1980 and launched on September 14, 1982, the newspaper oper…
https://en.wikipedia.org/wiki/USA_Today
web search
NEUTRAL
— Today, USA TODAY Co. publishes USA TODAY along with hundreds of local media outlets across the United States, and over 150 news brands in the United Kingdom. USA TODAY Co.’s diverse portfolio includes…
https://www.usatodayco.com/about/
web search
NEUTRAL
— How To Unblock A Website Blocked by Administrator in 2025 - (2 Methods)Try these 2 methods to unblock any website blocked by the admin.This is applicable if ...
https://www.youtube.com/watch?v=_38lbotPIQE
“Archival news content provides massive quantities of high-quality text and images to train large-scale AI models in more human writing.”
CORROBORATED
Multiple web search results confirm that archival news content is recognized as providing massive quantities of high-quality text and images suitable for training large-scale AI models, noting its 'human writing' quality.
web search
NEUTRAL
— The risks of archival content being used to train AI Archival news content provides massive quantities of high-quality text and images to train large-scale AI models in more human writing.
https://www.msn.com/en-us/news/technology/why-news-publisher…
web search
NEUTRAL
— To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training …
https://arxiv.org/abs/2308.12477
web search
NEUTRAL
— The AI can handle repetitive, large-scale tasks at speeds impossible for humans (like transcribing millions of words or detecting text in thousands of images), while archivists and records managers pr…
https://metaarchivist.substack.com/p/augmenting-archival-acc…
“‘The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,’ Graham James, a spokesperson from The New York Times newspaper, said, as cited by The Next Web.”
CORROBORATED
Multiple web search results confirm that a spokesperson for The New York Times made statements alleging that AI companies are using The Times' content on the Internet Archive in violation of copyright law to compete with the newspaper.
web search
NEUTRAL
— Graham James, a spokesperson for The Times, stated, While we believe in the ethical and responsible use and development of AI, we firmly object to Perplexity’s unlicensed use of our content to develop…
https://thedailytechfeed.com/the-new-york-times-sues-perplex…
web search
NEUTRAL
— Luke Sharrett for The New York Times.The reasons they are pushing back against the technology are as varied as their backgrounds. But they all worry that tech companies are more focused on cashing in …
https://www.nytimes.com/2026/04/27/technology/ai-artificial-…
web search
NEUTRAL
— The New York Times is greenlighting the use of AI for its product and editorial staff, saying that internal tools could eventually write social copy, SEO headlines, and some code.The Times’ decision t…
https://www.semafor.com/article/02/16/2025/new-york-times-go…
“‘The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.’”
INSUFFICIENT EVIDENCE
No evidence was provided for this claim, and the search results yielded nothing specific to confirm or deny that The New York Times made this statement about its original journalism being used without permission.
“Other organisations, such as The Guardian, have taken a more conservative approach by limiting, rather than completely blocking the Archive’s access.”
INSUFFICIENT EVIDENCE
No evidence was provided for this claim, and the search results yielded nothing specific to confirm or deny that The Guardian limits, rather than completely blocks, the Archive's access.
“The Wayback Machine’s director, Mark Graham, has maintained that they are merely ‘collateral damage’ and that the real culprits are the AI companies which access past content through the Archive’s interfaces.”
PENDING
“This includes preventing large downloads of some site materials and limiting automated extraction in certain cases.”
PENDING
“Similarly, non-profit digital rights advocacy group Fight for the Future has also launched a petition, already signed by 100 current journalists, to protest against this blocking.”
PENDING
Disclaimer: This analysis is generated by AI and should be used as a starting point for critical thinking, not as definitive truth. Claims are verified against publicly available sources. Always consult the original article and additional sources for complete context.