AI Startups Perplexity, OpenAI, and Anthropic Accused of Ignoring Robots.txt Requests to Scrape Web Content for Model Training

Sunday, 23 June 2024

San Francisco, California, United States of America

Technology

AI startups Perplexity, OpenAI, and Anthropic accused of ignoring robots.txt requests to scrape web content for model training.

Companies use large amounts of web-scraped text and data for chatbot training.

OpenAI and Anthropic are also reportedly disregarding robots.txt requests, according to TollBit.

Perplexity allegedly bypasses robots.txt requests from media publishers.

AI startup Perplexity has been called a 'BS machine' by Wired's global editorial director, Katie Drummond, who discussed an investigation into the AI search startup on the Squawk Box program. According to Wired, Perplexity allegedly ignores or bypasses robots.txt requests from media publishers to stop scraping their web content for free model training data.

TollBit, a startup facilitating paid licensing deals between publishers and AI companies, found that OpenAI and Anthropic, two of the world's leading AI startups, are also reportedly disregarding these requests. Although both companies have publicly stated that they respect robots.txt and blocks on their specific web crawlers, they are accused of choosing to bypass the rule to scrape content from websites for free model training data.

The issue stems from the increasing demand for high-quality data to build powerful AI models. OpenAI and Anthropic, backed by Microsoft and Amazon respectively, use large amounts of web-scraped text and data to power their chatbots, ChatGPT and Claude. Some tech companies have argued that web content should not be treated as copyrighted when it is used as AI training data. The US Copyright Office is expected to update its guidance on AI and copyright later this year.
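For readers unfamiliar with the mechanism, the sketch below shows how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler tokens (`GPTBot`, `ClaudeBot`, `ExampleSearchBot`), the URL, and the helper `is_allowed` are illustrative assumptions and are not taken from the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The crawler tokens are
# illustrative assumptions, not details from the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network call
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"
    for agent in ("GPTBot", "ClaudeBot", "ExampleSearchBot"):
        print(agent, is_allowed(agent, url))
```

Nothing in this check is enforced by the web server: robots.txt only states the publisher's wishes, and a crawler that never runs such a check can still request the page. That voluntary nature is what the accusations described here turn on.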

Confidence

80%

Doubts

Are there any potential legal consequences for the companies' actions?
Is it confirmed that the companies are definitely bypassing robots.txt requests?

Sources

78%

The overall score is a weighted number that takes into account conflict of interest, bias, deception and other practices that undermine the credibility of the source. It is calculated as:

(Site Conflicts Of Interest + Author Conflicts Of Interest) / 2.0 * 0.2 + ArticleBiasScore * 0.20 + UniquePointsScore * 0.05 + DeceptionScore * 0.20 + ReadabilityScore * 0.05 + FallacyScore * 0.20

A score that takes into consideration the content for flow, interruptions with ads, and overt search engine optimization techniques that make the content hard to understand.

OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

Business Insider, Kali Hays, Sunday, 23 June 2024 05:58

Unique Points

OpenAI and Anthropic are ignoring or circumventing the robots.txt rule that prevents automated scraping of websites for free model training data.
TollBit found several AI companies, including OpenAI and Anthropic, acting in this way.
Robots.txt is a code used since the late 1990s to prevent bot crawlers from scraping website data.
OpenAI and Anthropic have publicly stated they respect robots.txt but are bypassing it to retrieve or scrape all content from a website or page.
Both OpenAI and Anthropic are behind popular chatbots, ChatGPT and Claude respectively, which serve up answers in the tone of a human using massive amounts of written text and data scraped from the web.
OpenAI has struck deals with some publishers for access to content, including Axel Springer, which owns Business Insider.

Accuracy

Despite public statements of respecting robots.txt and having blocks to their specific web crawlers, the companies are accused of choosing to bypass the rule.

Deception (30%)

The article makes several statements that imply deception. The author states 'According to TollBit’s findings, such blocks are not being respected, as claimed.' However, no evidence is provided in the article to support this claim. Additionally, the author states 'OpenAI and Anthropic have stated publicly that they respect robots.txt and blocks to their specific web crawlers.' But later in the article it is implied that they are not respecting these blocks by using phrases like 'choosing to bypass robots.txt'. This is a form of selective reporting, as the author only reports details that support their position and ignores any contradictory information. The author also uses emotional manipulation by stating 'The thirst for such training data has undermined robots.txt and the unofficial agreements supporting the use of this code.' This statement is intended to elicit an emotional response from readers, rather than providing factual information.

'With the rise of generative AI, startups and tech companies are racing to build the most powerful AI models. A key ingredient is high-quality data. The thirst for such training data has undermined robots.txt and the unofficial agreements supporting the use of this code.'
'OpenAI and Anthropic have stated publicly that they respect robots.txt and blocks to their specific web crawlers. According to TollBit’s findings, such blocks are not being respected, as claimed.'

Fallacies (85%)

The author makes an appeal to authority by citing TollBit's findings without providing any evidence or context about the reliability or credibility of this source. The author also uses inflammatory rhetoric by stating that OpenAI and Anthropic are 'ignoring requests' and 'choosing to bypass robots.txt' without providing any concrete examples or proof of this allegation.

'The world’s top two AI startups are ignoring requests by media publishers to stop scraping their web content for free model training data.'
'According to TollBit’s findings, such blocks are not being respected, as claimed.'

Bias (90%)

The author expresses a clear bias against OpenAI and Anthropic for ignoring the robots.txt rule and scraping web content without permission from publishers. The author also implies that these companies are acting unethically by undermining established rules and agreements on the web.

'AI companies, including OpenAI and Anthropic, are simply choosing to "bypass" robots.txt in order to retrieve or scrape all of the content from a given website or page.'
'OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that prevents automated scraping of websites.'

Site Conflicts Of Interest (100%)

None Found At Time Of Publication

Author Conflicts Of Interest (100%)

None Found At Time Of Publication

99%

Wired: AI startup Perplexity is 'BS machine'

CNBC News, Friday, 21 June 2024 13:30

Unique Points

Wired's global editorial director, Katie Drummond, discussed an investigation into AI search startup Perplexity on Squawk Box.
The discussion about Perplexity took place on the Squawk Box program.

Accuracy

No Contradictions at Time Of Publication

Deception (100%)

None Found At Time Of Publication

Fallacies (100%)

None Found At Time Of Publication

Bias (100%)

None Found At Time Of Publication

Site Conflicts Of Interest (100%)

None Found At Time Of Publication

Author Conflicts Of Interest (0%)

None Found At Time Of Publication

92%

OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

BestOfAI.com, Sunday, 23 June 2024 05:59

Unique Points

OpenAI and Anthropic are reportedly ignoring or bypassing the robots.txt web rule for free model training data.
Despite public statements of respecting robots.txt and having blocks to their specific web crawlers, the companies are accused of choosing to bypass the rule.

Accuracy

Deception (100%)

None Found At Time Of Publication

Fallacies (90%)

The article reports on OpenAI and Anthropic allegedly bypassing the robots.txt rule to scrape content for free model training data. The author does not commit any formal or informal fallacies in their reporting of the issue. However, there are some instances of inflammatory rhetoric used in the article, such as 'despite public statements' and 'accused of not respecting'. These do not significantly impact the overall score due to their limited use.

'Despite public statements from both companies claiming respect for the rule and blocks to their specific web crawlers, TollBit's findings suggest these blocks are not being respected.'
'OpenAI and Anthropic are accused of not respecting such blocks and choosing to ‘bypass’ robots.txt to scrape content from websites.'

Bias (80%)

The author expresses a clear bias against OpenAI and Anthropic by accusing them of ignoring or circumventing the robots.txt rule without providing any evidence that they have personally done so. The author also uses language that depicts the companies as unscrupulous for scraping content from websites for free model training data.

'Despite public statements of respecting robots.txt and blocks to their specific web crawlers, OpenAI and Anthropic are accused of not respecting such blocks and choosing to ‘bypass’ robots.txt to scrape content from websites.'

Site Conflicts Of Interest (100%)

None Found At Time Of Publication

Author Conflicts Of Interest (0%)

None Found At Time Of Publication

97%

OpenAI And Anthropic Allegedly Ignore Web Scraping Rules, Stirring Controversy

Benzinga News, Rounak Jain, Sunday, 23 June 2024 06:00

Unique Points

OpenAI and Anthropic are reportedly disregarding robots.txt requests from media publishers to cease scraping their web content for free model training data.
TollBit, a startup facilitating paid licensing deals between publishers and AI companies, found that OpenAI and Anthropic allegedly ignore or bypass the web rule designed to prevent automated scraping of websites.
OpenAI has previously struck deals with publishers for access to content, including Axel Springer and News Corp.

Accuracy

OpenAI and Anthropic allegedly ignore or bypass the web rule designed to prevent automated scraping of websites.
Both OpenAI and Anthropic are behind popular chatbots, ChatGPT and Claude respectively, which serve up answers in the tone of a human using massive amounts of written text and data scraped from the web.

Deception (100%)

None Found At Time Of Publication

Fallacies (100%)

None Found At Time Of Publication

Bias (95%)

The author expresses a clear bias against OpenAI and Anthropic for allegedly ignoring web scraping rules. The author uses language that depicts the companies as extreme or unreasonable by stating 'despite public statements from OpenAI and Anthropic that they respect robots.txt and blocks to their specific web crawlers, TollBit’s findings suggest otherwise.' This implies that the companies are lying about respecting these rules.

'Despite public statements from OpenAI and Anthropic that they respect robots.txt and blocks to their specific web crawlers, TollBit’s findings suggest otherwise.'

Site Conflicts Of Interest (100%)

None Found At Time Of Publication

Author Conflicts Of Interest (100%)

None Found At Time Of Publication
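
The overall-score formula quoted under Sources can be read as a plain weighted sum. The sketch below is one literal reading of it, assuming every component is a percentage from 0 to 100. The function name `overall_score` is hypothetical, and because the page does not show the unique-points or readability scores, the two values of 100 used for them are placeholders. The result is therefore not expected to match the displayed 78%, and the page may apply steps that the formula text does not describe.

```python
def overall_score(site_coi: float, author_coi: float, bias: float,
                  unique_points: float, deception: float,
                  readability: float, fallacy: float) -> float:
    """Weighted sum following the formula as written on the page (inputs 0-100)."""
    conflicts = (site_coi + author_coi) / 2.0  # average of the two conflict scores
    return (conflicts * 0.20
            + bias * 0.20
            + unique_points * 0.05
            + deception * 0.20
            + readability * 0.05
            + fallacy * 0.20)


if __name__ == "__main__":
    # Sub-scores listed for the Business Insider source; unique_points and
    # readability are NOT shown on the page, so 100 is a placeholder assumption.
    score = overall_score(site_coi=100, author_coi=100, bias=90,
                          unique_points=100, deception=30,
                          readability=100, fallacy=85)
    print(f"{score:.1f}")
```

As written, the weights add up to 0.90 rather than 1.00, so a source scoring 100 on every component would come out at 90 under this reading.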
