Annie Gilbertson

Tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators. AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.

The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI. Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the “flat-earth theory.” Proof News created a tool to search for creators in the YouTube AI training dataset.

“No one came to me and said, ‘We would like to use this,’” said David Pakman, host of The David Pakman Show, a left-leaning politics channel with more than 2 million subscribers and more than 2 billion views. Nearly 160 of his videos were swept up into the YouTube Subtitles training dataset. Four people work full time on Pakman’s enterprise, which posts multiple videos each day in addition to producing a podcast, TikTok videos, and material for other platforms. If AI companies are paid, Pakman said, he should be compensated for the use of his data. He pointed out that some media companies have recently penned agreements to be paid for use of their work to train AI.

“This is my livelihood, and I put time, resources, money, and staff time into creating this content,” Pakman said. “There’s really no shortage of work.”

“It’s theft,” said Dave Wiskus, the CEO of Nebula, a streaming service partially owned by its creators, some of whom have had their work taken from YouTube to train AI. Wiskus said it’s “disrespectful” to use creators’ work without their consent, especially since studios may use “generative AI to replace as many of the artists along the way as they can.” “Will this be used to exploit and harm artists? Yes, absolutely,” Wiskus said.

72%

The Daily's Verdict

This author has a mixed reputation for journalistic standards. It is advisable to fact-check, scrutinize for bias, and check for conflicts of interest before relying on the author's reporting.

Bias

80%

Examples:

  • Proof News found that subtitles from over 173,536 YouTube videos were used to train AI.
  • Some of the material used to train AI also promoted conspiracies such as the ‘flat-earth theory.’
  • Tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators.

Conflicts of Interest

100%

Examples:

  • If AI companies are paid, I should be compensated for the use of my data.
  • It’s theft.
  • No one came to me and said, ‘We would like to use this,’

Contradictions

50%

Examples:

  • Proof News found that subtitles from over 173,536 YouTube videos were used by Silicon Valley heavyweights...
  • Some of the material used to train AI also promoted conspiracies such as the ‘flat-earth theory.’

Deceptions

30%

Examples:

  • If AI companies are paid, I should be compensated for the use of my data.
  • It’s theft.
  • No one came to me and said, ‘We would like to use this,’

Recent Articles

Tech Giants in Hot Water: Apple, Anthropic, Nvidia, and Salesforce Caught Using YouTube Subtitles Without Permission for AI Training

Broke On: Tuesday, 16 July 2024

Tech giants Apple, Anthropic, Nvidia, and Salesforce have been using YouTube videos without permission for training their AI models. The subtitles from over 170,000 YouTube videos were harvested from 'The Pile' dataset and used in training. This breach of YouTube's terms raises ethical concerns regarding data ownership and potential biases in AI outputs.