Tech Giants in Hot Water: Apple, Anthropic, Nvidia, and Salesforce Caught Using YouTube Subtitles Without Permission for AI Training

San Francisco, California, United States of America
Apple used the Pile dataset to train its OpenELM model before announcing new AI capabilities.
Tech giants Apple, Anthropic, Nvidia, and Salesforce used YouTube subtitles without permission for AI training.
The dataset, 'The Pile', contains subtitles from over 170,000 YouTube videos.
Use of YouTube subtitles raises ethical concerns regarding data ownership and privacy.

In recent news, several tech giants, including Apple, Anthropic, Nvidia, and Salesforce, have been found to have used YouTube subtitles without permission to train their AI models. Subtitles from over 170,000 YouTube videos belonging to more than 48,000 channels were harvested and included in the dataset, reportedly in breach of YouTube's terms and conditions.

The dataset, known as 'The Pile,' was provided by a non-profit organization called EleutherAI for developers and academics. However, it appears that some of the largest tech companies have also made use of this publicly available dataset. Apple, for instance, used The Pile to train its OpenELM model before announcing new AI capabilities on iPhones and MacBooks.

The use of YouTube subtitles is a contentious issue as it raises ethical concerns regarding data ownership and privacy. Additionally, the dataset reportedly contains profanity and biases against certain groups, which could potentially influence the AI models' outputs.

It is important to note that while Apple and other companies may have used the dataset in good faith, they are still responsible for ensuring that their sources of training data are ethical and legal. YouTube has not yet responded to requests for comment on this matter.

This incident highlights the need for greater transparency and regulation in the use of training data for AI models. As technology continues to advance, it is crucial that companies prioritize ethical practices and respect the intellectual property rights of content creators.



Confidence

91%

Doubts
  • Is it confirmed that the companies knew they didn't have permission to use the subtitles?
  • Were all of the subtitles in 'The Pile' harvested, or just a subset?
  • What specific ethical concerns does this raise beyond data ownership and privacy?

Sources

97%

  • Unique Points
    • Apple, Nvidia, and Salesforce used subtitle files from YouTube videos without consent to train AI models
    • Subtitles from 173,536 YouTube videos across 48,000 channels were used in the dataset
    • The non-profit EleutherAI provided the dataset for developers and academics
    • Apple used the Pile dataset to train its OpenELM model before announcing new AI capabilities on iPhones and MacBooks
  • Accuracy
    No Contradictions at Time Of Publication
  • Deception (100%)
    None Found At Time Of Publication
  • Fallacies (85%)
    The article contains an example of a dichotomous depiction and an appeal to authority. It presents creators' videos as being used without their consent, implying a negative situation for the content creators. Additionally, it references the companies' research papers and posts to show that they used the Pile dataset but does not present counterarguments or alternative viewpoints on this matter.
    • A number of tech giants, including Apple, trained AI models on YouTube videos without the consent of the creators...
    • Apple, Nvidia, and Salesforce—companies valued in the hundreds of billions and trillions of dollars—describe in their research papers and posts how they used the Pile to train AI.
    • An investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.
  • Bias (100%)
    None Found At Time Of Publication
  • Site Conflicts Of Interest (100%)
    None Found At Time Of Publication
  • Author Conflicts Of Interest (100%)
    None Found At Time Of Publication

72%

  • Unique Points
    • Proof News found that subtitles from 173,536 YouTube videos were used by Anthropic, Nvidia, Apple, and Salesforce to train AI without the creators' consent.
    • Some material used to train AI promoted conspiracies such as the ‘flat-earth theory’.
    • David Pakman, host of The David Pakman Show with over 2 million subscribers and 2 billion views, had nearly 160 of his videos taken for training.
  • Accuracy
    • Proof News found that subtitles from over 173,536 YouTube videos were used to train AI.
    • The dataset called YouTube Subtitles contains video transcripts from channels like Khan Academy, MIT, and Harvard.
  • Deception (30%)
    The article engages in sensationalism by implying that the use of YouTube videos for AI training is a 'controversial tactic' and that it was done 'unbeknownst to the creators'. The article also selectively reports information by focusing on specific companies and individuals, while ignoring the fact that many reputable organizations such as Khan Academy, MIT, Harvard, NPR, BBC, The Wall Street Journal were also included in the dataset. The authors make editorializing statements by implying that using YouTube videos for AI training is a bad thing without providing any evidence to support this claim.
    • Our investigation found that subtitles from 173,536 YouTube videos were used by Silicon Valley heavyweights...
    • Tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators.
    • Some of the material used to train AI also promoted conspiracies such as the ‘flat-earth theory.’
  • Fallacies (80%)
    The authors commit an appeal to authority fallacy by stating that 'An investigation by Proof News found' without providing any evidence or reasoning as to why Proof News is a reliable source for this information. They also make inflammatory statements such as 'controversial tactics' and 'vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators' without providing any evidence or context.
    • An investigation by Proof News found
    • companies did so despite YouTube’s rules against harvesting materials from the platform without permission
  • Bias (80%)
    The authors of the article use language that depicts the companies using YouTube videos for AI training as 'controversial tactics' and 'vacuuming up' materials unbeknownst to creators. They also mention that some of the material used to train AI promoted conspiracies, which could be seen as an attempt to demean or discredit the companies involved.
    • Our investigation found that subtitles from 173,536 YouTube videos were used by Silicon Valley heavyweights...
    • Some of the material used to train AI also promoted conspiracies such as the ‘flat-earth theory.’
    • Tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators.
  • Site Conflicts Of Interest (100%)
    None Found At Time Of Publication
  • Author Conflicts Of Interest (100%)
    None Found At Time Of Publication

91%

  • Unique Points
    • Apple Intelligence may have been trained using controversial sources without permission
    • One of the sources, EleutherAI’s Pile dataset, includes YouTube subtitles downloaded without permission
    • Use of YouTube subtitles is a breach of YouTube terms and conditions
    • Several companies including Apple, Salesforce and Anthropic have used the Pile dataset for AI training
    • The Pile dataset reportedly contains profanity and biases against certain groups
  • Accuracy
    No Contradictions at Time Of Publication
  • Deception (100%)
    None Found At Time Of Publication
  • Fallacies (80%)
    The author makes an appeal to authority when quoting Jennifer Martinez and the spokesperson from Anthropic. They also use inflammatory rhetoric by stating that 'Apple Intelligence has seemingly been trained on YouTube subtitles it had no right to.' However, they do not provide any evidence that Apple knew or should have known about the unauthorized use of YouTube subtitles in their training data.
    • Apple Intelligence may have been trained less legally and ethically than Apple believed
  • Bias (80%)
    The author uses language that depicts the use of YouTube subtitles without permission as a 'gray area' and quotes Jennifer Martinez from Anthropic stating that there is a difference between using YouTube subtitles and using the videos. However, the author also mentions that Salesforce found profanity and biases against certain groups in the Pile dataset, which includes YouTube subtitles. The author does not explicitly state a bias towards or against Apple or any other company mentioned in the article, but by highlighting their use of controversial sources for training AI, the author may be implying a negative connotation.
    • However, it’s not only YouTube subtitles that have been gathered without permission. It’s claimed that Wikipedia has been used, as has documentation from the European Parliament.
    • It's apparently also a breach of YouTube terms and conditions, but that may be a more gray area than it should be.
    • Now, it’s claimed that the Pile used the text of those emails for its training.
    • Salesforce also confirmed that it had used the Pile in its building of an AI model for academic and research purposes.
  • Site Conflicts Of Interest (100%)
    None Found At Time Of Publication
  • Author Conflicts Of Interest (100%)
    None Found At Time Of Publication

100%

  • Unique Points
    • More than 170,000 YouTube videos were used to train AI systems for Apple, Anthropic, Nvidia, and Salesforce without permission.
    • The dataset contains subtitles taken from YouTube videos belonging to over 48,000 channels.
    • Over 100 videos from The Verge are included in the dataset.
  • Accuracy
    No Contradictions at Time Of Publication
  • Deception (100%)
    None Found At Time Of Publication
  • Fallacies (100%)
    None Found At Time Of Publication
  • Bias (100%)
    None Found At Time Of Publication
  • Site Conflicts Of Interest (100%)
    None Found At Time Of Publication
  • Author Conflicts Of Interest (100%)
    None Found At Time Of Publication