Several tech giants, including Apple, Anthropic, Nvidia, and Salesforce, have been found to have trained their AI models on YouTube videos used without permission. Subtitles from over 170,000 YouTube videos, belonging to more than 48,000 channels, were harvested and included in a training dataset. Harvesting content in this way breaches YouTube's terms and conditions.
The subtitles are part of a dataset known as 'The Pile,' which the non-profit organization EleutherAI released for developers and academics. Because the dataset is publicly available, some of the largest tech companies have made use of it as well. Apple, for instance, used The Pile to train its OpenELM model before announcing new AI capabilities on iPhones and MacBooks.
The use of YouTube subtitles is contentious because it raises ethical concerns about data ownership and privacy. The dataset also reportedly contains profanity and biases against certain groups, which could influence the outputs of AI models trained on it.
Apple and other companies may have used the dataset in good faith, but they remain responsible for ensuring that their sources of training data are ethical and legal. YouTube has not yet responded to requests for comment on this matter.
This incident highlights the need for greater transparency and regulation in how training data for AI models is sourced. As the technology advances, companies need to prioritize ethical practices and respect the intellectual property rights of content creators.