OpenAI, a leading artificial intelligence (AI) research laboratory, is developing a tool called Media Manager to let creators and content owners control how their works are used in AI research and training. The tool aims to address concerns from creators whose content has been used for model training without their consent. OpenAI has drawn criticism for scraping publicly available data from the web, including a recent lawsuit filed by eight prominent US newspapers.
Media Manager is expected to launch in 2025 and will let creators and content owners identify their works and specify how they want them included in, or excluded from, AI research and training. OpenAI hopes the tool will help establish an industry-wide standard, possibly through the steering committee the company recently joined.
OpenAI has taken steps to meet content creators halfway by allowing artists to opt out of having their work used in image-generating models and letting website owners indicate via the robots.txt standard whether their content can be scraped for AI model training. The company also continues to ink licensing deals with large content owners, including news organizations, stock media libraries, and Q&A sites like Stack Overflow.
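The robots.txt mechanism mentioned above works by listing rules per crawler user agent; GPTBot is the user agent OpenAI documents for its web crawler. A minimal sketch of how such a rule is written and interpreted, using Python's standard-library parser (the site URL and paths here are placeholders):

```python
from urllib import robotparser

# A robots.txt that opts the whole site out of OpenAI's documented
# web crawler (user agent "GPTBot") while leaving other crawlers alone.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is blocked everywhere; an unrelated crawler is still allowed.
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False
print(parser.can_fetch("OtherBot", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: it expresses the site owner's wishes, and it is up to each crawler to honor them.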
Some creators have described OpenAI's opt-out workflow for images as onerous and have criticized how little the company pays for content. To address these gaps, third parties are attempting to build universal provenance and opt-out tools for generative AI. Spawning AI, for instance, offers an app that identifies and tracks bots' IP addresses to block scraping attempts, along with a database where artists can register their works to disallow training by vendors who choose to honor the requests. Steg.AI and Imatag help creators establish ownership of their images by applying watermarks imperceptible to the human eye, while Nightshade "poisons" image data to make it useless or disruptive for AI model training.
Media Manager responds to mounting criticism of OpenAI's approach to developing AI, which leans heavily on data scraped from the public web. The company argues that fair use shields this practice; not everyone agrees.
As OpenAI works on Media Manager and other solutions to address content creators' concerns, the debate around ethically sourced training data continues to gain momentum. Some advocates argue for a regime where AI companies only train algorithms on data with explicit permission from creatives and rights holders.