Jordan Meyer and Mathew Dryhurst founded Spawning AI to develop tools that help artists control how their works are used online. Their latest project, Source.Plus, aims to curate "non-infringing" media for AI model training.
Source.Plus' first initiative features a dataset of nearly 40 million images that are either in the public domain or released under the Creative Commons' CC0 license, which lets creators waive almost all legal interest in their works. Despite its smaller size compared to other generative AI training datasets, Meyer claims Source.Plus' dataset is already "high-quality" enough to train state-of-the-art image-generating models.
"With Source.Plus, we're creating a universal 'opt-in' platform," Meyer said. "We aim to make it easy for rights holders to offer their media for generative AI training on their terms, and seamless for developers to integrate that media into their training workflows."
Rights Management
The ethical debate around training generative AI models, especially art-generating models like Stable Diffusion and OpenAI's DALL-E 3, remains unresolved and has significant implications for artists.
Generative AI models "learn" to create outputs (e.g., photorealistic art) by training on vast quantities of data. Some developers argue fair use allows them to scrape data from public sources, regardless of copyright status. Others have tried compensating or crediting content owners for their contributions to training sets.
Meyer, Spawning's CEO, believes no one has settled on the best approach yet.
"AI training often defaults to using the easiest available data, which hasn't always been the most fair or responsibly sourced," he told TechCrunch in an interview. "Artists and rights holders have had little control over how their data is used for AI training, and developers have lacked high-quality alternatives that respect data rights."
Source.Plus, in limited beta, builds on Spawning's existing tools for art provenance and usage rights management.
In 2022, Spawning launched HaveIBeenTrained, a site that lets creators opt out of training datasets used by vendors partnered with Spawning, like Hugging Face and Stability AI. After raising $3 million from investors including True Ventures and Seed Club Ventures, Spawning introduced ai.txt, which lets websites "set permissions" for AI, and Kudurru, a tool for defending against data-scraping bots.
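In practice, honoring those opt-outs means a training pipeline has to check each candidate work against an opt-out registry before ingesting it. The sketch below illustrates that flow in broad strokes only; the registry URL, response shape, and field names are placeholder assumptions, not Spawning's documented API.

```python
import requests

# Placeholder opt-out registry endpoint; NOT Spawning's documented API.
OPT_OUT_REGISTRY = "https://registry.example.com/api/v1/opt-outs"


def is_opted_out(image_url: str) -> bool:
    """Ask the registry whether the rights holder opted this URL out of training."""
    resp = requests.get(OPT_OUT_REGISTRY, params={"url": image_url}, timeout=10)
    resp.raise_for_status()
    # Assumed response shape: {"opted_out": true | false}
    return bool(resp.json().get("opted_out", False))


def filter_candidates(candidate_urls: list[str]) -> list[str]:
    """Keep only works whose rights holders have not opted out."""
    return [url for url in candidate_urls if not is_opted_out(url)]


if __name__ == "__main__":
    urls = [
        "https://example.org/artwork/0001.jpg",
        "https://example.org/artwork/0002.jpg",
    ]
    print(filter_candidates(urls))
```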
Source.Plus is Spawning's first effort to build and curate a media library in-house. The initial PD/CC0 image dataset can be used for commercial or research purposes, Meyer says.
"Source.Plus isn't just a repository for training data; it's an enrichment platform supporting the training pipeline," he continued. "Our goal is to offer a high-quality, non-infringing CC0 dataset capable of supporting a powerful base AI model within the year."
Organizations like Getty Images, Adobe, Shutterstock, and AI startup Bria claim to use only fairly sourced data for model training. (Getty even calls its generative AI products "commercially safe.") But Meyer says Spawning aims to set a "higher bar" for fair data sourcing.
Source.Plus filters images for "opt-outs" and other artist preferences and displays provenance information. It excludes images that aren't licensed under CC0, including those requiring attribution. Spawning also monitors for copyright challenges from sources like Wikimedia Commons, where the copyright status of a work is often indicated by someone other than its creator.
"We meticulously validated the reported licenses of the images we collected, excluding any questionable licenses, a step many 'fair' datasets don't take," Meyer said.
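To make that kind of license gate concrete, here is a minimal sketch of a CC0/public-domain filter. The record fields, license identifiers, and challenge flag are illustrative assumptions, not Source.Plus' actual schema.

```python
from dataclasses import dataclass

# Accepted in this sketch: CC0 and public-domain marks only.
# Attribution-required licenses (e.g. CC BY) are deliberately excluded.
ALLOWED_LICENSES = {"cc0-1.0", "pdm-1.0"}


@dataclass
class ImageRecord:
    url: str
    license: str                         # machine-readable license identifier
    source: str                          # e.g. "wikimedia-commons"
    copyright_challenged: bool = False   # copyright status disputed by a third party


def is_trainable(record: ImageRecord) -> bool:
    """Keep an image only if its license is CC0/PD and its status is unchallenged."""
    return record.license.lower() in ALLOWED_LICENSES and not record.copyright_challenged


records = [
    ImageRecord("https://example.org/a.jpg", "CC0-1.0", "wikimedia-commons"),
    ImageRecord("https://example.org/b.jpg", "CC-BY-4.0", "wikimedia-commons"),
    ImageRecord("https://example.org/c.jpg", "PDM-1.0", "wikimedia-commons", copyright_challenged=True),
]

print([r.url for r in records if is_trainable(r)])  # only a.jpg survives
```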
Historically, problematic images, including violent and pornographic ones, have plagued training datasets.
Adobe touts its Firefly AI as more ethical than rivals like Midjourney. But it was actually trained on images from them. https://t.co/Ep1eadjQML
- Bloomberg Technology (@technology), April 12, 2024
The LAION dataset maintainers had to pull one library offline after reports uncovered medical records and child sexual abuse depictions. Recently, a Human Rights Watch study found one of LAION's repositories included Brazilian children's faces without their consent. And Adobe Stock, used to train Adobe's Firefly Image model, contained AI-generated images from rivals like Midjourney.
Spawning's solution includes classifier models that detect nudity, gore, personal information, and other undesirable content. Recognizing that no classifier is perfect, Spawning plans to let users adjust the classifiers' detection thresholds, Meyer says.
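As a rough illustration of those user-adjustable thresholds, the sketch below assumes each image has already been scored by per-category safety classifiers; the category names and cutoff values are placeholders, not Spawning's actual models or defaults.

```python
# Hypothetical per-category scores in [0, 1] produced by upstream classifiers.
DEFAULT_THRESHOLDS = {"nudity": 0.5, "gore": 0.5, "personal_info": 0.5}


def passes_safety_filters(scores: dict[str, float],
                          thresholds: dict[str, float] | None = None) -> bool:
    """Reject an image if any classifier score meets or exceeds its threshold."""
    thresholds = thresholds or DEFAULT_THRESHOLDS
    return all(scores.get(category, 0.0) < cutoff for category, cutoff in thresholds.items())


# A stricter user lowers the cutoffs; a more permissive one raises them.
strict = {"nudity": 0.2, "gore": 0.2, "personal_info": 0.1}
image_scores = {"nudity": 0.15, "gore": 0.05, "personal_info": 0.3}

print(passes_safety_filters(image_scores))          # True with the default thresholds
print(passes_safety_filters(image_scores, strict))  # False: personal_info 0.3 >= 0.1
```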
"We employ moderators to verify data ownership," Meyer added. "We also have remediation features where users can flag offending or possibly infringing works, and the data consumption trail can be audited."
Compensation
Programs compensating creators for generative AI training data contributions have had mixed results. Some rely on opaque metrics, while others pay unreasonably low amounts.
For example, Shutterstock's contributor fund for artwork used to train generative AI models or licensed to third-party developers isn't transparent about earnings, nor does it allow artists to set their own terms. One estimate pegs earnings at around $15 for 2,000 images, which works out to less than a cent per image.
Once Source.Plus exits beta and expands beyond PD/CC0 datasets, it will differ from other platforms by allowing artists to set their own prices per download. Spawning will charge a flat-rate fee, a "tenth of a penny," Meyer says.
Customers can also pay $10 per month, plus the typical per-image download fee, for Source.Plus Curation, a subscription plan offering private image collection management, up to 10,000 monthly downloads, and early access to new features like "premium" collections and data enrichment.
"We provide guidance and recommendations based on industry standards and internal metrics, but contributors ultimately determine their own terms," Meyer said. "This pricing model intentionally gives artists the lion's share of revenue and allows them to set their own terms for participation. We believe this approach leads to higher payouts and greater transparency."
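To put rough numbers on that model, here is a back-of-the-envelope sketch using the figures above: an artist-set per-download price, a flat "tenth of a penny" ($0.001) fee for Spawning, and the $10-per-month Curation subscription. The example price is an assumption, as is the way the flat fee is deducted from the artist's price; only the fee and subscription amounts come from the article.

```python
SPAWNING_FLAT_FEE = 0.001   # "a tenth of a penny" per download (from the article)
CURATION_MONTHLY = 10.00    # Source.Plus Curation subscription (from the article)


def artist_payout(price_per_download: float, downloads: int) -> float:
    """Artist revenue, assuming Spawning's flat fee is deducted from each sale."""
    return (price_per_download - SPAWNING_FLAT_FEE) * downloads


def buyer_cost(price_per_download: float, downloads: int, curation: bool = True) -> float:
    """What a customer pays: per-image download fees plus the optional subscription."""
    return price_per_download * downloads + (CURATION_MONTHLY if curation else 0.0)


# Assumed example: an artist prices downloads at 5 cents each.
price, downloads = 0.05, 2_000
print(f"artist earns ${artist_payout(price, downloads):.2f}")  # $98.00
print(f"buyer pays   ${buyer_cost(price, downloads):.2f}")     # $110.00
```

Even at that assumed nickel-per-download price, 2,000 downloads would pay an artist far more than the roughly $15 per 2,000 images estimated for Shutterstock's fund.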
If Source.Plus gains the traction Spawning hopes it will, the company plans to expand it to other media types, including audio and video. Spawning is in talks with firms to make their data available on Source.Plus, and it might build its own generative AI models using Source.Plus datasets.
"We hope rights holders wanting to participate in the generative AI economy can receive fair compensation," Meyer said. "We also hope artists and developers conflicted about engaging with AI can do so respectfully."
Spawning has a niche to carve out. Source.Plus seems like a promising attempt to involve artists in generative AI development and let them profit from their work.
As my colleague Amanda Silberling recently wrote, the rise of apps like Cara, which saw a surge after Meta announced it might train generative AI on Instagram content, shows the creative community is at a breaking point. Creators are seeking alternatives to platforms they perceive as exploitative, and Source.Plus might be a viable one.
However, even if Spawning always acts in artists' best interests (a big if, considering it's VC-backed), it's uncertain whether Source.Plus can scale as successfully as Meyer envisions. Social media has shown that moderating millions of pieces of user-generated content is a challenging problem.