Spawning AI’s Mission to Create Ethical AI Datasets


Jordan Meyer and Mathew Dryhurst founded Spawning AI to develop tools that help artists control how their works are used online. Their latest project, Source.Plus, aims to curate "non-infringing" media for AI model training.

Source.Plus' first initiative features a dataset with nearly 40 million public domain images and images under the Creative Commons' CC0 license, which allows creators to waive almost all legal interest in their works. Despite its smaller size compared to other generative AI training datasets, Meyer claims Source.Plus' dataset is already "high-quality" enough to train state-of-the-art image-generating models.

"With Source.Plus, we're creating a universal 'opt-in' platform," Meyer said. "We aim to make it easy for rights holders to offer their media for generative AI training on their terms and seamless for developers to integrate that media into their training workflows."

Rights Management


The ethical debate around training generative AI models, especially art-generating models like Stable Diffusion and OpenAI's DALL-E 3, remains unresolved and has significant implications for artists.

Generative AI models "learn" to create outputs (e.g., photorealistic art) by training on vast quantities of data. Some developers argue fair use allows them to scrape data from public sources, regardless of copyright status. Others have tried compensating or crediting content owners for their contributions to training sets.

Meyer, Spawning's CEO, believes no one has settled on the best approach yet.

"AI training often defaults to using the easiest available data, which hasn't always been the most fair or responsibly sourced," he told TechCrunch in an interview. "Artists and rights holders have had little control over how their data is used for AI training, and developers have lacked high-quality alternatives that respect data rights."

Source.Plus, in limited beta, builds on Spawning's existing tools for art provenance and usage rights management.

In 2022, Spawning launched HaveIBeenTrained, a site allowing creators to opt out of training datasets used by vendors partnered with Spawning, like Hugging Face and Stability AI. After raising $3 million from investors, including True Ventures and Seed Club Ventures, Spawning introduced ai.txt, which lets websites "set permissions" for AI, and Kudurru, a tool to defend against data-scraping bots.
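To make the permissions idea concrete, here is a minimal Python sketch of how a data-collection pipeline might consult an ai.txt-style file before ingesting a site's media. The file path, directive names, and parsing rules below are illustrative assumptions, not Spawning's published specification.

```python
# Illustrative sketch only: the directive names and layout of this ai.txt-style
# file are assumptions, not Spawning's actual format.
from urllib.request import urlopen
from urllib.parse import urljoin


def fetch_ai_permissions(site_url: str) -> dict:
    """Fetch a hypothetical ai.txt-style file and return media-type permissions."""
    permissions = {}
    try:
        with urlopen(urljoin(site_url, "/ai.txt"), timeout=10) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                line = line.split("#", 1)[0].strip()  # drop comments
                if ":" not in line:
                    continue
                key, value = (part.strip().lower() for part in line.split(":", 1))
                # Assumed directives such as "images: disallow" or "text: allow".
                permissions[key] = value
    except OSError:
        pass  # no ai.txt found; the caller decides the default policy
    return permissions


if __name__ == "__main__":
    perms = fetch_ai_permissions("https://example.com")
    if perms.get("images") == "disallow":
        print("Skip this site's images for AI training.")
```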

Source.Plus is Spawning's first effort to build and curate a media library in-house. The initial PD/CC0 image dataset can be used for commercial or research purposes, Meyer says.

"Source.Plus isn't just a repository for training data; it's an enrichment platform supporting the training pipeline," he continued. "Our goal is to offer a high-quality, non-infringing CC0 dataset capable of supporting a powerful base AI model within the year."

Organizations like Getty Images, Adobe, Shutterstock, and AI startup Bria claim to use only fairly sourced data for model training. (Getty even calls its generative AI products "commercially safe.") But Meyer says Spawning aims to set a "higher bar" for fair data sourcing.

Source.Plus filters images for "opt-outs" and other artist preferences and displays provenance information. It excludes images not licensed under CC0, including those that merely require attribution. Spawning also monitors sources like Wikimedia Commons for copyright challenges, where a party other than the creator disputes an image's copyright status.

"We meticulously validated the reported licenses of the images we collected, excluding any questionable licenses — a step many 'fair' datasets don't take," Meyer said.
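As a rough sketch of what this kind of license and opt-out filtering could look like, the Python below keeps only public-domain/CC0 records with no opt-out or copyright dispute. The metadata field names are hypothetical; Source.Plus' actual schema and validation pipeline aren't described beyond what the article reports.

```python
# Minimal sketch of license- and opt-out-based filtering. The "license",
# "opted_out", and "copyright_disputed" fields are assumed metadata, not
# Source.Plus' real schema.
from dataclasses import dataclass


@dataclass
class ImageRecord:
    url: str
    license: str              # e.g. "CC0", "PD", "CC-BY"
    opted_out: bool           # creator opted out via a registry like HaveIBeenTrained
    copyright_disputed: bool  # e.g. flagged on Wikimedia Commons


ALLOWED_LICENSES = {"CC0", "PD"}  # attribution-required licenses are excluded


def is_trainable(record: ImageRecord) -> bool:
    """Keep only public-domain/CC0 images with no opt-out or copyright dispute."""
    return (
        record.license in ALLOWED_LICENSES
        and not record.opted_out
        and not record.copyright_disputed
    )


records = [
    ImageRecord("https://example.org/a.jpg", "CC0", False, False),
    ImageRecord("https://example.org/b.jpg", "CC-BY", False, False),  # needs attribution
    ImageRecord("https://example.org/c.jpg", "PD", True, False),      # creator opted out
]
dataset = [r for r in records if is_trainable(r)]
print(len(dataset))  # -> 1
```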

Historically, problematic images, including violent and pornographic ones, have plagued training datasets.


The LAION dataset maintainers had to pull one library offline after reports uncovered medical records and depictions of child sexual abuse. Recently, a Human Rights Watch study found one of LAION's repositories included Brazilian children's faces without their consent. Adobe Stock, used to train Adobe's Firefly Image model, contained AI-generated images from rivals like Midjourney.

Spawning's solution includes classifier models detecting nudity, gore, personal information, and other undesirable content. Recognizing no classifier is perfect, Spawning plans to let users adjust the classifiers' detection thresholds, Meyer says.
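A minimal sketch of how threshold-adjustable safety filtering might work is below. The classifier is a stub returning made-up scores, and the category names and default thresholds are assumptions rather than Spawning's actual models.

```python
# Sketch of threshold-adjustable safety filtering. classify() is a stand-in for
# a real classifier; Spawning's actual models, categories, and scores are not
# specified beyond nudity, gore, and personal information.
from typing import Dict, Optional

DEFAULT_THRESHOLDS = {"nudity": 0.5, "gore": 0.5, "personal_info": 0.5}


def classify(image_bytes: bytes) -> Dict[str, float]:
    """Stub classifier: returns a score per category in [0, 1]."""
    return {"nudity": 0.02, "gore": 0.01, "personal_info": 0.70}


def passes_filters(image_bytes: bytes,
                   thresholds: Optional[Dict[str, float]] = None) -> bool:
    """Reject an image if any category score meets or exceeds its (user-adjustable) threshold."""
    thresholds = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    scores = classify(image_bytes)
    return all(scores[cat] < thresholds.get(cat, 1.0) for cat in scores)


# A stricter user lowers a threshold; a more lenient one raises it.
print(passes_filters(b"..."))                          # False (0.70 >= 0.50)
print(passes_filters(b"...", {"personal_info": 0.9}))  # True
```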

"We employ moderators to verify data ownership," Meyer added. "We also have remediation features where users can flag offending or possibly infringing works, and the data consumption trail can be audited."

Compensation


Programs compensating creators for generative AI training data contributions have had mixed results. Some rely on opaque metrics, while others pay unreasonably low amounts.

For example, Shutterstock's contributor fund for artwork used to train generative AI models or licensed to third-party developers isn't transparent about earnings, nor does it allow artists to set their own terms. One estimate pegs earnings at $15 for 2,000 images, less than a cent per photo.

Once Source.Plus exits beta and expands beyond PD/CC0 datasets, it will differ from other platforms by allowing artists to set their own prices per download. Spawning will charge a flat fee of a "tenth of a penny" per download, Meyer says.

Customers can also pay $10 per month, plus the typical per-image download fee, for Source.Plus Curation, a subscription plan offering private image collection management, up to 10,000 monthly downloads, and early access to new features like "premium" collections and data enrichment.

"We provide guidance and recommendations based on industry standards and internal metrics, but contributors ultimately determine their own terms," Meyer said. "This pricing model intentionally gives artists the lion's share of revenue and allows them to set their own terms for participation. We believe this approach leads to higher payouts and greater transparency."
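For a sense of scale, here is a back-of-the-envelope Python sketch of that pricing model. The artist's per-download price and the assumption that Spawning's fee is added on top of it (rather than deducted from it) are illustrative guesses; only the "tenth of a penny" figure comes from Meyer.

```python
# Back-of-the-envelope sketch of the pricing described above. The artist price
# and the "fee added on top" reading are assumptions for illustration.
PLATFORM_FEE = 0.001  # dollars per download ("a tenth of a penny")


def per_download_cost(artist_price: float) -> tuple[float, float, float]:
    """Return (buyer_pays, artist_receives, platform_keeps) for one download."""
    return artist_price + PLATFORM_FEE, artist_price, PLATFORM_FEE


buyer, artist, platform = per_download_cost(0.05)  # hypothetical 5 cents per image
downloads = 2_000
print(f"Artist earns ${artist * downloads:.2f} on {downloads} downloads")  # $100.00
print(f"Spawning earns ${platform * downloads:.2f}")                       # $2.00
# Compare with the roughly $15 per 2,000 images estimated for Shutterstock's fund above.
```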

Spawning AI's Mission

If Source.Plus gains the traction Spawning hopes for, the company plans to expand it to other media types, including audio and video. Spawning is in talks with firms to make their data available on Source.Plus and might build its own generative AI models using Source.Plus datasets.

"We hope rights holders wanting to participate in the generative AI economy can receive fair compensation," Meyer said. "We also hope artists and developers conflicted about engaging with AI can do so respectfully."

Spawning has a niche to carve out. Source.Plus seems like a promising attempt to involve artists in generative AI development and let them profit from their work.

As my colleague Amanda Silberling recently wrote, the rise of apps like Cara, which saw a surge after Meta announced it might train generative AI on Instagram content, shows the creative community is at a breaking point. They're seeking alternatives to platforms they perceive as exploitative, and Source.Plus might be a viable one.

However, even if Spawning always acts in artists' best interests (a big if, considering Spawning is VC-backed), it's uncertain whether Source.Plus can scale as successfully as Meyer envisions. Social media has shown that moderating millions of pieces of user-generated content is a challenging problem.

