Generative artificial intelligence has a data problem.
For years, the standard approach to building gen AI models has been to gather as much data as possible by scraping vast swaths of the web, training at massive scale and dealing with the consequences later. The result has been increasingly powerful technology, but also growing concerns about bias, consent, ownership and the uneven distribution of value created from the world’s data.
Mozilla Data Collective was created to fill the gaps in that model.
The organization, which launched last November, is attempting to create a different kind of marketplace for AI data, built around community ownership, consent and what founder and Chief Executive E.M. Lewis-Jong calls “fair value exchange.”
“We want clean, abundant, contextualized, consentful datasets to build AI models worth having,” Lewis-Jong said in a recent email interview. “It’s a giant, structural problem, and it requires a structural solution.”
Datasets assembled through indiscriminate web scraping often reproduce the same limitations and biases found online, Lewis-Jong said. Entire languages, cultures and communities remain underrepresented in modern AI systems, while many creators have little visibility into how their content is used. Governments worldwide are also increasingly scrutinizing the legal foundations of large-scale data collection, creating new compliance challenges for technology companies.
Mozilla Data Collective addresses those issues by putting communities directly into the data supply chain. Rather than treating data as a resource to be extracted, the organization views it as something that should remain under the control of the people who create it.
Rooted in speech
The concept emerged partly from Mozilla’s experience with Common Voice, its long-running initiative to collect speech data from volunteers worldwide. Common Voice demonstrated that people are willing to contribute data when they believe their contributions are meaningful and they have a voice in how the project is governed. More than a half-million contributors have participated across hundreds of languages, helping create one of the world’s largest publicly available voice datasets.
The rise of generative AI complicated that equation. Communities that had enthusiastically contributed data began asking tougher questions about who ultimately benefits from open datasets once they are absorbed into increasingly concentrated and opaque AI ecosystems. Some continued to favor fully open licensing models, while others wanted more transparency, control or compensation. Mozilla Data Collective has created licenses and policies to accommodate those different preferences.
In its model, sovereignty doesn’t necessarily mean restricting access. Instead, it gives communities the power to decide for themselves how their data will be used. Contributors can choose to share data openly, require attribution, limit use to educational or research purposes, restrict access geographically or seek compensation. The critical principle is that those decisions belong to data creators rather than to an intermediary platform.
The organization argues that this approach is increasingly necessary as AI systems expand into languages and cultural contexts that have historically received little attention from technology companies.
Today, the collective hosts hundreds of curated datasets representing more than 300 languages. Its collection includes Hazaragi literature from Afghanistan, oral histories in the Mada language from Cameroon, and Romansh newspapers from Switzerland. Many of these resources would be difficult or impossible to find through conventional commercial data channels.
‘Mission-locked’ enterprise
The organization’s unusual governance structure is meant to reinforce that mission. Mozilla Data Collective operates as what Lewis-Jong describes as a “mission-locked British social enterprise.” That means “our purpose is baked into our governance structure at multiple levels,” Lewis-Jong said. “We exist to give communities ownership and agency over their data, and enable them to define and drive fair value exchange on their own terms.”
The structure was chosen to avoid what the organization sees as the limitations of both traditional nonprofit and conventional for-profit models. Nonprofits can struggle to build sustainable infrastructure at scale, while venture-backed startups face pressure to prioritize growth and monetization over community interests.
The collective’s success is measured both by financial performance and by mission-related objectives. Lewis-Jong said this alignment is important because many technology companies eventually encounter tension between their stated mission and the incentives created by their revenue models. “We’re held to a double bottom line,” he said. “If we don’t hit our mission stage gates, we don’t get to exist.”
With a $10 million initial commitment from the Mozilla Foundation, Mozilla Data Collective has some wiggle room on revenue. It doesn’t take a percentage of the fees communities choose to charge for their datasets. Instead, contributors receive the full amount, while downloaders pay a separate platform fee to cover infrastructure and operating costs. The goal, Lewis-Jong said, is to encourage transparency and collective bargaining rather than the opaque brokerage arrangements that often characterize data markets.
Creator control
The organization also places significant emphasis on curation and quality control. Every participating organization and dataset is reviewed before being accepted onto the platform. Copyrighted content is rejected if uploaders don’t hold the necessary rights, and fair-use claims aren’t accepted as justification for distribution. The platform combines legal, technical and community safeguards intended to provide clear information about a dataset’s provenance and permissions.
Recent platform capabilities give data contributors greater control over access and compensation. Among them are tools that allow dataset owners to approve access requests, a conversational assistant that helps developers discover relevant datasets and a forthcoming compensation system that will enable contributors to set licensing terms and pricing.
The long-term vision is not necessarily to compete directly with the large data brokers that currently dominate AI training pipelines, Lewis-Jong said. Instead, the group sees itself as creating an alternative model that connects developers with communities historically overlooked by mainstream data markets. He described the platform less as a broker and more as a bridge.
Mozilla Data Collective is betting that the future of AI will require more than bigger models and larger datasets. It will require new institutions that balance innovation with consent, participation and trust to ensure the people who create the world’s data have a meaningful role in determining how it’s used.

