Data Partner Programme
Your Codebase Trains
the World's Best AI Models.

DATALENT is building the world's largest private codebase training dataset — 50,000+ repositories and growing. If you own production code, we want to partner with you. Your codebase becomes part of something that genuinely moves AI forward.

🔒 Your IP stays yours
NDA before any review
🕒 Response in 5 business days
💰 Compensation discussed on call
partner_pipeline.log (live)
FinanceEngine v3 · Python · 380K LOC · Fintech · Live
HealthAPI Platform · Go · 220K LOC · Healthcare · Live
LogisticsMap Core · TypeScript · 195K LOC · SaaS · In Review
DataCompute Engine · Rust · 140K LOC · Infrastructure · Submitted
Why Partner With Us

What Happens When You Contribute Your Code

Your production codebase contains patterns, decisions, and engineering context that no synthetic data generator can replicate.

🧠
Your Code Becomes Training Signal
We extract 4,055+ structured LLM training tasks from your repository: PR reviews, bug fixes, code explanations, test generation, and security analysis. Each task is verified by senior engineers before it is used to train any model.
🌎
Real Engineering Context, Preserved
Commit messages, review comments, and architectural decisions in your codebase carry reasoning that doesn't exist on public GitHub. Your engineering culture becomes part of how AI understands software.
💰
Fair Compensation, Discussed Directly
Every codebase is different in size, domain, language, and uniqueness. We don't post prices on a website. Compensation is discussed personally on a 15-minute discovery call after our initial review of your submission.
🔒
Your IP is Fully Protected
You retain ownership of your codebase. We sign an NDA before any technical review. AI buyers never receive your source code — only the structured training tasks extracted from it.
Simple, Transparent Process
Submit in minutes. We respond within 5 business days with a clear decision. No complex legal hurdles or multi-month negotiations. If your codebase qualifies, the process from submission to agreement is straightforward.
🏆
Join the World's Largest Collection
Your codebase joins 50,000+ others — the largest private codebase training dataset in the world. AI models trained on this data will power code assistants, debugging tools, and engineering AI for years to come.
Who Qualifies

What We're Looking For

We evaluate every codebase individually. These are the categories we actively source right now.

🏢
Production SaaS Platforms
Live or formerly live B2B/B2C SaaS codebases with real user traffic history. Engineering decisions made under real constraints are what we value most.
B2B SaaS · B2C Apps · Platforms · Microservices
📉
Financial & Fintech Systems
Payment processors, trading systems, risk engines, banking platforms. Financial code contains high-complexity logic with significant real-world consequences.
Payments · Trading · Risk · Banking
Healthcare & Clinical Software
EHR systems, medical device software, clinical data pipelines. Healthcare code operates under HIPAA constraints that produce unusually rigorous engineering standards.
EHR · HIPAA · Clinical · Medical Devices
Developer Tools & Infrastructure
CI/CD platforms, monitoring tools, databases, API gateways, cloud infrastructure. Developer-tool codebases tend to have the highest code quality and most instructive engineering patterns.
DevTools · Infrastructure · Databases · APIs
🔴
What doesn't qualify
Tutorial repos, personal side projects under 10K lines, code already public on GitHub, AI-generated codebases, or repositories without a real commit history. We can tell the difference — please don't submit these.
Process

From Submission to Agreement

Five steps. Designed to respect your time.

1
Step 01
Submit the Form Below
Tell us about your codebase — what it does, the tech stack, approximate size, and how you can provide access. Takes under 5 minutes. No code shared at this stage.
2
Step 02
We Review & Respond
Our technical team evaluates the submission against our current acquisition criteria. We respond within 5 business days with either an invitation to a discovery call, or a clear reason why we're passing.
What we evaluate: domain relevance, estimated task yield, language coverage, codebase uniqueness, and how it fills gaps in our existing collection.
3
Step 03
NDA & Discovery Call (15 min)
Before any technical review, we send and execute a mutual NDA. The discovery call is a short conversation to understand the codebase, discuss compensation expectations, and answer your questions. No commitment required at this stage.
4
Step 04
Technical Evaluation
With the NDA in place, we run our automated pipeline against a read-only copy of your repository. This generates a precise task yield estimate — how many high-quality training tasks your codebase will produce.
Your source code is never shared with AI buyers or third parties during or after this step. Access is restricted to our internal pipeline only.
5
Step 05
Agreement & Compensation
Based on the technical evaluation, we present a compensation proposal. Compensation depends on task yield, domain, language coverage, and exclusivity terms. We reach agreement on a data licensing contract, and payment is made on signing.
What We Source

Codebase Types in High Demand

These three categories receive the fastest evaluation and strongest interest right now.

Highest Priority
Complex Domain Systems
Production systems handling financial transactions, clinical data, logistics routing, or enterprise resource planning. The complexity and real-world stakes make these exceptionally valuable — they contain edge cases and decision logic that simpler code never shows.
Finance · Healthcare · Logistics · ERP
🌐
Strong Demand
Multi-Language Polyglot Repos
Codebases using 3+ languages, for example a Python data layer, a Go API, a TypeScript frontend, SQL schemas, and shell automation. Cross-language training signal is extremely hard to synthesise and fills a specific gap in our collection.
Python · Go · TypeScript · Rust · Java
📋
Active Collection
Long-Running Projects (5+ Years)
Older codebases have rich commit histories, extensive PR review threads, and documented architectural evolution. A codebase with 5+ years of history generates significantly more training signal per line of code — the history is the value.
Legacy Systems · Large History · Many Contributors
IP Protection

Exactly How We Protect Your Code

The most common concern is intellectual property. Here is exactly how we handle it — no ambiguity.

You Retain Full Ownership
The data licensing agreement covers the right to extract training tasks from your code. It does not transfer IP ownership. Your codebase remains entirely yours.
NDA Before Technical Access
We execute a mutual NDA before any member of our team reads your code. No exceptions. The NDA protects both your trade secrets and the confidentiality of the evaluation.
Source Code Never Shared with Buyers
AI buyers purchase structured training tasks: instruction-response pairs extracted from your code. They never receive or see your source code, and they cannot derive it from the training output.
Access Controlled and Time-Limited
Repository access during evaluation is read-only, time-limited, and fully logged. Access controls and time limits are documented in the licensing agreement.
PII and Secrets Automatically Removed
Our pipeline scans for and removes API keys, credentials, personal data, and other sensitive material before any task is generated. Nothing sensitive from your codebase appears in the training output.
Exclusivity Options Available
If you prefer that training tasks from your codebase be sold exclusively to a single buyer, or not sold to competitors, we offer exclusivity agreements. Exclusivity is reflected in the compensation terms.
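For readers who want a concrete picture of what "scans for and removes API keys, credentials, personal data" can mean, here is a minimal sketch in Python. It is an illustration only, not DATALENT's actual pipeline: the three patterns, the `redact` function, and the `<REDACTED:...>` placeholder format are all assumptions made for this example, and a production scanner would need far broader coverage (entropy checks, many more key formats, commit-history scanning).

```python
import re

# Hypothetical patterns for illustration only; real scanners use many more.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")            # AWS access key ID shape
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")       # simple email matcher
ASSIGNED = re.compile(                                    # password = "..." style
    r"""(?i)\b(password|secret|api_key|token)(\s*[:=]\s*)(["'])[^"']+\3"""
)


def redact(text):
    """Return (redacted_text, counts) where counts maps category -> hits."""
    counts = {}
    text, counts["aws_access_key"] = AWS_KEY.subn(
        "<REDACTED:aws_access_key>", text
    )
    # Keep the variable name, separator, and quotes; replace only the value.
    text, counts["assigned_secret"] = ASSIGNED.subn(
        r"\1\2\3<REDACTED:secret>\3", text
    )
    text, counts["email"] = EMAIL.subn("<REDACTED:email>", text)
    return text, counts


if __name__ == "__main__":
    sample = (
        'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
        'password = "hunter2"\n'
        "# contact: jane.doe@example.com\n"
    )
    cleaned, hits = redact(sample)
    print(cleaned)
    print(hits)
```

Returning per-category counts alongside the cleaned text is a deliberate choice in this sketch: it gives a reviewer a quick signal of how much sensitive material a repository contained, which matters when a codebase has significant PII exposure.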
Submit Your Codebase

Start the Conversation

No code shared at this stage. We review and respond within 5 business days.

Codebase Submission
Tell us the basics so we can evaluate fit. Takes under 5 minutes.

By submitting, you confirm this codebase is yours to license. We'll send an NDA before any technical review. Compensation is discussed on the discovery call — no pricing on this page.

FAQ

Common Questions

Do I have to give up ownership of my codebase?
No. You retain full intellectual property ownership. The agreement is a data licensing contract — you grant DATALENT the right to extract training tasks from your code and use them commercially. Ownership of the codebase stays with you entirely.
Can I submit a codebase that's still in active development?
Yes. Many of our partner codebases are live production systems. We take a snapshot at the time of the agreement — ongoing commits after that date are not included unless you choose to extend the arrangement.
How is compensation determined?
Compensation is based on actual task yield from the technical evaluation, plus factors like domain rarity, language coverage, codebase uniqueness, and exclusivity terms. We don't publish rates because every codebase is different. The discovery call is where we align on expectations before the technical evaluation begins.
Will AI models be able to reproduce my code?
No. Training tasks are instruction-response pairs that teach models to reason about code — they don't contain verbatim copies of your source files. We specifically structure the extraction to prevent source code reconstruction, and this is documented in the licensing agreement.
What if my codebase has customer data or credentials in the history?
Our pipeline automatically scans for and redacts PII, credentials, API keys, and customer-identifiable data before any task generation. This is standard practice. If your codebase has significant PII exposure, we'll discuss how to handle it — it doesn't automatically disqualify the submission.
My startup shut down. Can I still submit the codebase?
Yes — inactive codebases from shut-down companies are often exactly what we're looking for. If you're a founder or CTO who retains rights to the codebase, you can submit it. You'll need to confirm you have authority to license it before the process goes further.

Ready to Contribute?

Submit your codebase in under 5 minutes. No code shared at this stage. Our team responds within 5 business days.

Submit Your Codebase →
Talk to Our Team First