An hour of fleet recording costs hardware, fuel, and real-world time. The dataset you assemble from it can't be re-collected. Yet most teams can't query their own archive to find the edge cases that matter for the next model iteration. Petabytes of MCAP, ROS bags, video, lidar, and CAN-bus data sit in object storage, opaque once the upload completes. DataChain catalogs what's in the archive, mines edge cases on demand, and versions every training mix you assemble.
Same SDK from Python, Jupyter, or Claude Code / Cursor / Codex.
How it works
Point at S3, GCS, Azure, or Nebius. MCAP, ROS bags, parquet, and modality files stay where they are; DataChain indexes metadata, lineage, and processing history.
Query the catalog to find episodes by route, weather, or sensor version. Dispatch labeling and embedding across 1 to 1000+ machines in your VPC.
Every assembled mix lands in Dataset DB as a named, immutable version. The next VLA iteration reads what exists; regression reports point at the exact version.
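The "named, immutable version" idea can be sketched in a few lines of plain Python. This is an illustration only, not the DataChain SDK: `DatasetRegistry`, `DatasetVersion`, and the episode IDs are hypothetical names, and the store is an in-memory dictionary standing in for Dataset DB.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a training mix (hypothetical type)."""
    name: str
    version: int
    episode_ids: Tuple[str, ...]  # a tuple, so the contents cannot be edited in place


class DatasetRegistry:
    """Hypothetical in-memory stand-in for a versioned dataset store."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[DatasetVersion]] = {}

    def save(self, name: str, episode_ids: Iterable[str]) -> DatasetVersion:
        # Saving never overwrites: it appends the next version number.
        history = self._versions.setdefault(name, [])
        snapshot = DatasetVersion(name, len(history) + 1, tuple(episode_ids))
        history.append(snapshot)
        return snapshot

    def read(self, name: str, version: Optional[int] = None) -> DatasetVersion:
        history = self._versions[name]
        # Default to the latest version; a regression report would pin one.
        return history[-1] if version is None else history[version - 1]


registry = DatasetRegistry()
v1 = registry.save("highway-rain-mix", ["ep-001", "ep-002"])
v2 = registry.save("highway-rain-mix", ["ep-001", "ep-002", "ep-007"])

pinned = registry.read("highway-rain-mix", version=1)  # what a report would cite
latest = registry.read("highway-rain-mix")             # what the next iteration reads
```

Adding an episode creates version 2 and leaves version 1 untouched, so a report that cites version 1 keeps pointing at the same episodes.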
🎯 Retroactive mining: recall episodes without re-running the fleet
Petabyte archives are write-once, read-rarely. Every model iteration needs new training mixes; every regression needs the edge cases that surfaced last month. DataChain indexes what's recorded in the archive so you can query it like a database. “All rain events on highways with sensor stack v5 and CAN-bus anomalies” becomes a 30-second filter, not a 3-week re-process.
- Edge cases as queries: filter, join, and similarity-search across millions of episodes by route, weather, sensor version, label coverage, or anomaly signature.
- Versioned training mixes: each assembled mix lands in Dataset DB as a named, immutable version. The next VLA iteration reads the same mix; the regression report points at the exact version.
- No re-runs: mining queries the catalog the labeling and extraction passes already produced. Raw bytes stay put.
Recall an episode in seconds. Version every training mix.
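To make "edge cases as queries" concrete, here is a minimal plain-Python sketch of the highway-rain filter running over typed episode records. It is not the DataChain API: the `Episode` fields, the sample rows, and the `s3://fleet/...` URIs are invented for illustration, and a real catalog would be queried in a database rather than a list.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Episode:
    """Hypothetical typed catalog record for one recording."""
    episode_id: str
    road_type: str
    weather: str
    sensor_stack: str
    can_bus_anomalies: int
    uri: str  # pointer to the raw bytes, which are never read by the query


# Invented sample rows standing in for the indexed archive.
CATALOG: List[Episode] = [
    Episode("ep-001", "highway", "rain", "v5", 2, "s3://fleet/ep-001.mcap"),
    Episode("ep-002", "highway", "clear", "v5", 0, "s3://fleet/ep-002.mcap"),
    Episode("ep-003", "urban", "rain", "v5", 1, "s3://fleet/ep-003.mcap"),
    Episode("ep-004", "highway", "rain", "v4", 3, "s3://fleet/ep-004.mcap"),
    Episode("ep-005", "highway", "rain", "v5", 0, "s3://fleet/ep-005.mcap"),
]


def highway_rain_anomalies(
    catalog: Sequence[Episode], sensor_stack: str = "v5"
) -> List[Episode]:
    """Filter on metadata only: road type, weather, stack version, anomaly count."""
    return [
        ep
        for ep in catalog
        if ep.road_type == "highway"
        and ep.weather == "rain"
        and ep.sensor_stack == sensor_stack
        and ep.can_bus_anomalies > 0
    ]


hits = highway_rain_anomalies(CATALOG)
hit_ids = [ep.episode_id for ep in hits]
```

The filter touches only catalog columns; the URIs come back with the matching rows so a later step can fetch exactly those recordings.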
🧠 Brain: archive catalog the team queries
Every recording lands with a typed record: fleet, route, weather, sensor stack version, hardware revisions, timestamps. Every labeling or processing pass adds derived columns (3D boxes, semantic segmentation, calibration, scene metadata). Lineage ties it all together.
- What's there: episodes by fleet, route, weather, sensor version; raw modality counts; storage URIs.
- What's been done: per-episode coverage of 3D boxes, segmentation, tracking; calibration version; QA status.
- What's worth mining: anomaly-flagged episodes, hand-labeled corner cases, regression triggers.
The archive becomes a database. Edge cases become a query.
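The "what's been done" view amounts to a coverage table: one row per episode, one flag per processing pass. The sketch below shows that idea in plain Python under stated assumptions; the pass names (`boxes_3d`, `segmentation`, `tracking`) and the sample rows are hypothetical, and nothing here is DataChain's own schema.

```python
from typing import Dict, List

# Hypothetical names for the derived-column passes mentioned above.
REQUIRED_PASSES = ("boxes_3d", "segmentation", "tracking")

# Invented per-episode coverage flags.
coverage: Dict[str, Dict[str, bool]] = {
    "ep-001": {"boxes_3d": True, "segmentation": True, "tracking": True},
    "ep-002": {"boxes_3d": True, "segmentation": False, "tracking": False},
    "ep-003": {"boxes_3d": False, "segmentation": False, "tracking": False},
}


def missing_passes(cov: Dict[str, Dict[str, bool]]) -> Dict[str, List[str]]:
    """Return, per episode, the passes that have not run yet."""
    gaps: Dict[str, List[str]] = {}
    for episode_id, passes in cov.items():
        missing = [p for p in REQUIRED_PASSES if not passes.get(p, False)]
        if missing:
            gaps[episode_id] = missing
    return gaps


def coverage_rate(cov: Dict[str, Dict[str, bool]], pass_name: str) -> float:
    """Fraction of episodes for which the given pass has completed."""
    done = sum(1 for passes in cov.values() if passes.get(pass_name, False))
    return done / len(cov)


gaps = missing_passes(coverage)
boxes_rate = coverage_rate(coverage, "boxes_3d")
```

A gap report like `gaps` is what would decide which episodes a labeling job still needs to visit.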
🦾 Distributed compute the whole team can use

A laptop can't run a VLM autolabeler across 500K episodes overnight. With DataChain, a curation engineer dispatches the work across 1000+ machines in your VPC: autolabeling, embedding generation, anomaly scanning, training-mix synthesis. The same SDK works from a notebook, a script, or an overnight agent.
- Attach multiple clusters: CPU pools, GPU pools, high-memory pools.
- Scale from 1 to 1000+ machines from the SDK, no extra framework.
- DataChain manages the clusters for you. Workers spin up on demand, spin down when idle.
- Bytes never leave your storage. BYOC by default: compute runs in your AWS, GCP, Azure, or Nebius account, behind your VPC, under your existing IAM.
Your scripts and agents become curation engineers with infrastructure access.
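The fan-out pattern behind "1 to 1000+ machines" is shard, process in parallel, merge. The sketch below shows it on one machine with a thread pool from the Python standard library as a stand-in for a cluster; it is not how DataChain schedules work, and `shard`, `label_shard`, and `run_distributed` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple


def shard(items: Sequence[str], num_shards: int) -> List[List[str]]:
    """Round-robin split so each worker gets a similar share."""
    return [list(items[i::num_shards]) for i in range(num_shards)]


def label_shard(episode_ids: List[str]) -> List[Tuple[str, str]]:
    # Placeholder for a real autolabeler; here the "label" is just a tag.
    return [(ep, f"label-for-{ep}") for ep in episode_ids]


def run_distributed(episode_ids: Sequence[str], num_workers: int = 4) -> Dict[str, str]:
    """Process each shard on its own worker, then merge the results."""
    shards = shard(episode_ids, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in shard order; list() forces completion here.
        results = list(pool.map(label_shard, shards))
    return {ep: label for shard_result in results for ep, label in shard_result}


episodes = [f"ep-{i:03d}" for i in range(10)]
labels = run_distributed(episodes, num_workers=3)
```

Changing `num_workers` changes only how the episode list is split; the merged result is the same, which is the property that lets the same script run on one machine or many.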
🤖 Where Claude Code actually works on fleet data
Software teams move 10× faster with Claude Code, Cursor, and Codex. Curation teams don't: without typed context, agents can't scan a petabyte archive, synthesize a balanced training mix, or write distributed labeling jobs. DataChain gives them the archive catalog, lineage, and clusters they were missing.
- Shared: every agent-mined dataset lands in the team's registry, with source code, parameters, episode IDs, and author attached.
- Mining at agent speed: an agent runs a 30-machine query across the archive to find every night-rain edge case, then assembles a balanced training mix in one session.
- Code generation for curation: vanilla agents can't write distributed labeling pipelines, and 0% of their tasks materialize a reusable dataset. DataChain agents produce typed datasets that scale across clusters; cost-of-failure drops 2.7× (9× on image work).
Claude Code, finally productive on fleet data.
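One way to picture a dataset that lands "with source code, parameters, episode IDs, and author attached" is a provenance record with a content fingerprint. This is a plain-Python sketch under stated assumptions, not DataChain's registry format: `ProvenanceRecord`, its fields, and the sample values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record attached to a mined dataset."""
    dataset_name: str
    author: str
    source_code: str
    parameters: Tuple[Tuple[str, str], ...]  # (key, value) pairs, kept immutable
    episode_ids: Tuple[str, ...]

    def fingerprint(self) -> str:
        # Serialize deterministically, then hash, so identical inputs
        # always yield the same ID and any change yields a different one.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


record = ProvenanceRecord(
    dataset_name="night-rain-mix",
    author="agent-session-01",  # invented author label
    source_code="select(weather='rain', time_of_day='night')",
    parameters=(("max_episodes", "500"),),
    episode_ids=("ep-001", "ep-009"),
)
same = ProvenanceRecord(
    dataset_name="night-rain-mix",
    author="agent-session-01",
    source_code="select(weather='rain', time_of_day='night')",
    parameters=(("max_episodes", "500"),),
    episode_ids=("ep-001", "ep-009"),
)
changed = ProvenanceRecord(
    dataset_name="night-rain-mix",
    author="agent-session-01",
    source_code="select(weather='rain', time_of_day='night')",
    parameters=(("max_episodes", "1000"),),
    episode_ids=("ep-001", "ep-009"),
)
```

Two records built from the same code, parameters, and episodes share a fingerprint; changing a single parameter produces a different one, which makes silent drift between runs visible.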
Production-grade compliance
Fleet data never leaves your perimeter. DataChain is SOC2-certified and deploys as BYOC: compute and storage stay in your AWS, GCP, Azure, or Nebius account, under your IAM. Every labeling, mining, or training-mix pass is traceable; every dataset version immutable. Supplier IP boundaries and data-residency requirements carry through unchanged.
Audit-ready by default.
Customer story: Alps Alpine Europe and the petabyte-archive pattern
Alps Alpine Europe runs DataChain as the data-management layer on top of cloud storage.
“DataChain added real value to our workflows, versioned datasets, automated ETL, and MLOps, all in Python. If you need a data management layer on top of cloud storage, give it a try.”
The same shape generalizes to fleet-scale curation. Picture a perception team curating from a 5 PB MCAP archive that spans a 50-vehicle fleet and 18 months of recordings. With DataChain in place:
- Edge-case query in seconds. “All highway-rain events with CAN-bus anomalies” returns 4,200 episodes from 5 PB. Old workflow: 3-week Spark job plus manual curation.
- One typed catalog across the fleet. Episode metadata, labels, processing history, all queryable from Python.
- Versioned training mixes. Each VLA iteration reads a named, immutable mix; regression reports point at the exact version.
- BYOC. Fleet data stays in the OEM's AWS and Nebius accounts; DataChain orchestrates without touching the bytes.
Petabytes become queryable inventory. Edge cases compound across training iterations.
How DataChain compares to OEM-built data stacks
The Cruise / Waymo pattern: years of in-house build, tens-to-hundreds of data engineers, bespoke labeling pipelines. DataChain is the off-the-shelf alternative for teams that don't want to build a five-year data platform first.
Frequently asked
No, your data does not leave your cloud. DataChain runs as BYOC: compute and storage stay in your AWS, GCP, Azure, or Nebius account. The control plane orchestrates; the data plane stays yours.
DataChain runs in parallel with your recording stack; it is not a replacement. It is offline by design and works after the recording lands in object storage. Streaming substrates (Rerun, ROS, MCAP) own real-time recording; DataChain owns the offline catalog, mining, and training-mix assembly on top.
DataChain is SOC2-certified. For ISO 26262, SOTIF, supplier IP agreements, GDPR, or any other framework your team operates under, BYOC means your existing controls carry through unchanged.
Your team works in the Python they already know, plus the DataChain SDK. Existing customers running curation on petabyte archives onboard their first chains within weeks.
Mining queries the catalog the labeling and extraction passes already produced. No re-runs over raw bytes. "All highway-rain events with CAN-bus anomalies" is a filter, not a Spark job.