DataChain for Physical AI & Robotics

An hour of fleet recording costs hardware, fuel, and real-world time. The dataset you assemble from it can't be re-collected. Yet most teams can't query their own archive to find the edge cases that matter for the next model iteration. Petabytes of MCAP, ROS bags, video, lidar, and CAN-bus logs sit in object storage, opaque once the upload completes. DataChain catalogs what's in the archive, mines edge cases on demand, and versions every training mix you assemble.

Same SDK from Python, Jupyter, or Claude Code / Cursor / Codex.

Works with your stack

Clouds: AWS · GCP · Azure · Nebius
Tools: Python · Jupyter · CI/CD
AI Agents: Claude Code · Cursor · Codex

How it works

Step 1

Catalog your archive

Point at S3, GCS, Azure, or Nebius. MCAP, ROS bags, parquet, and modality files stay where they are; DataChain indexes metadata, lineage, and processing history.

Step 2

Mine and label

Query the catalog to find episodes by route, weather, or sensor version. Dispatch labeling and embedding across 1 to 1000+ machines in your VPC.

Step 3

Version training mixes

Every assembled mix lands in Dataset DB as a named, immutable version. The next VLA iteration reads what exists; regression reports point at the exact version.
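
To make "named, immutable version" concrete, here is a minimal sketch in plain Python. It is not the DataChain SDK: `MixVersion`, `MixRegistry`, and the episode IDs are hypothetical names invented for illustration. The sketch assumes that saving under an existing name appends a new version rather than overwriting the old one.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MixVersion:
    """One immutable snapshot of a training mix (hypothetical type)."""
    name: str
    version: int
    episode_ids: tuple[str, ...]


class MixRegistry:
    """Toy in-memory stand-in for a dataset registry (hypothetical)."""

    def __init__(self) -> None:
        self._versions: dict[str, list[MixVersion]] = {}

    def save(self, name: str, episode_ids: list[str]) -> MixVersion:
        # Saving never overwrites: each call appends the next version number.
        history = self._versions.setdefault(name, [])
        snapshot = MixVersion(name, len(history) + 1, tuple(episode_ids))
        history.append(snapshot)
        return snapshot

    def read(self, name: str, version: int | None = None) -> MixVersion:
        # Default to the latest version; pin a number for reproducibility.
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]


registry = MixRegistry()
v1 = registry.save("night-rain-mix", ["ep-001", "ep-007"])
v2 = registry.save("night-rain-mix", ["ep-001", "ep-007", "ep-042"])

# A regression report can pin the exact version it was trained on.
pinned = registry.read("night-rain-mix", version=1)
```

Pinning `version=1` returns the original two-episode mix even after a later save, which is the property a regression report relies on.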

🎯 Retroactive mining: recall episodes without re-running the fleet

Petabyte archives are write-once, read-rarely. Every model iteration needs new training mixes; every regression needs the edge cases that surfaced last month. DataChain indexes what's recorded in the archive so you can query it like a database. “All rain events on highways with sensor stack v5 and CAN-bus anomalies” becomes a 30-second filter, not a 3-week re-process.

Edge cases as queries: filter, join, and similarity-search across millions of episodes by route, weather, sensor version, label coverage, or anomaly signature.
Versioned training mixes: each assembled mix lands in Dataset DB as a named, immutable version. The next VLA iteration reads the same mix; the regression report points at the exact version.
No re-runs: mining queries the catalog the labeling and extraction passes already produced. Raw bytes stay put.
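
The "edge cases as queries" idea can be sketched with Python's built-in `sqlite3` as a stand-in for the catalog. This is an illustration only, not DataChain's query interface: the table names, column names, and rows are all assumptions. It shows that once metadata and anomaly flags are indexed, the rain-on-highway question is a filter plus a join, with no raw bytes read.

```python
import sqlite3

# In-memory database standing in for the archive catalog (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE episodes (
        episode_id   TEXT PRIMARY KEY,
        road_type    TEXT,
        weather      TEXT,
        sensor_stack TEXT
    );
    CREATE TABLE can_anomalies (episode_id TEXT, signal TEXT);
""")
conn.executemany("INSERT INTO episodes VALUES (?, ?, ?, ?)", [
    ("ep-001", "highway", "rain",  "v5"),
    ("ep-002", "highway", "clear", "v5"),
    ("ep-003", "urban",   "rain",  "v5"),
    ("ep-004", "highway", "rain",  "v4"),
    ("ep-005", "highway", "rain",  "v5"),
])
conn.executemany("INSERT INTO can_anomalies VALUES (?, ?)", [
    ("ep-001", "brake_pressure"),
    ("ep-003", "wheel_speed"),
    ("ep-004", "steering_angle"),
])

# "All rain events on highways with sensor stack v5 and CAN-bus anomalies."
rows = conn.execute("""
    SELECT DISTINCT e.episode_id
    FROM episodes AS e
    JOIN can_anomalies AS a ON a.episode_id = e.episode_id
    WHERE e.road_type = 'highway'
      AND e.weather = 'rain'
      AND e.sensor_stack = 'v5'
    ORDER BY e.episode_id
""").fetchall()
edge_cases = [row[0] for row in rows]
```

Only `ep-001` satisfies all four conditions here; `ep-005` matches the metadata filter but has no anomaly row, so the join drops it.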

Recall an episode in seconds. Version every training mix.

🧠 Brain: archive catalog the team queries

Every recording lands with a typed record: fleet, route, weather, sensor stack version, hardware revisions, timestamps. Every labeling or processing pass adds derived columns (3D boxes, semantic segmentation, calibration, scene metadata). Lineage threads it all.

What's there: episodes by fleet, route, weather, sensor version; raw modality counts; storage URIs.
What's been done: per-episode coverage of 3D boxes, segmentation, tracking; calibration version; QA status.
What's worth mining: anomaly-flagged episodes, hand-labeled corner cases, regression triggers.
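
One way to picture a typed record that accumulates derived columns is the sketch below. The class name, field names, pass names, and `s3://example-bucket/...` URIs are hypothetical and chosen only for illustration; the real catalog schema may differ. Each processing pass registers itself on the record, so "what's been done" becomes a simple coverage check.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EpisodeRecord:
    """Hypothetical typed record for one recording."""
    episode_id: str
    fleet: str
    route: str
    weather: str
    sensor_stack: str
    uri: str  # the raw bytes stay at this location
    passes: dict[str, str] = field(default_factory=dict)  # pass name -> pass version
    lineage: list[str] = field(default_factory=list)      # ordered processing history


def add_pass(record: EpisodeRecord, pass_name: str, version: str) -> None:
    # Record the derived column and append to the lineage trail.
    record.passes[pass_name] = version
    record.lineage.append(f"{pass_name}@{version}")


def missing(records: list[EpisodeRecord], pass_name: str) -> list[str]:
    # Coverage check: which episodes have not been through this pass yet?
    return [r.episode_id for r in records if pass_name not in r.passes]


catalog = [
    EpisodeRecord("ep-001", "fleet-a", "A9", "rain", "v5",
                  "s3://example-bucket/fleet-a/ep-001.mcap"),
    EpisodeRecord("ep-002", "fleet-a", "A9", "clear", "v5",
                  "s3://example-bucket/fleet-a/ep-002.mcap"),
]
add_pass(catalog[0], "boxes_3d", "1.2")
add_pass(catalog[0], "segmentation", "0.9")
add_pass(catalog[1], "boxes_3d", "1.2")

needs_segmentation = missing(catalog, "segmentation")
```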

The archive becomes a database. Edge cases become a query.

🦾 Distributed compute the whole team can use

[Figure: DataChain Studio clusters dashboard showing CPU and GPU pools attached to a workspace]

A laptop can't run a VLM autolabeler across 500K episodes overnight. With DataChain, a curation engineer dispatches across 1000+ machines in your VPC: autolabeling, embedding generation, anomaly scanning, training-mix synthesis. Same SDK from notebook, script, or overnight agent.

Attach multiple clusters: CPU pools, GPU pools, high-memory pools.
Scale from 1 to 1000+ machines from the SDK, no extra framework.
DataChain manages the clusters for you. Workers spin up on demand, spin down when idle.
Bytes never leave your storage. BYOC by default: compute runs in your AWS, GCP, Azure, or Nebius account, behind your VPC, under your existing IAM.
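
The fan-out pattern described above can be sketched locally with the standard library. This is a single-machine stand-in that uses threads, not the DataChain cluster API: `autolabel`, `dispatch`, and the labeling rule are placeholders invented for illustration. The point is the shape of the job, one function mapped over many episodes with a configurable worker count.

```python
from concurrent.futures import ThreadPoolExecutor


def autolabel(episode_id: str) -> dict:
    # Placeholder for a real model call; the odd/even rule is arbitrary.
    index = int(episode_id.split("-")[1])
    return {"episode_id": episode_id, "label": "rain" if index % 2 else "clear"}


def dispatch(episode_ids: list[str], workers: int = 4) -> list[dict]:
    # Map one function over many inputs; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(autolabel, episode_ids))


episode_ids = [f"ep-{i:03d}" for i in range(1, 7)]
results = dispatch(episode_ids, workers=4)
```

In a real deployment the worker pool would be remote machines rather than local threads, but the calling code keeps the same map-over-episodes structure.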

Your scripts and agents become curation engineers with infrastructure access.

🤖 Where Claude Code actually works on fleet data

Software teams multiply their output 10× with Claude Code, Cursor, and Codex. Curation teams don't: agents can't scan a petabyte archive, can't synthesize a balanced training mix, and can't write distributed labeling jobs without typed context. DataChain gives them the archive catalog, lineage, and clusters they were missing.

Shared: every agent-mined dataset lands in the team's registry, with source code, parameters, episode IDs, and author attached.
Mining at agent speed: an agent runs a 30-machine query across the archive to find every night-rain edge case, then assembles a balanced training mix in one session.
Code generation for curation: vanilla agents can't write distributed labeling pipelines, and 0% of their tasks materialize a reusable dataset. DataChain agents produce typed datasets that scale across clusters; cost-of-failure drops 2.7× (9× on image work).
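
A registry entry that carries source code, parameters, episode IDs, and author could look like the sketch below. The function name, field names, and fingerprint scheme are assumptions made for illustration, not DataChain's actual registry format. Hashing the source and the canonicalized entry gives two runs with the same inputs the same fingerprint.

```python
import hashlib
import json


def provenance(source_code: str, params: dict, episode_ids: list[str], author: str) -> dict:
    """Build a hypothetical provenance entry for an agent-mined dataset."""
    entry = {
        "source_sha256": hashlib.sha256(source_code.encode("utf-8")).hexdigest(),
        "params": params,
        "episode_ids": sorted(episode_ids),  # order-insensitive
        "author": author,
    }
    # Fingerprint the canonical JSON form so identical inputs hash identically.
    canonical = json.dumps(entry, sort_keys=True).encode("utf-8")
    entry["fingerprint"] = hashlib.sha256(canonical).hexdigest()
    return entry


query_source = "weather == 'rain' and time_of_day == 'night'"
first = provenance(query_source, {"limit": 1000}, ["ep-042", "ep-001"], "agent-session-1")
same = provenance(query_source, {"limit": 1000}, ["ep-001", "ep-042"], "agent-session-1")
changed = provenance(query_source, {"limit": 500}, ["ep-001", "ep-042"], "agent-session-1")
```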

Claude Code, finally productive on fleet data.

Production-grade compliance

Fleet data never leaves your perimeter. DataChain is SOC 2-certified and deploys as BYOC: compute and storage stay in your AWS, GCP, Azure, or Nebius account, under your IAM. Every labeling, mining, or training-mix pass is traceable; every dataset version is immutable. Supplier IP boundaries and data-residency requirements carry through unchanged.

Audit-ready by default.

Customer story: Alps Alpine Europe and the petabyte-archive pattern

Alps Alpine Europe runs DataChain as the data-management layer on top of cloud storage.

“DataChain added real value to our workflows, versioned datasets, automated ETL, and MLOps, all in Python. If you need a data management layer on top of cloud storage, give it a try.”
Nikhilesh Saggere | Lead Engineer, Alps Alpine Europe

The same pattern generalizes to fleet-scale curation. Consider a perception team curating a 5 PB MCAP archive from a 50-vehicle fleet and 18 months of recordings. With DataChain in place:

Edge-case query in seconds. “All highway-rain events with CAN-bus anomalies” returns 4,200 episodes from 5 PB. Old workflow: 3-week Spark job plus manual curation.
One typed catalog across the fleet. Episode metadata, labels, processing history, all queryable from Python.
Versioned training mixes. Each VLA iteration reads a named, immutable mix; regression reports point at the exact version.
BYOC. Fleet data stays in the OEM's AWS and Nebius accounts; DataChain orchestrates without touching the bytes.

Petabytes become queryable inventory. Edge cases compound across training iterations.

How DataChain compares to OEM-built data stacks

The Cruise / Waymo pattern: years of in-house build, tens-to-hundreds of data engineers, bespoke labeling pipelines. DataChain is the off-the-shelf alternative for teams that don't want to build a five-year data platform first.

Capability | OEM-built data stack | DataChain
Sensor sync | Custom abstraction per sensor (video, lidar, IMU, CAN) | Native typed records, multi-rate timestamps preserved
Edge-case mining | Bespoke labeling DB + curation scripts | Catalog query: filter, join, similarity
Training mix versioning | S3 naming convention | Immutable Dataset DB versions with full lineage
Curation self-service | Wait on data infrastructure team | Run from notebook, script, or AI agent
Distributed compute | Custom Spark on Kubernetes | 1 to 1000+ machines from the SDK
Time to first usable dataset | Years of in-house build | Weeks
BYOC / data residency | Self-hosted infrastructure | Control plane only; your AWS, GCP, Azure, or Nebius

Frequently asked

Do my fleet bytes ever leave my cloud account?

No. DataChain runs as BYOC: compute and storage stay in your AWS, GCP, Azure, or Nebius account. The control plane orchestrates; the data plane stays yours.

How does this work with Rerun, ROS, or other recording tools?

Complementary, not a replacement. DataChain is offline by design and works after the recording lands in object storage. Recording tools and formats (Rerun, ROS, MCAP) own real-time capture; DataChain owns the offline catalog, mining, and training-mix assembly on top.

What compliance certifications do you have?

SOC 2. For ISO 26262, SOTIF, supplier IP agreements, GDPR, or any other framework your team operates under, BYOC means your existing controls carry through unchanged.

Can our curation team adopt this without a long learning curve?

It's Python they already know plus the DataChain SDK. Existing customers running curation on petabyte archives onboard their first chains within weeks.

How do we mine edge cases retroactively?

Mining queries the catalog the labeling and extraction passes already produced. No re-runs over raw bytes. "All highway-rain events with CAN-bus anomalies" is a filter, not a Spark job.

Mine Edge Cases from Petabytes of Fleet Recordings