One-Click Reproducibility for Every Dataset

Lineage, versioning, and authorship are not features. They are structural properties of how DataChain stores results.

Every dataset DataChain creates carries its complete provenance automatically: input dataset versions, source code (verbatim), parameters, author, and timestamp. There is no separate metadata system to maintain, no manual logging step, no forgotten experiments. Audit-ready means audit-derivable from the storage itself.

DataChain is the Context Layer for Unstructured Data: the operational substrate for files, documents, and multimodal records. Inside Claude Code, Cursor, and Codex it works as the data harness that pairs with the code harness. Measured under Claude Code in DL4C @ ICML 2026: every materialized result carries schema, version, and lineage on disk, ready for human oversight.

Lineage is structural, not optional

A store that cannot explain how its records were produced is a store that gets rebuilt from scratch. Dataset DB captures four things on every .save():

  • Dependencies. Parent dataset versions and storage URIs for every input file.
  • Source code. The full script that produced the dataset, stored verbatim.
  • Author. The person or service account that ran the chain.
  • Creation time. Exact timestamp of materialization.

Nothing is declared. Everything is recorded as a structural consequence of running the chain. This is what separates Dataset DB from catalog-style systems that drift the moment work moves faster than maintenance.

Two-level identity: name plus immutable version

A dataset has a durable name and an immutable version. Names organize. Versions audit.

  • Append a new version when a chain re-runs; prior versions never change.
  • Diff two versions on schema, row count, or column values without re-running the producing chain.
  • Pin downstream chains to a specific upstream version so reproducibility is deterministic across teammates and across time.

One-click reconstruction

The full chain that produced a dataset (parents, code, parameters, environment) is queryable from its lineage. Re-running it deposits a new, identical version. Auditors get a defensible reconstruction; engineers get a reproducible debug.

  • View lineage as a graph in Studio. Walk back to source storage URIs and the script that ran.
  • Re-execute against the original inputs or against a newer upstream version, with delta updates only over what changed.
  • Compare results across versions to localize regressions or document corpus drift.

Reproducibility is a property of how data is stored, not a checklist.