DataChain Runs Where Your Data Lives

Buckets, VPC, and IAM stay unchanged. DataChain orchestrates compute over your data; it never moves it.

Raw data does not move. Bytes stay in your buckets, behind your VPC, under your existing IAM policies. DataChain runs as orchestration over your existing infrastructure: define clusters, run chains, get back typed datasets. Compliance becomes a property of the deployment, not a separate workflow.

DataChain is the Context Layer for Unstructured Data, and unlike warehouse-backed context layers it never ingests your bytes. Files stay in object storage; only typed records, schemas, and lineage live in Dataset DB.

Bytes never leave your storage

Dataset DB stores file pointers, schemas, and derived columns. Raw bytes stay in S3, GCS, Azure, or on-prem object storage. Compute reads bytes directly from your storage layer and writes typed records back to Dataset DB. No staging copies, no ingestion pipelines, no data warehouse to govern separately.

  • File pointers, not file copies. Every dataset row references its source by URI.
  • Selective file-part access. Stream a clip, decode one slice, read one column.
  • No extra egress. Reading through DataChain costs what reading the file directly would have cost, because nothing is copied first.
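
The pointer-and-partial-read idea can be sketched in a few lines of plain Python. This is a minimal illustration and not DataChain's API: `FilePointer`, `read_range`, and `make_demo_pointer` are hypothetical names, and a local temp file stands in for an object in a bucket. On object storage, the equivalent of the seek-and-read below is a ranged GET.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class FilePointer:
    """Hypothetical dataset row: a reference to a file, not its contents."""
    uri: str   # where the bytes live (a local path here, standing in for s3://...)
    size: int  # metadata recorded once, so listing rows never touches the bytes


def read_range(pointer: FilePointer, start: int, length: int) -> bytes:
    """Read only bytes [start, start + length) from the referenced file."""
    with open(pointer.uri, "rb") as f:
        f.seek(start)
        return f.read(length)


def make_demo_pointer() -> FilePointer:
    """Create a small stand-in 'object' and return a pointer to it."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(b"HEADER" + b"x" * 1000 + b"FOOTER")
    return FilePointer(uri=path, size=os.path.getsize(path))


if __name__ == "__main__":
    ptr = make_demo_pointer()
    try:
        print(read_range(ptr, 0, 6))             # first six bytes only
        print(read_range(ptr, ptr.size - 6, 6))  # last six bytes only
    finally:
        os.remove(ptr.uri)
```

The row itself carries only the URI and a size, so a dataset of millions of rows stays small, and the bytes in the middle of the file are never read unless a chain asks for them.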

No data movement, no duplication, no second copy to govern.

Compute follows the data, not the other way around

DataChain orchestrates compute over the clusters you operate. Define a CPU pool, a GPU pool, and a high-memory pool, and chains route to the right pool automatically. Scale from one machine to 25, or to 1,000+, all inside your own VPC.

  • CPU, GPU, and high-memory clusters defined as code.
  • Async I/O and checkpoint recovery built into every chain. Failures resume from the last successful stage, not from scratch.
  • Distributed compute is a property of the chain primitive, not a separate framework.
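
To make "resume from the last successful stage" concrete, here is a minimal stdlib-only sketch of stage-level checkpointing. It is an illustration of the general technique under simple assumptions (JSON-serializable data, one checkpoint file per chain), not DataChain's implementation; `run_chain` and `demo` are hypothetical names.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[list], list]]


def run_chain(stages: List[Stage], data: list, checkpoint_path: str) -> list:
    """Run stages in order, persisting the output of each successful stage."""
    state = {"done": 0, "data": data}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # pick up after the last successful stage

    for index in range(state["done"], len(stages)):
        _name, fn = stages[index]
        result = fn(state["data"])  # if this raises, the old checkpoint survives
        state = {"done": index + 1, "data": result}
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)  # swap in the new checkpoint
    return state["data"]


def demo() -> Tuple[list, dict]:
    """Fail once in stage 2, then rerun and count how often each stage ran."""
    calls = {"double": 0, "flaky_add": 0}

    def double(xs: list) -> list:
        calls["double"] += 1
        return [x * 2 for x in xs]

    def flaky_add(xs: list) -> list:
        calls["flaky_add"] += 1
        if calls["flaky_add"] == 1:
            raise RuntimeError("simulated worker failure")
        return [x + 1 for x in xs]

    stages = [("double", double), ("flaky_add", flaky_add)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "chain.ckpt.json")
        try:
            run_chain(stages, [1, 2, 3], path)
        except RuntimeError:
            pass  # first attempt dies in stage 2
        result = run_chain(stages, [1, 2, 3], path)  # resumes at stage 2
    return result, calls


if __name__ == "__main__":
    print(demo())
```

In the demo the first stage runs exactly once even though the chain is started twice: the second run loads the checkpoint and goes straight to the stage that failed.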

Cross-cloud, cross-region, on-prem

Multi-cloud and multi-region setups are first-class. Storage in one cloud, compute in another, results in a third, all linked through the same Dataset DB. This suits collaboration across institutional boundaries (universities, hospitals, regulated industries), where data residency is non-negotiable.

  • Storage in your region, compute in any region you control.
  • Studio runs as a control plane only. Your data plane stays yours.
  • Audit trail in Dataset DB covers what ran where, by whom, on which version.
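
As a rough picture of what "what ran where, by whom, on which version" could look like as data, here is a small sketch. The record shape, field names, and the `name@version` notation are assumptions made for illustration; they are not Dataset DB's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry; field names are illustrative only."""
    chain: str            # what ran
    cluster: str          # where it ran
    user: str             # who ran it
    dataset_version: str  # which dataset version it ran on


def runs_by_user(log: List[AuditRecord], user: str) -> List[AuditRecord]:
    """Return every recorded run started by the given user."""
    return [record for record in log if record.user == user]


# Made-up entries, purely to show the shape of a query.
LOG = [
    AuditRecord("embed-images", "gpu-pool-eu", "alice", "scans@v3"),
    AuditRecord("dedupe", "cpu-pool-eu", "bob", "scans@v3"),
    AuditRecord("embed-images", "gpu-pool-eu", "alice", "scans@v4"),
]

if __name__ == "__main__":
    for record in runs_by_user(LOG, "alice"):
        print(record.chain, record.cluster, record.dataset_version)
```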

Your data plane stays yours. The control plane gets out of the way.