Most data tools force a translation step before research can start: copy from S3, reformat, load into a database, build indices, write a pipeline, get an engineer involved. DataChain skips the translation. Files in S3, GCS, and Azure become typed datasets through one Python call, and your team queries them at warehouse speed.
DataChain is the Context Layer for Unstructured Data: the operational substrate for files, documents, and multimodal records. Inside Claude Code, Cursor, and Codex it works as the data harness that pairs with the code harness, extending the same session into typed datasets, persistent storage, and compiled knowledge. Two 2026 papers measure it under Claude Code on a 1,500-document SEC 10-K corpus and a 1,500-image MS-COCO corpus.
Raw files in object storage, typed datasets in one call
The Dataset DB holds your data in place. File pointers live in typed rows under a Pydantic schema; bytes stay in your buckets. Filter, join, similarity search, and column reads run as warehouse-speed columnar queries over millions of records.
- Read storage in one line: dc.read_storage("s3://bucket/..."). Filter, sort, sample, and group instantly.
- Run heavy Python (LLM calls, model inference, multimodal extraction) per row through the Compute Engine: parallel, async I/O, checkpoint recovery.
- Access only the file parts your code needs. Stream a video clip, decode a single MRI slice, or read one Parquet column without pulling the whole file.
No ETL, no data reorganization, no waiting.
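The idea of typed rows that hold file pointers while the bytes stay in place can be illustrated with a small standard-library sketch. This is a toy model, not DataChain's implementation: FileRow, index_files, and read_range are hypothetical names, a local temp directory stands in for a bucket, and a dataclass stands in for the Pydantic schema.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileRow:
    """Hypothetical typed row: a pointer and metadata, never the file bytes."""
    uri: str   # where the bytes live (a local path stands in for s3://...)
    size: int  # size in bytes, captured once at indexing time
    kind: str  # file extension, used here as a simple typed column

    def read_range(self, start: int, length: int) -> bytes:
        """Fetch only the requested byte range instead of the whole file."""
        with open(self.uri, "rb") as fh:
            fh.seek(start)
            return fh.read(length)


def index_files(root: Path) -> list[FileRow]:
    """Build one row per file from metadata alone; no file content is read."""
    return [
        FileRow(uri=str(p), size=p.stat().st_size, kind=p.suffix.lstrip("."))
        for p in sorted(root.iterdir())
        if p.is_file()
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scan.bin").write_bytes(b"HEADER" + bytes(1000))
        (root / "notes.txt").write_bytes(b"hello")

        rows = index_files(root)
        # Filtering touches only the typed columns, not the files.
        large = [r for r in rows if r.size > 100]
        # Partial read: pull six bytes out of a file of about 1 KB.
        print(large[0].kind, large[0].read_range(0, 6))
```

The filter runs over metadata only, and the partial read opens a single file for a single byte range. That is the same split the bullets above describe, at toy scale.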
The cost of being right collapses two orders of magnitude per tier
Re-running the same LLM pass over a 500K-document corpus costs thousands of dollars and takes hours. Querying the materialized result costs fractions of a cent and runs sub-second. DataChain holds the gap open by design.
- Tier 0, raw files (~$100 per recall). Compute Engine over object storage. The only tier that can answer a never-before-asked question.
- Tier 1, datasets (~$1 per recall). Dataset DB over typed records. Filter, join, similarity, column reads.
- Tier 2, summaries (~$0.01 per recall). Knowledge Base markdown over schemas, lineage, and prior conclusions.
10,000× cheaper to recall than to recompute (Tier 0 at ~$100 versus Tier 2 at ~$0.01). Measured: enriching 3,000 multimodal files end-to-end (LLM, VLM, embedding passes) costs $249; querying the materialized result costs under one cent.
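The ratios implied by the quoted tier prices and the measured figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this section; the constant and function names are illustrative, and "under one cent" is treated as an upper bound of one cent.

```python
# Per-recall prices quoted for each tier, in US cents (integers avoid float noise).
TIER_COST_CENTS = {
    "tier0_raw_files": 10_000,  # ~$100
    "tier1_datasets": 100,      # ~$1
    "tier2_summaries": 1,       # ~$0.01
}

# Measured figures: $249 to enrich 3,000 files, under one cent to query the result.
ENRICH_COST_CENTS = 24_900
QUERY_COST_UPPER_BOUND_CENTS = 1


def cost_ratio(expensive_cents: int, cheap_cents: int) -> float:
    """How many times cheaper the second option is than the first."""
    return expensive_cents / cheap_cents


def tier_ratios(costs: dict[str, int]) -> list[float]:
    """Ratio between each tier and the next cheaper one, in listed order."""
    values = list(costs.values())
    return [cost_ratio(a, b) for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    print("per-tier ratios:", tier_ratios(TIER_COST_CENTS))
    print("tier 0 vs tier 2:", cost_ratio(10_000, 1))
    # A lower bound, because the query cost is only known to be under one cent.
    print("measured, at least:", cost_ratio(ENRICH_COST_CENTS, QUERY_COST_UPPER_BOUND_CENTS))
```

Each step down is a factor of 100, so the two steps from raw files to summaries compound to 10,000. The measured pair gives a ratio of at least 24,900.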
Velocity compounds, session after session
Every chain deposits a typed dataset into Dataset DB. The next session starts from that dataset, not from raw bytes.
- 8.4× faster, 3.4× cheaper on the next question. Follow-ups take seconds and cents instead of minutes and dollars.
- 5× compounding savings across five sessions. Each materialized dataset feeds the next.
- Up to 40 percentage points higher pass rate on reuse-rich tasks. Agents that read prior typed datasets get the right answer up to 40 points more often than agents working from raw files.
Research speed comes from inventory, not from brute force.
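The reuse pattern behind these numbers (pay for an expensive pass once, then start later sessions from the stored result) can be sketched with the standard library. This is a toy under stated assumptions, not DataChain's mechanism: enrich and load_or_build are hypothetical names, a JSON file stands in for the Dataset DB, and a word count stands in for an LLM call.

```python
import json
import tempfile
from pathlib import Path


def enrich(text: str, calls: list[int]) -> dict:
    """Stand-in for an expensive per-file pass (for example, an LLM call)."""
    calls.append(1)  # count how often the expensive path actually runs
    return {"n_words": len(text.split())}


def load_or_build(store: Path, docs: dict[str, str], calls: list[int]) -> dict[str, dict]:
    """Return materialized rows if they exist; otherwise compute and persist them."""
    if store.exists():
        return json.loads(store.read_text())
    rows = {name: enrich(text, calls) for name, text in docs.items()}
    store.write_text(json.dumps(rows))
    return rows


if __name__ == "__main__":
    docs = {"a.txt": "risk factors and outlook", "b.txt": "quarterly revenue"}
    calls: list[int] = []
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "enriched.json"
        first = load_or_build(store, docs, calls)   # session 1: pays for every file
        second = load_or_build(store, docs, calls)  # session 2: reads the stored rows
        print(len(calls), first == second)
```

The second call does no enrichment work at all. In this toy, that is the whole source of the speed and cost gap between a first question and a follow-up.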