DataChain Blog

Find DataChain news, findings, interesting reads, community takeaways, and deep dives into machine learning workflows, from data versioning and processing to model productionization.

OpenAI's Data Agent and the S3 Gap
OpenAI built their in-house data agent for structured warehouse data, where schema, lineage, and queries come for free. Files in S3, GCS, or Azure - videos, sensor logs, image corpora, PDFs - have none of that, and the problems get a lot more interesting. Here is how we built the four foundations that close the gap.
  • Dmitry Petrov
  • May 07, 2026
  • 10 min read
The Neuro-Data Bottleneck: Why Neuro-AI Interfacing Breaks the Modern Data Stack
Neural data like EEG and MRI is never 'finished' - it's meant to be revisited as new ideas and methods emerge. Yet most teams are stuck in a multi-stage ETL nightmare. Here's why the modern data stack fails the brain.
  • Dmitry Petrov
  • Jan 23, 2026
  • 5 min read
Parquet Is Great for Tables, Terrible for Video - Here's Why
Parquet is great for tables, terrible for images and video. Here's why shoving heavy data into columnar formats is the wrong approach - and what we should build instead. Hint: it's not about the formats, it's about the metadata.
  • Dmitry Petrov
  • Sep 03, 2025
  • 5 min read
From Big Data to Heavy Data: Rethinking the AI Stack
LLMs can finally interpret unstructured video, audio, and documents — but they can't do it alone. This post introduces the concept of heavy data and explores how modern teams build multimodal pipelines to turn it into AI-ready data.
  • Dmitry Petrov
  • Jun 09, 2025
  • 3 min read
As GenAI Fever Fades - Time to Prioritize Robust Engineering Over Overblown Promises
Improved engineering and data management will be what carries GenAI into maturity.
  • Dmitry Petrov
  • Oct 23, 2024
  • 3 min read
Scalable PDF Document Processing with DataChain and Unstructured.io
Extract and parse text from documents and create vector embeddings in a scalable, distributed way (and in fewer than 70 lines of code).
  • Tibor Mach
  • Sep 30, 2024
  • 7 min read
Post-modern AI Data Stack
How and why generative AI will change the modern data stack.
  • Daniel Kharitonov
  • Sep 24, 2024
  • 7 min read