DataChain Blog

Find DataChain and DVC news, findings, interesting reads, community takeaways, and deep dives into machine learning workflows, from data versioning and processing to model productionization.

From Big Data to Heavy Data: Rethinking the AI Stack

LLMs can finally interpret unstructured video, audio, and documents — but they can't do it alone. This post introduces the concept of heavy data and explores how modern teams build multimodal pipelines to turn it into AI-ready data.

Dmitry Petrov
Jun 09, 2025 • 3 min read

As GenAI Fever Fades - Time to Prioritize Robust Engineering Over Overblown Promises

Improved engineering and data management will be what carries GenAI into maturity.

Dmitry Petrov
Oct 23, 2024 • 3 min read

Scalable PDF Document Processing with DataChain and Unstructured.io

Extract and parse text from documents and create vector embeddings in a scalable and distributed way (and in fewer than 70 lines of code).

Tibor Mach
Sep 30, 2024 • 7 min read

Post-modern AI Data Stack

How and why generative AI will change the modern data stack.

Daniel Kharitonov
Sep 24, 2024 • 7 min read

You Do the Math: Fine Tuning Multimodal Models (CLIP) to Match Cartoon Images to Joke Captions

Learn how to fine-tune multimodal models like CLIP to match images to text captions.

Dave Berenbaum
Sep 12, 2024 • 9 min read

Enforcing JSON Outputs in Commercial LLMs

The results of our tests on the structured outputs of Google Gemini Pro, Anthropic Claude, and OpenAI GPT. DataChain was used for evaluation.

Daniel Kharitonov
Sep 06, 2024 • 10 min read

Announcing DataChain

Introducing DataChain, a new open-source tool to curate and process unstructured data using local ML models and LLM calls.

Dmitry Petrov
Jul 23, 2024 • 4 min read
