Product updates, news, tutorials, integrations, and deep dives.

Tutorial

Scalable PDF Document Processing with DataChain and Unstructured.io
Extract and parse text from documents and create vector embeddings in a scalable, distributed way (in fewer than 70 lines of code).
  • Tibor Mach
  • Sep 30, 2024
  • 7 min read
You Do the Math: Fine Tuning Multimodal Models (CLIP) to Match Cartoon Images to Joke Captions
Learn how to fine-tune multimodal models like CLIP to match images to text captions.
  • Dave Berenbaum
  • Sep 12, 2024
  • 9 min read
Announcing DataChain
Introducing DataChain - a new open-source tool to curate and process unstructured data using local ML models and LLM calls.
  • Dmitry Petrov
  • Jul 23, 2024
  • 4 min read

DataChain

OpenAI's Data Agent and the S3 Gap
OpenAI built their in-house data agent for structured warehouse data, where schema, lineage, and queries come for free. Files in S3, GCS, or Azure - videos, sensor logs, image corpora, PDFs - have none of that, and the problems get a lot more interesting. Here is how we built the four foundations that close the gap.
  • Dmitry Petrov
  • May 07, 2026
  • 10 min read
As GenAI Fever Fades - Time to Prioritize Robust Engineering Over Overblown Promises
Improved engineering and data management will be what carries GenAI into maturity.
  • Dmitry Petrov
  • Oct 23, 2024
  • 3 min read
Scalable PDF Document Processing with DataChain and Unstructured.io
Extract and parse text from documents and create vector embeddings in a scalable, distributed way (in fewer than 70 lines of code).
  • Tibor Mach
  • Sep 30, 2024
  • 7 min read