Announcing DataChain

We are introducing DataChain - a new open-source tool that complements DVC, extending it to data preparation and dataset curation via local ML models and LLM API calls.

  • Dmitry Petrov
  • July 23, 2024 · 4 min read

DataChain: ⭐ the repo at https://github.com/iterative/datachain

As the DVC team, we are fortunate to witness the recent transformations in ML and AI firsthand, and to admire the rise of unstructured data together with the great libraries built to handle it.

The tidal wave of new applications demanding access to images, video, audio, text, PDF documents, MRI scans, and other media types introduces a new challenge - unstructured data preparation and curation. It is no longer enough to designate and version a file folder as a dataset; the modern data stack demands using AI to create features and annotations, and forming datasets from those features.

AI’s new appetite for data

While data has long been touted as being on the critical path to building AI, below are the novel requirements that we believe are here to stay:

🤖 AI-Driven Data Curation: Interest in data curation assisted by AI models is growing rapidly. Techniques like LLMs judging their own outputs, or OpenAI’s DALL-E 3 curating its own dataset, are becoming common practice and frequently separate the good products from the great ones.

🚀 Dataset scaling: In many GenAI use cases, datasets of millions or billions of images or text snippets are becoming routine. Wrangling and versioning GenAI-era datasets requires an entirely new approach on the backend.

🐍 Python ubiquity: As Python emerged as the clear language of choice for AI workloads, it became painfully obvious that Python objects must be treated as first-class citizens in AI datasets. Using JSON to store every possible feature no longer cuts it.

What DataChain can do for you

We created DataChain to answer these challenges. Our high-level vision of serving the modern AI data stack drives the key product capabilities:

  1. Read data from cloud storage (S3/GCS/Azure) or locally, and create persistent, versioned datasets with samples defined as sparse references to files or to objects inside files. Examples are features within Parquet or .tar archives, tiles within bounding boxes in images, snippets within text blobs, etc.
  2. Define data models in Python using Pydantic and store features as validated data objects with automatic serialization/deserialization (see the sketch after this list).
  3. Apply transformations to data using local ML models, external LLM calls, or custom Python code, coupled with the ability to chain and optimize stages for execution.
  4. Run inference code efficiently, in parallel and out of core, handling millions of files even on a laptop.
  5. Use embedded databases under the hood (SQLite in the open-source version) to store Python objects efficiently and to execute vectorized operations (such as similarity search) and analytical calls. DataChain hides all the complexity of dealing with databases.
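
As a sketch of capability 2, a custom feature type is just a Pydantic-style class; the model and field names below are hypothetical, chosen only for illustration:

from datachain import DataModel

# Hypothetical schema. DataModel builds on Pydantic's BaseModel, so
# fields are validated and serialized/deserialized automatically.
class DialogRating(DataModel):
    score: float = 0.0
    label: str = ""

Objects like this can be returned from any mapped function and are stored as structured, queryable records rather than opaque JSON.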

The result is a modern, Python-friendly data-wrangling abstraction that can compose actions like AI inference and metric computation at scale, and doubles as a building block for your data curation routine or as a way to evaluate an existing AI application.

We believe that DataChain will serve as a solid foundation for new and upcoming unstructured data-wrangling libraries, as well as custom AI-driven curation solutions.

Typical use examples

$ pip install datachain

An LLM judging LLM dialogues

from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatCompletionResponse, ChatMessage
from datachain import File, DataChain

PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

def eval_dialogue(file: File) -> ChatCompletionResponse:
    client = MistralClient()
    return client.chat(
        model="open-mixtral-8x22b",
        messages=[ChatMessage(role="system", content=PROMPT),
                  ChatMessage(role="user", content=file.read())])

chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
    .map(response=eval_dialogue)
    .save("mistral_dataset")
)
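
Note that when MistralClient is constructed without an api_key argument, it reads credentials from the MISTRAL_API_KEY environment variable, so no API key needs to appear in the code.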

Auto-deserializing LLM responses to Pydantic

from datachain import DataChain, DataModel
from mistralai.models.chat_completion import ChatCompletionResponse

DataModel.register(ChatCompletionResponse)
chain = DataChain.from_dataset("mistral_dataset")

# Iterating lazily, one record at a time: the dataset does not have to fit in memory
for file, response in chain.iterate("file", "response"):
    # You work with Python objects
    assert isinstance(response, ChatCompletionResponse)

    status = response.choices[0].message.content
    print(f"{file.get_uri()}: {status}")

Output:

gs://datachain-demo/chatbot-KiT/1.txt: Success
gs://datachain-demo/chatbot-KiT/10.txt: Success
gs://datachain-demo/chatbot-KiT/11.txt: Failure
...

Vectorized operations over Pydantic data fields

In some cases, deserialization into a Python object can be skipped, and the computation can be executed directly within the underlying database.

# input tokens price: $2 per 1M tokens
# output tokens price: $6 per 1M tokens
cost = chain.sum("response.usage.prompt_tokens")*0.000002 \
      + chain.sum("response.usage.completion_tokens")*0.000006

print(f"Spent ${cost:.2f} on {chain.count()} calls")

Output:

Spent $0.08 on 50 calls
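
Aggregation is not the only operation that can run inside the database; filtering can be vectorized too. Below is a minimal sketch, assuming the Column helper and the filter method from DataChain, with an arbitrary token threshold:

from datachain import Column

# Keep only dialogues with long prompts; the comparison runs in the
# database, with no Python-side deserialization involved.
long_dialogs = chain.filter(Column("response.usage.prompt_tokens") > 1000)
print(long_dialogs.count())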

Annotating cloud images with a local model

Computer vision use cases are especially sensitive to data-download performance.

import io
from PIL import Image
from datachain import DataChain, File
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")

def process_image(file: File) -> str:
    # Caption a single image with a local PaliGemma model
    image = Image.open(io.BytesIO(file.read())).convert("RGB")
    inputs = processor(text="caption", images=image, return_tensors="pt")
    generate_ids = model.generate(**inputs, max_new_tokens=100)
    return processor.batch_decode(generate_ids, skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]

chain = (
    DataChain.from_storage("gs://datachain-demo/newyorker_caption_contest/images")
    .limit(5)
    .map(scene=process_image)
    .save("image_captions")
)
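
To inspect the results, the saved dataset can be read back with the same calls used in the Mistral example above:

from datachain import DataChain

captions = DataChain.from_dataset("image_captions")
for file, scene in captions.iterate("file", "scene"):
    print(f"{file.get_uri()}: {scene}")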

Optimizations: parallelization and data caching

Parallel execution and data caching play a critical role in an efficient data curation process.

chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                           object_name="file")
    .settings(parallel=4, cache=True)
    .limit(5)
    .map(response=eval_dialogue)
    .save("mistral_dataset")
)

DataChain needs your feedback!

As is usual in open source, we provide this software for free, but we also depend on help from our community.

First and foremost, we want you to try the product, let us know if it fits your data routine, and report any bugs or deficiencies you encounter. DataChain is still early in its lifecycle, so your feedback is crucial to the future of this product.

Second, we need contributors! If you see a missing feature or an application for DataChain that could be built as an extension, we would be happy to see a pull request from you.

Last, but not least – if you find DataChain useful, give us a star ⭐ It goes a long way for us.

⭐️ Star the repo!
