News
Dmitry on Software Engineering Daily
Our CEO Dmitry Petrov was interviewed on the much-beloved Software Engineering Daily podcast! Host Jeff Meyerson kicked off the discussion:
Code is version controlled through Git, the version control system originally built to manage the Linux codebase. For decades, software has been developed using git for version control. More recently, data engineering has become an unavoidable facet of software development. It is reasonable to ask–why are we not version controlling our data?
For the rest of the episode, listen here!
Data Version Control with Dmitry Petrov
Contributor's meetup
Last week, we held a meetup for contributors to DVC! Core maintainer Ruslan Kupriev hosted a get-together for folks who contribute new features, bug fixes, and more to the community. If you missed it, you can watch it on YouTube.
New videos
We've released several new videos to our growing YouTube channel- and cool news, we passed 1,000 subscribers! The support has been surprising in the best way possible. We're seeing a lot of repeat commenters and folks from the DVC meetups! It's been so rewarding to get positive feedback from the community and we're planning to build our YouTube presence even more.
Even Skeletor finds joy in this.
We now have 4 tutorials in our MLOps series. In the latest, we cover how to use your own GPU (on-premise or in the cloud) to run GitHub Actions workflows. Check it out and give it a try, the code examples are freely available :)
We also made our first ever "explainer" video to talk through how DVC works in five minutes.
As always, video requests are welcome! Reach out and let us know what topics and tutorials you want to see covered. And we appreciate any likes, shares, and subscribes on our growing YouTube channel.
From the community
A three-part CML series (featuring R!)
DVC ambassador Marcel Ribeiro-Dantas has published two of three tutorial blogs in a series on CML! Marcel's use case is especially cool because he's using R, plus some causal modeling related to his work in bioinformatics, with GitHub Actions.
In Part I, Marcel introduces his project and how he uses DVC, CML and GitHub Actions together (with his custom R library).
Continuous Machine Learning - Part I
In Part II, Marcel takes a deeper dive into Docker. He explains how to create a your own Docker image and test it. This case should be helpful for folks who want to include the CML library in their own Docker container.
Continuous Machine Learning - Part II
Real Python talks DVC
Kristijan Ivancic of Real Python, a library of online Python tutorials and lessons, created a seriously impressive DVC tutorial (this thing is a beast 🐺- it has a table of contents!)
And, the Real Python podcast discussed their DVC tutorial (plus the joys of version control for data!) on a recent episode.
Episode 25: Data Version Control in Python and Real Python Video Transcripts
Recommended reading
There's a lot of cool stuff happening out there in the data science world 🌏!
- Fabiana Clemente, Chief Data Officer of YData, published a blog for The Startup about four reasons to start using data version control- and, with her expertise in data privacy, she's especially well-qualified to explain the role of DVC in compliance and auditing! Check out her blog (it comes with a quick-start tutorial, too).
4 reasons why data scientists should version data
- Ryzal Kamis at the AI Singapore Makerspace shared a blog (the first of two!) about creating end-to-end CI/CD workflows for machine learning. In his first blog, Ryzal gives a high-level overview of the need for data version control and compares several tools in the space. Then he gives a walkthrough (quite easy to follow!) of how DVC fits in his workflow. We're eagerly awaiting the second installment of this series, which promises to bring more advanced automation scenarios and a CI/CD pipeline.
Data Versioning for CD4ML
- Isaac Sacolick, contributing editor at InfoWorld, penned an article about the growing field of MLOps and its role in data-driven businesses. He writes:
Too many data and technology implementations start with poor or no problem statements and with inadequate time, tools, and subject matter expertise to ensure adequate data quality. Organizations must first start with asking smart questions about big data, investing in dataops, and then using agile methodologies in data science to iterate toward solutions.
Read the rest here:
MLops: The rise of machine learning operations
Thanks everyone, that's a wrap for this month. Be safe, stay in touch, and get ready for pumpkin spice latte season 🎃.