November ’19 DVC❤️Heartbeat
Every month we share our news, findings, interesting reads, community takeaways, and everything else along the way. Some of it relates to our brainchild DVC and its journey. The rest is a collection of exciting stories and ideas centered around ML best practices and workflow.
How cool is this handmade swag from our community? We were in tears!
The past few months have been so busy and full of great events! We love how involved our community is and can’t wait to share more with you:
- We have organized our very first meetup! So many great conversations, new use cases and insights! Many thanks to Dan Fischetti from Standard Cognition, who joined our Dmitry Petrov on stage. Watch the recording here.
- Hacktoberfest was a great exercise for the DVC team on many levels and we really enjoyed supporting new contributors. Kudos to Nabanita Dash for organizing a cool DVC-themed hackathon!
Our open source event Hacktoberfest-themed meet-up was a success. Thanks to @DVCorg and it's mentors for all the hard work.
— Programming Society IIIT-Bh (@psociiit) October 18, 2019
Some of our attendees made their first PR on DVC and got them merged. Kudos to the team!
PS: 🍕 was the second best thing of the evening. pic.twitter.com/zAWC0TVlPd
- We’ve crossed the 4k stars mark on GitHub!
- DVC participated in Devsprints (thank you, Kurian Benoy, for the intro!) and we were happy to jump in and help with some mentoring.
Thank you @DVCorg for participating in the Devsprints, by @FossMEC and @excelmec. We had @shcheklein who joined us all the way from SF and explained how open source is boosting the future. Srinidhi and @kurianbenoy2 helped participants get started to contributing to the project.
— FOSS MEC (@FossMec) November 8, 2019
Devsprints participants on our Discord channel
- DVC became part of the default Homebrew formulae! So now you can install it as easily as brew install dvc!
- We helped 2 aspiring speakers deliver their very first conference talks. Kurian Benoy spoke at PyCon India and Aman Sharma spoke at SciPy India. Supporting speakers is something we are passionate about. If you ever wanted to give a talk on a DVC-related topic, we are here to help. Just let us know!
- Our own Dmitry Petrov went to Europe to speak at the Open Source Summit Europe in Lyon and Highload++ in Moscow, and made a stop in Berlin to co-host a meetup with our favourite AI folks from Codecentric!
Here are some of the great pieces of content around DVC and ML ops that we discovered in October and November:
- Deploy Machine Learning Models with Django by Piotr Płoński.
…building your ML system has a great advantage — it is tailored to your needs. It has all features that are needed in your ML system and can be as complex as you wish. This tutorial is for readers who are familiar with ML and would like to learn how to build ML web services.
Deploy Machine Learning Models with Django
In this article, I want to show 3 powerful tools to simplify and scale up machine learning development within an organization by making it easy to track, reproduce, manage, and deploy models.
How to Manage Your Machine Learning Workflow with DVC, Weights & Biases, and Docker
We do believe that Data Science is a field that can become even more mature by using best practices in project development and that Conda, Git, DVC, and JupyterLab are key components of this new approach
Creating a solid Data Science development environment
DVC is a powerful tool and we covered only the fundamentals of it.
Creating reproducible data science workflows with DVC
Discord gems
There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.
We are sifting through the issues and discussions to share the most interesting takeaways with you.
Q: When you do a dvc import you get the state of the data in the original repo at that moment in time from that repo, right? The overall state of that repo (e.g. Git commit id (hash)) is not preserved upon import, right?
On the contrary, DVC relies on the Git commit id (hash) to determine the state of the data as well as the code. The Git commit id (hash) is saved in the DVC-file upon import. The data itself is copied/downloaded into the DVC repo cache but is not pushed to the remote, so DVC does not create duplicates. When needed, you can advance/update the import with dvc update. The Git commit hash is saved to provide reproducibility: even if the source repo HEAD has changed, your import stays the same until you run dvc update or redo dvc import.
Q: I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements. That means that permanent deletion of files with sensitive data needs to be fully supported.
Yes, in this sense DVC is not very different from using bare S3, SSH or any other storage where you can go and just delete data. DVC can add a bit of overhead when locating a specific file to delete, but otherwise it’s all the same: you will be able to delete any file you want. Read more details in this discussion.
Q: Is there any way to get the remote url for specific DVC-files? Say, I have a DVC-file foo.png.dvc. Is there a command that will show the remote url, something like dvc get-remote-url foo.png.dvc, which will return e.g. the Azure url to download?
There is no special command for that, but if you are using Python, you could use our API specifically designed for that:
from dvc.api import get_url

# Path of the tracked file inside the repo (example value).
path = "foo.png"
url = get_url(path,
              repo="https://github.com/user/proj",
              rev="mybranch")
So you could also wrap this call in a small script and use it from the CLI.
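Such a wrapper might look roughly like the sketch below. It is only an illustration: the function names and the --repo/--rev options are our own invention for this example, not an existing DVC command. The only DVC call it relies on is get_url from the answer above, and we assume the positional argument is the path of the tracked file rather than the DVC-file itself.

```python
import argparse


def build_parser():
    """Command-line interface for the hypothetical wrapper script."""
    parser = argparse.ArgumentParser(
        description="Print the remote storage URL of a DVC-tracked file."
    )
    # Assumption: this is the tracked file (e.g. foo.png), not foo.png.dvc.
    parser.add_argument("path", help="path of the tracked file inside the repo")
    parser.add_argument("--repo", default=None,
                        help="Git URL or local path of the DVC project")
    parser.add_argument("--rev", default=None,
                        help="Git branch, tag, or commit")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported here so the parser works even where DVC is not installed.
    from dvc.api import get_url

    print(get_url(args.path, repo=args.repo, rev=args.rev))
```

Saved as a script that calls main() at the bottom, this could be invoked with the file path plus optional --repo and --rev values, mirroring the arguments of the Python call.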
Q: Can DVC be integrated with MS Active Directory (AD) authentication for controlling access? The GDPR requirements would force me to use such a system to manage access.
Short answer: no (as of the date of publishing this Heartbeat issue). The good news is that it should be very easy to add, so we would welcome a contribution :) Azure has a connection argument for AD, and quick googling shows this library, which is probably what is needed.
Q: How do I uninstall DVC on a Mac when it was installed as a package?
When installing using a plain .pkg it is a bit tricky to uninstall, so we usually recommend using things like brew cask instead if you really need the binary package. Try running these commands:
$ sudo rm -rf /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc
to uninstall the package.
Q: We are using an SSH remote to store data, but the problem is that everyone within the project has a different username on the remote machine and thus we cannot set it in the config file (that is committed to Git). Is there a way to add just the host and path, without the username?
Yes, you should use the --local or --global config options to set the user per project or per user without sharing (committing) it to Git:
$ dvc remote modify myremote --local user myuser
or
$ dvc remote modify myremote --global user myuser
Q: I still get the SSL ERROR when I try to perform a dvc push with or without use_ssl = false?
A simple environment variable like this:
$ export AWS_CA_BUNDLE=/path/to/cert/cert.crt
$ dvc push
should do the trick for now. We plan to fix the ca_bundle option soon.
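If you drive DVC from a Python script, the same workaround can be applied to a single call instead of the whole shell session. Here is a minimal sketch using only the standard library; the helper names are made up for this example, and it assumes the dvc executable is on your PATH.

```python
import os
import subprocess


def env_with_ca_bundle(ca_bundle, base_env=None):
    """Return a copy of the environment with AWS_CA_BUNDLE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_CA_BUNDLE"] = ca_bundle
    return env


def push_with_ca_bundle(ca_bundle):
    """Run `dvc push` with the certificate bundle visible only to that process."""
    return subprocess.run(
        ["dvc", "push"], env=env_with_ca_bundle(ca_bundle), check=True
    )
```

Calling push_with_ca_bundle("/path/to/cert/cert.crt") leaves the environment of the calling process untouched, which can be handy when other commands in the same script should not pick up the custom bundle.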
Q: I have just finished a lengthy dvc repro and I’m happy with the result. However, I realized that I didn’t specify a dependency which I needed (and obviously is used in the computation). Can I somehow fix it?
Add the dependency to the stage file without rerunning/reproducing the stage. Rerunning is not needed, as this additional dependency hasn’t changed.
You would need to edit the DVC-file. In the deps section add:
- path: not/included/file/path
and run dvc commit file.dvc to save the changes without running the pipeline again. See an example here.
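If you would rather script the edit than do it by hand, the change itself is tiny. The sketch below works on a DVC-file that has already been parsed into a Python dict; reading and writing the YAML is left to whichever YAML library you prefer. The function name is hypothetical, and the sample stage in the usage lines is invented for illustration.

```python
def add_dependency(stage, dep_path):
    """Return a copy of a parsed DVC-file with dep_path listed under deps."""
    updated = dict(stage)
    deps = list(updated.get("deps") or [])
    # Skip the append if the dependency is already listed.
    if not any(dep.get("path") == dep_path for dep in deps):
        deps.append({"path": dep_path})
    updated["deps"] = deps
    return updated


# Invented example stage, only to show the shape of the data.
stage = {"cmd": "python train.py", "deps": [{"path": "train.py"}]}
updated = add_dependency(stage, "not/included/file/path")
```

After writing the updated structure back to the DVC-file, dvc commit file.dvc saves the changes as described above.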
Q: For some reason we need to always specify the remote name when doing a dvc push, e.g., dvc push -r upstream as opposed to dvc push (mind no additional arguments).
You can mark a “default” remote:
$ dvc remote add -d remote /path/to/my/main/remote
then dvc push (and other commands like dvc pull) will know to use the default remote.
Q: If I want stage B to run after stage A, but stage A has no output, can I specify A’s DVC-file as B’s dependency?
No, at least at the time of publishing this. You could use a phony output though. E.g. make stage A output some dummy file and make B depend on it. Please consider creating or upvoting a relevant issue on our GitHub if you’d like this to be implemented.
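As a rough illustration of the phony output idea, stage A's script could finish by writing a small marker file, which A then declares as its output and B lists as a dependency. The helper and the file name stage_a.done are placeholders we picked for this sketch.

```python
from pathlib import Path


def write_marker(marker_path, content="done\n"):
    """Create the dummy output file that stage A declares as its output."""
    marker = Path(marker_path)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(content)
    return marker


# At the very end of stage A's script (placeholder file name):
#     write_marker("stage_a.done")
# Stage A would then list stage_a.done as an output (-o in dvc run)
# and stage B would list it as a dependency (-d in dvc run).
```

One design choice to keep in mind: with constant content the marker's checksum never changes, so B would not be considered outdated just because A ran again. If B should rerun whenever A does, write something that varies between runs, such as a timestamp.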
Q: I’m just getting started with DVC, but I’d like to use it for multiple developers to access the data and share models and code. I do own the server, but I’m not sure how to use DVC with SSH remote?
Please refer to this answer on the DVC forum and check the documentation for the dvc remote add and dvc remote modify commands to see more options and details.
If you have any questions, concerns or ideas, let us know in the comments below or connect with the DVC team here. Our DMs on Twitter are always open, too.