DVC questions
Q: Is there an equivalent of git restore <file> for DVC?
Yes! You'll want dvc checkout. It restores the corresponding version of your DVC-tracked file or directory from the cache to your local workspace. Read up in our docs for more info!
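For example, to discard local changes to a DVC-tracked file and restore the cached version (the path below is just a placeholder for your own data):
$ dvc checkout data/data.csv
Running dvc checkout with no arguments restores everything DVC tracks in the workspace.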
Q: My dataset is made of more than a million small files. Can I use an archive format, like tar.gz, with DVC?
There are some downsides to using archive formats, and we often discourage it, but let's review some factors to consider so you can make the best choice for your project.
- If your tar.gz file changes at all (perhaps because you changed a single file before zipping), you'll end up with an entirely new copy of the archive every time you commit! This is not very space efficient, but if space isn't an issue it might not be a dealbreaker.
- Because of the way we optimize data transfer, you'll end up transferring the whole archive anytime you modify a single file and dvc push/dvc pull.
- In general, archives don't play nice with the concept of diffs. Looking back at your Git history, it can be challenging to see how files were deleted, modified, or added when you're versioning archives.
While we can't do much about the general issues that archives present for version control systems, DVC does have some options that might help you achieve better data transfer speeds. We recommend exploring DVC's built-in parallelism: data transfer commands like dvc push and dvc pull have a flag (-j) for increasing the number of jobs run simultaneously. Check out the docs for more details.
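For example, you might bump the job count when pushing or pulling (8 here is an arbitrary illustration; tune it for your machine and remote):
$ dvc push -j 8
$ dvc pull -j 8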
In summary, the advantage of using an archive format will depend on both how often you modify your dataset and how often you need to push and pull data. You might consider exploring both approaches (with and without compression) and running some speed tests for your use case. We'd love to know what you find!
Q: My DVC remote is a server with a self-signed certificate. When I push data, DVC gives me an SSL verification error. How can I get around this?
On S3 or S3-compatible storage, you can configure your AWS CLI to use a custom certificate path. As suggested by their docs, you can also set the environment variable AWS_CA_BUNDLE to your .pem file.
Similarly, on HTTP and WebDAV remotes, there's a REQUESTS_CA_BUNDLE environment variable that you can point to your self-signed certificate file.
Then, when DVC tries to access your storage, you should be able to get past SSL verification!
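For example, before pushing you could point these variables at your certificate bundle (the path below is just a placeholder):
$ export AWS_CA_BUNDLE=/path/to/ca-bundle.pem       # S3 and S3-compatible remotes
$ export REQUESTS_CA_BUNDLE=/path/to/ca-bundle.pem  # HTTP and WebDAV remotes
$ dvc push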
Q: I want to be able to make my own plots in Python with data points from my dvc plots, including older versions of those plots. What do you recommend to get the raw historical data?
We suggest using DVC's Python API (note that the Repo class here comes from DVC itself, not GitPython):
from dvc.repo import Repo

revs = ["HEAD", "v1.0"]  # example: the revisions you want plot data for
plots_data = Repo().plots.collect(revs=revs)
Then you can plot the data contained in plots_data to your heart's content!
Q: Is it safe to share a DVC remote between two projects or registries?
You can share a remote with as many projects as you like. Because DVC uses content-addressable storage, you'll still get benefits like file deduplication across every project that uses the remote. This can be useful if you're likely to have many shared files across projects.
One big thing to watch out for: you have to be very careful when clearing the DVC cache. Make sure you don't remove files associated with another project when running dvc gc by using the --projects flag. Read up in the docs!
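For example, a hedged sketch of cleaning up while protecting a second project that shares the same cache or remote (the path is a placeholder):
$ dvc gc --workspace --projects /path/to/other-project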
Q: Can I throttle the number of simultaneous uploads to remote storage with DVC?
Yep! That'll be the -j/--jobs flag, for example:
$ dvc push -j <number>
will control the number of simultaneous uploads DVC attempts when pushing files to your remote storage (see more in our docs).
CML questions
Q: I have a DVC pipeline that I want to run in CI/CD. Specifically, I only want to reproduce the stages that have changed since my last commit. What do I do?
DVC pipelines, like makefiles, will only reproduce stages that DVC detects have changed since the last commit. So to do this in CI/CD systems like GitHub Actions or GitLab CI, you'll want to make sure the workflow a) syncs the runner with the latest version of your pipeline, including all inputs and dependencies, and b) reruns your DVC pipeline.
In practice, your workflow needs to include these two commands:
$ dvc pull
$ dvc repro
You pull the latest version of your pipeline, inputs, and dependencies from cloud storage with dvc pull, and then dvc repro intelligently reproduces the pipeline (meaning, it should avoid rerunning stages that haven't changed since the last commit).
Check out an example workflow here.
Q: I'm using DVC and CML to pull data from cloud storage, then train a model. I want to push the trained model into cloud storage when I'm done. What should I do?
One approach is to append
$ dvc add <model>
$ dvc push <model>
to the end of your workflow. This will push the model file, but there's a downside: it won't keep a strong link between the pipeline (meaning, the command you used to generate the model and any code/data dependencies) and the model file.
What we recommend instead is creating a DVC pipeline with a single stage (training your model) that declares your model file as an output.
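If you haven't defined that stage yet, here's a minimal sketch using dvc stage add (the stage name, script, and file names are placeholders for your own):
$ dvc stage add -n train -d train.py -d data/ -o model.pkl python train.py
With that stage in place, your workflow can look like this: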
# get data
$ dvc pull --run-cache
# run the pipeline
$ dvc repro
# push to remote storage
$ dvc push --run-cache
When you do this workflow with the --run-cache flags, you'll be able to save all the results of the pipeline in the cloud (read more here). When the run has completed, you can go to your local workspace and run:
$ dvc pull --run-cache
$ dvc repro
This will put your model in your local workspace! And, you get an immutable link between the code version, data version and model you end up with.
We recommend this approach so you don't lose track of how model files relate to the data and code that produced them. It's a little more work to set up, but Future You will thank you!