Here are some of our top Q&A's from around the community. With the launch of CML earlier in the month, we've got some new ground to cover!
DVC questions
Q: What's the relationship between the DVC remote and cache? If I have an external cache, do I really need a DVC remote?
You can think of your DVC remote similar to your Git remote, but for data and model artifacts- it's a place to backup and share artifacts. It also gives you methods to push and pull those artifacts to and from your team.
Your DVC cache (by default, it's located in .dvc/cache
) serves a similar
purpose to your Git objects database (which is by default located in
.git/objects
). They're both local caches that store files (including various
versions of them) in a content-addressable format, which helps you quickly
checkout different versions to your local workspace. The difference is that
.dvc/cache
is for data/model artifacts, and .git/objects
is for code.
Usually, your DVC remote is a superset of .dvc/cache
- everything in your cache
is a copy of something in your remote (though there may be files in your DVC
remote that are not in your cache (and vice versa) if you have never attempted
to push
or pull
them locally).
In theory, if you are using an external cache- meaning a DVC cache configured on a separate volume (like NAS, large HDD, etc.) outside your project path- and all your projects and all your teammates use that external cache, and you know that the storage is highly reliable, you don't need to also have a DVC remote. If you have any doubts about access to your external cache or its reliability, we'd recommend also keeping a remote.
Q: One of my files is an output of a DVC pipeline, and I want to track this file with Git and store it in my Git repository since it isn't very big. How can I make this work?
Yes! There are two approaches. We'll be assuming you have a pipeline stage that
outputs a file, myfile
.
- If you haven't declared the pipeline stage with
dvc run
yet, then you'll do it like this:
$ dvc run -n <stage name> -d <dependency> -O myfile
Note that instead of using the flag -o
for specifying the output myfile
,
we're using -O
- it's shorthand for --outs-no-cache
. You can
read about this flag in our docs.
- If you've already created your pipeline stage, go into your
dvc.yaml
and manually add the fieldcache: false
to the stage as follows:
outs:
- myfile:
cache: false
Please note one special case: if you previously enabled hardlinks or symlinks in
DVC via dvc config cache
, you may need to run dvc unprotect myfile
to fully
unlink myfile
from your DVC cache. If you haven't enabled these types of file
links (and if you're not sure, you probably didn't!), this step is unncessary.
See our docs for more.
Q: Can I change my params.yaml
file to a .json
?
Yes, this is straightforward- you change your params.yaml
to params.json
in
your workspace, and then use it in dvc run
:
$ dvc run -p params.json:myparam ...
Alternately, if your pipeline stage has already been created, you can manually
edit your dvc.yaml
file to replace params.yaml
with params.json
.
For more about the params.yaml
file,
see our docs.
Q: Is there a guide for migrating from Git-LFS to DVC?
We don't know of any published guide. One of our users shared their procedure for disabling LFS:
$ git lfs uninstall
$ git rm .gitattributes
$ git rm .lfsconfig
Then you can dvc add
files you wish to put in DVC tracking, and dvc push
them to your remote. After that, git commit
and you're good!
Note that, if you're going to delete any LFS files, make sure you're certain the corresponding data has been transferred to DVC.
Q: Is there a way to use DVC and CML to validate a model in a GitHub Action, without making the validation data available to the user opening the Pull Request?
We don't have special support for this use case, and there may be some security downsides to using a confidential validation dataset with someone else's code (be sure nothing in their code could expose your data!). But, there are ways to implement this if you're sure about it.
One possible approach is to create a separate "data registry" repository using a
private cloud bucket to store your validation dataset
(see our docs about the why and how of data registries).
Your CI system can be setup to have access to the data registry via secrets
(called "variables" in GitLab). Then when you run validation via
dvc repro validate
, you could use dvc get
to pull the private data from the
registry.
The data is never exposed to the user in an interactive setting, only on the runner- and there it's ephemeral, meaning it does not exist once the runner shuts down.
CML questions
Q: Sometimes when I make a commit on a branch, my CI workflow isn't triggered. What's going on?
If your workflow is set to trigger on a push (as in the CML use cases), it isn't
enough to git commit
locally- you need to push to your GitHub or GitLab
repository. If you want every commit to trigger your workflow, you'll need to
push each one!
What about if you don't want a push to trigger your worfklow? In GitLab, you
can use the
[ci skip]
flag- make sure
your commit message contains [ci skip]
or [skip ci]
, and GitLab CI won't run
the pipeline in your gitlab-ci.yml
file.
In GitHub Actions, this flag isn't supported, so you can manually kill any workflows in the Actions dashboard. For a programmatic fix, check out this workaround by Tim Heuer.
Q: Can I do the bulk of my model training outside of my CI system, and then share the result with CML?
Definitely! This is a desirable workflow in several cases:
- You have a preferred approach for experiment tracking (for example, DVC or MLFlow) that you want to keep using
- You don't want to set up a self-hosted runner to connect your computing resources to GitHub or GitLab
- Training time is on the order of days or more
CML is very flexible, and one strong use case is for sanity checking and evaluating a model in a CI system post-training. When you have a model that you're satisifed with, you can check it into your CI system and use CML to evaluate the model in a production-like environment (such as a custom Docker container), report its behavior and informative metrics. Then you can decide if it's ready to be merged into your main branch.
Q: Can I make a CML report comparing models across different branches of a project?
Definitely. This is what dvc metrics diff
is for- like a git diff
, but for
model metrics instead of code. We made a video about how to do this in CML!
Q: In the function cml publish
, it looks like you're uploading published files to https://asset.cml.dev
. Why don't you just save images in the Git repository?
If an image file is created as part of your workflow, it's ephemeral- it doesn't exist outside of your CI runner, and will disappear when your runner is shut down. To include an image in a GitHub or GitLab comment, a link to the image needs to persist. You could commit the image to your repository, but typically, it's undesireable to automatically commit results of a CI workflow.
We created a publishing service to help you host files for CML reports. Under the hood, our service uploads your file to an S3 bucket and uses a key-value store to share the file with you.
This covers a lot of cases, but if the files you wish to publish can't be shared
with our service for security or privacy reasons, you can emulate the
cml publish
function with your own storage. You would push your file to
storage and include a link to its address in your markdown report.