Happy Halloween from Pirate DeeVee!
DVC questions
Q: What's in a .dvc
file, and what would happen if decided not push my .dvc
files to my Git repo?
DVC creates lightweight metafiles (.dvc
files) that correspond to large
artifacts in your project. These .dvc
files contain pointers to your artifacts
in remote storage (we use a simple content-based storage scheme). Because we use
content-based storage, the remote storage itself isn't designed for browsing
(although
there are some discussions about
how to make stored files more "discoverable", and you can always identify them
manually by their contents and meta-information like timestamps).
Your .dvc
files help establish meaningful links between human-readable
filenames and file contents in remote storage, as well as to use Git versioning
on your stored datasets and models. You can think of your DVC remote storage as
a compliment to your Git repository, not a replacement.
In other words⦠if you're not Git versioning your .dvc
files, you're not
versioning anything in DVC remote storage!
Q: Can I limit the number of network connections used by DVC during dvc pull
?
Yep- by default, DVC data transfer operations use a number of threads
proportional to the number of CPUs detected. But, there's a handy flag for
dvc pull
and dvc push
that lets you override the defaults:
-j <number>, --jobs <number> - number of threads to run
simultaneously to handle the downloading of files from
the remote. The default value is 4 * cpu_count(). For
SSH remotes, the default is just 4. Using more jobs may
improve the total download speed if a combination of small
and large files are being fetched.
Q: I'm working on a multi-class classification task. Can dvc plots
show multiple precision recall curves- one for each class?
Currently, dvc plots
doesn't support multiple linear curves on a single plot
(except for dvc plots diff
, of course!). But, you could make one precision
recall curve per class and display them side-by-side.
To do this, you'd want to write the precision recall curve values to separate
files for each class (prc-0.json
,prc-1.json
, etc.). Then you would run:
$ dvc plots show prc-0.json prc-1.json
And you'll see two plots side-by-side! A benefit of this approach is that when
you run dvc plots diff
to compare precision recall curves across Git commits,
you'll get a comparison plotted for each class.
Q: Are you sure I should commit my .dvc/config
file? It contains my logging credentials for storage, and I'm nervous about adding it to a shared Git repository.
This is a common scenario- you don't necessarily want to broadcast your remote
storage credentials to everyone on your team, but you still want to check-in
your DVC setup (meaning, your .dvc/config
file). In this case, you want to use
a local
config file!
You can use the command
$ dvc config --local
to setup remote credentials that will be stored in .dvc/config.local
- by
default, this file is in your .gitignore
so you don't have to worry about
accidentally committing secrets to your Git repository.
Check out the docs for more,
including the --system
and --global
options for setting your configuration
for multiple projects and users respectively.
CML Questions
Q: What's the file size limit for publishing files with cml publish
?
cml publish
is a service for hosting files that are embedded in CML reports,
like images, audio files, and GIFS. By default, we have a limit of 2 MB per
upload.
If your files are larger than this (which can happen, depending on the machine
learning problem you're working on!) we recommend using GitLab's artifact
storage.
Based on discussions in the community,
we recently implemented a CML flag (--gitlab-uploads
) to streamline the
process:
$ cml publish movie.mov --md --gitlab-uploads > report.md
Note that we don't currently have an analagous solution for GitHub, because GitHub artifacts expire after 90 days (whereas they're permanent in GitLab).
Q: I'm getting a mysterious error message, Failed guessing mime type of file
, when I try to use cml publish
. What's going on?
This error message usually means that the target of cml publish
- for example,
$ cml publish <target file>
is not found. Check for typos in the target filename and ensure that the file was in fact generated during the run (if it isn't part of your Git repository). We've opened an issue to add a more informative error message in the future.
Q: In my GitHub Actions workflow, I use dvc metrics diff
to compare metrics generated during the run to metrics on the main branch and print a table- but the table isn't showing any of the metrics from main
. What could be happening?
When a continuous integration runner won't report metrics from previous versions of your project (or other branches), that's usually a sign that the runner doesn't have access to the full Git history of your project or your metrics themselves. Here are a few things to check for:
- Did you fetch your Git working tree in the runner? Functions like
dvc metrics diff
require the Git history to be accessible- make sure that in your workflow, before you run this function, you've done agit fetch
. We recommend:
$ git fetch --prune --unshallow
-
Are your metrics in your DVC remote? If your metrics are cached (which they are by default when you create a DVC pipeline), your DVC remote should be accessible to your runner. That means you need to add any credentials as repository secrets (or variables, in GitLab), and do
dvc pull
in your workflow before attemptingdvc metrics diff
. -
Are your metrics in your local workspace? If you are not using a DVC remote, your metric files must be uncached and committed to your Git repository. To explore an example, say you have a pipeline stage that creates
metric.json
:
$ dvc run -n mystage -m metric.json train.py
By default, metric.json
is cached and ignored by Git- which means that if you
aren't using a DVC remote in your CI workflow, metric.json
will effectively be
abandoned on your local machine! You can avoid this by using the -M
flag
instead of -m
in dvc run
, or manually adding the field cache: false
to
your metric in dvc.yaml
. Be sure to remove your metrics from any .gitignore
files, and commit and push them to your Git repository.
That's all for this month- Happy Halloween! Watch out for scary bugs. π