DVC questions
Q: Is there a way to plot all columns in a .csv file on a single graph using dvc plots?
By default, dvc plots graphs one or two columns from the metric file of your choice (use the -x and -y flags to specify which columns).
However, there's nothing special about the way DVC makes plots. The plotting command is a wrapper for the Vega-Lite grammar, which can make pretty much any kind of plot you can imagine. If you check inside .dvc/plots/, you'll see a few Vega-Lite template files. That's where the plotting instructions are stored!
You can create your own, or modify the existing templates, by following the instructions in our docs.
In short, you'll create a new template and then run dvc plots show -t <name-of-template> to use it!
Vega-Lite has an interactive template editor online, which might help you test out ideas. Happy creating, and if you come up with a template you'd like to share with the DVC community, consider opening a pull request!
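As a rough starting point, here is a minimal Python sketch that generates such a template. It relies on Vega-Lite's fold transform to reshape several columns into one "metric"/"value" pair so each column becomes its own line. The column names (train_loss, val_loss, epoch) and the helper name are hypothetical, and the <DVC_METRIC_DATA> placeholder is an assumption about how DVC injects your data; compare it against the templates already in .dvc/plots/ before relying on it.

```python
import json


def build_multicolumn_template(columns, x_field):
    """Return a Vega-Lite spec (as a dict) that draws one line per column."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        # Assumption: DVC replaces this anchor with the rows of the plotted file.
        "data": {"values": "<DVC_METRIC_DATA>"},
        # fold turns wide data (one column per metric) into long data:
        # every row is repeated once per column, with the column name in
        # "metric" and the cell value in "value".
        "transform": [{"fold": list(columns), "as": ["metric", "value"]}],
        "mark": "line",
        "encoding": {
            "x": {"field": x_field, "type": "quantitative"},
            "y": {"field": "value", "type": "quantitative"},
            # One color (and therefore one line) per original column.
            "color": {"field": "metric", "type": "nominal"},
        },
    }


# Hypothetical column names; replace them with the headers of your .csv file.
template = build_multicolumn_template(["train_loss", "val_loss"], x_field="epoch")
print(json.dumps(template, indent=4))
```

Saving the printed JSON as a new file next to the existing templates, and passing that file's name to the -t flag, should be enough to try it out. The fold list has to name every column you want on the graph, so it needs updating if your .csv gains new columns.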
Q: My teammate and I are having some issues keeping our workspaces synced. We're tracking some folders with DVC, and he recently added a new file to each of these folders. How does he update the tracked folders and push the new contents so I can access them, too?
Your partner should first run
$ dvc add <folder>
$ dvc push
to tell DVC about the new files and push their contents to remote storage. Next, they'll run:
$ git commit <folder>.dvc
$ git push
to update your shared Git repository. Then you can run git pull and dvc pull to sync the changes with your local workspace!
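If several folders changed at once, the same steps can be scripted. The sketch below is only an illustration: the folder names are hypothetical, the helper names are made up for this example, and it defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def sync_commands(folder):
    """Commands for the teammate who added files to a DVC-tracked folder."""
    return [
        ["dvc", "add", folder],   # re-hash the folder and update <folder>.dvc
        ["dvc", "push"],          # upload the new contents to remote storage
        ["git", "commit", "-m", f"Update {folder}", f"{folder}.dvc"],
        ["git", "push"],          # share the updated .dvc file
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Hypothetical folder names; replace them with the folders you track.
for folder in ["data/images", "data/labels"]:
    run_all(sync_commands(folder), dry_run=True)
```

Calling run_all with dry_run=False runs the commands for real and stops at the first one that fails, which avoids committing a .dvc file whose data never reached the remote.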
Q: I forgot to declare a metric output in my dvc.yaml file, so one of my metrics is currently untracked. How can I fix this without rerunning the stage? It takes a long time to run.
No problem. What you'll want to do is edit your dvc.yaml file to declare the metric, and then run dvc commit dvc.yaml to store the change.
dvc commit is a helpful command that updates your dvc.lock file and .dvc files as needed, which forces DVC to accept any modifications to tracked data currently in your workspace. That should cover the case where you have a metric file from your last pipeline run in your workspace, but forgot to add it to dvc.yaml as an output!
Check out the docs for more about dvc commit and how it can help you edit pipeline dependencies as you work.
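To make the edit concrete, here is a small Python sketch of what changes in the file. It works on the parsed contents of dvc.yaml as a plain dictionary (how you load and save the YAML is up to you). The stage name train, the file names, and the helper name are all hypothetical; the shape of the metrics entry follows the dvc.yaml format, where each entry is a path with optional settings such as cache.

```python
import copy


def declare_metric(pipeline, stage, path, cache=False):
    """Return a copy of the parsed dvc.yaml with `path` listed as a metric of `stage`."""
    updated = copy.deepcopy(pipeline)
    metrics = updated["stages"][stage].setdefault("metrics", [])
    # An entry is either a bare path or a one-key mapping {path: {...options}}.
    already_declared = any(
        m == path or (isinstance(m, dict) and path in m) for m in metrics
    )
    if not already_declared:
        metrics.append({path: {"cache": cache}})
    return updated


# Hypothetical pipeline with a single stage that forgot its metric file.
pipeline = {
    "stages": {
        "train": {
            "cmd": "python train.py",
            "deps": ["train.py", "data"],
            "outs": ["model.pkl"],
        }
    }
}

updated = declare_metric(pipeline, "train", "metrics.json")
print(updated["stages"]["train"]["metrics"])
```

Only the metrics list of the stage changes; the command and dependencies stay untouched, which is why committing the change is enough and the stage does not have to run again.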
Q: Can I have multiple dvc.yaml files?
Yes. The catch is that they have to be in separate directories. For example, you can define several independent pipelines, each in its own dvc.yaml file. It's also possible to spread a single pipeline across more than one dvc.yaml file. DVC analyzes all of them to rebuild the DAG(s), for example during dvc repro.
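To see which pipeline files a project contains, a short Python sketch like the one below can list them. It builds a throwaway directory with two hypothetical pipelines (featurize and train) and walks it; it is not how DVC itself collects stages, only a way to picture the one-file-per-directory layout.

```python
import tempfile
from pathlib import Path


def find_pipeline_files(repo_root):
    """Return every dvc.yaml under repo_root, sorted, relative to the root."""
    root = Path(repo_root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("dvc.yaml"))


with tempfile.TemporaryDirectory() as repo:
    # Hypothetical layout: one pipeline file per directory.
    for directory in ["featurize", "train"]:
        (Path(repo) / directory).mkdir()
        (Path(repo) / directory / "dvc.yaml").write_text("stages: {}\n")
    found = find_pipeline_files(repo)

print(found)  # ['featurize/dvc.yaml', 'train/dvc.yaml']
```

Because the file name is fixed, a directory can hold at most one dvc.yaml, which is where the separate-directories requirement comes from.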
Q: I want to work on my DVC pipeline on a different computer than usual. For the stage I'm developing, I don't need access to all the data dependencies of the earlier stages. Is there a way to download only what I need?
Say for example that you have a pipeline like this:
+----------+
| data.dvc |
+----------+
      *
      *
      *
   +----+
   | s1 |
   +----+
      *
      *
      *
   +----+
   | s2 |
   +----+
      *
      *
      *
   +----+
   | s3 |
   +----+
where stage s2 is frozen (meaning its dependencies will not change and we can be reasonably sure the outputs of s2 are static).
To work on stage s3 in a new workspace, you could run:
$ dvc pull s2
$ dvc repro s3
This set of commands will pull only the outputs of the targeted stage (not the data corresponding to data.dvc), and then execute only the final stage of your pipeline.
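The difference between pulling a single stage and pulling everything upstream of it can be pictured with a toy model of the pipeline above. This is a simplified sketch, not DVC's implementation: the UPSTREAM mapping and the function name are made up for illustration, and the with_deps argument mirrors the idea behind the --with-deps option of dvc pull.

```python
# Each stage maps to the stage it directly depends on (None for the root).
UPSTREAM = {"data.dvc": None, "s1": "data.dvc", "s2": "s1", "s3": "s2"}


def stages_to_pull(target, with_deps=False):
    """Stages whose outputs would be downloaded for `target` in this toy model."""
    stages = [target]
    if with_deps:
        # Walk up the chain until the root of the pipeline is reached.
        parent = UPSTREAM[target]
        while parent is not None:
            stages.append(parent)
            parent = UPSTREAM[parent]
    return stages


print(stages_to_pull("s2"))                  # ['s2']
print(stages_to_pull("s2", with_deps=True))  # ['s2', 's1', 'data.dvc']
```

In the first call only the outputs of s2 are needed, which is all that s3 reads. The second call shows how much more would be downloaded if every upstream stage were included.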
CML questions
Q: Why do you need Docker to run CML?
Even though we use Docker in many of our tutorials, you technically don't need it at all! Here's what's going on:
We use a custom Docker container that comes with the CML functions installed (as well as some useful data science tools like Python, Vega-Lite, and CUDA drivers). If you want to use your own Docker container, that's fine too. Just make sure you install the CML library of functions on your runner.
To install CML as an npm package on your runner, we recommend:
npm i -g @dvcorg/cml
Once this is done, you should be able to execute functions like cml publish and cml send-comment on your runner.
For more tips about using CML without Docker, see our docs.
Q: I'm using CML to print a dvc metrics diff to my pull request in GitHub, but I'm getting an error: token not found. What does that mean?
Generally, token refers to an authorization token that grants your runner certain permissions with the GitHub API, such as the ability to post a comment on your pull request. If you're working in GitHub, you don't have to follow any manual steps to create a token. But you do need to make sure the environment variables in your workflow are named properly.
Make sure you've specified the following field in your workflow file:
env:
  repo_token: ${{ secrets.GITHUB_TOKEN }}
The variable must be called repo_token for CML to recognize it!
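To see why the name matters, consider this Python sketch of a lookup by exact variable name. It is an illustration only and not CML's actual code; the function name and error text are made up, and the token values are placeholders.

```python
import os


def find_repo_token(environ=None):
    """Look up the token under the exact name repo_token, or fail loudly."""
    env = os.environ if environ is None else environ
    token = env.get("repo_token")
    if not token:
        raise RuntimeError("token not found: set repo_token in the workflow env")
    return token


# The secret exists, but under a different name: the lookup fails.
try:
    find_repo_token({"GITHUB_TOKEN": "placeholder"})
except RuntimeError as error:
    print(error)

# Same secret exposed under the expected name: the lookup succeeds.
print(find_repo_token({"repo_token": "placeholder"}))
```

A secret that exists under any other name, or with different capitalization, is invisible to a lookup like this, which is consistent with the error appearing even though the token itself is valid.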
A few other pointers:
- In GitLab, you have to set a variable in your repository called repo_token whose value is a personal access token. We have step-by-step instructions in our docs. Forgetting to set this is the #1 issue we see with first-time GitLab CI users!
- In Bitbucket Cloud, you need to set a variable in your repository called repo_token whose value is your API credentials. We have detailed docs for creating this token, too.
- Need to see more sample workflows to get a feel for it? We have plenty of case studies to examine.
Q: Is there any reason why an experimental DVC feature wouldn't work on the CML Docker container?
Generally, no. The container dvcorg/cml:latest should have the latest DVC release and the latest CML release (you can see where DVC and CML are installed from in our Dockerfile). So besides the time it takes for releases to be published on various package managers, there shouldn't be any lag. That means experimental features are ready to use on your runner!
Note that you can also install pre-release versions of DVC. Check out our docs about installing the latest stable version ahead of official releases.