DVC Questions
Q: I noticed I have a DVC `config` file and a `config.local` file. What's best practice for committing these to my Git repository?
DVC uses the `config` and `config.local` files to link your remote data storage to your project. `config` is intended to be committed to Git, while `config.local` is not: it's the file where you store sensitive information (e.g. your personal credentials for remote storage, such as username, password, or access keys) or settings that are specific to your local environment.

Usually, you don't have to worry about making sure your `config.local` file is ignored by Git. The only way to create a `config.local` file is to pass the `--local` flag explicitly to commands like `dvc remote` and `dvc config`, so you'll know you've made one! And your `config.local` file is ignored by Git by default. If you're concerned, take a look and make sure there are no settings in your `config.local` file that you actually want in your regular `config` file.
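For example (`myremote` and the bucket path are placeholders), you can keep the remote definition in the shared `config` while storing credentials only in `config.local`:

```bash
# Written to .dvc/config (committed to Git)
dvc remote add -d myremote s3://mybucket/dvcstore

# Written to .dvc/config.local (kept out of Git)
dvc remote modify --local myremote access_key_id 'mykey'
dvc remote modify --local myremote secret_access_key 'mysecret'
```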
To learn more about `config` and `config.local`, read up in our docs.
Q: What's the best way to install the new version of DVC in a Conda environment? I'm concerned about the `paramiko` dependency.
When you install DVC via `conda`, it will come with dependencies like `paramiko`. The only exception is installing DVC as a Python library with `pip`: there, you might want to specify the kind of remote storage you need to make sure all dependencies are present (like `boto3` for S3). You can run `pip install "dvc[<option>]"` with supported options `[s3]`, `[azure]`, `[gdrive]`, `[gs]`, `[oss]`, and `[ssh]`, or use `[all]` to include them all.
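For example (a sketch; pick whichever extras match the remotes you actually use):

```bash
# With conda, remote dependencies come bundled
conda install -c conda-forge dvc

# With pip, add extras for the remotes you use
pip install "dvc[s3,gdrive]"

# Or install support for every remote type
pip install "dvc[all]"
```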
For more about installing DVC and its dependencies, check out our docs.
Q: How do I keep track of changes in modules that my DVC pipeline depends on? For example, I have a pipeline stage that runs a script `prepare.py`, which imports a module `module.py`. If `module.py` changes, how will DVC know to rerun the pipeline stage?
If your DVC pipeline only lists `prepare.py` as a dependency, then changing code in module files won't trigger a re-run of the pipeline. That means if you run `dvc repro` after updating `module.py`, DVC will simply return the result of your last pipeline run and a message that nothing has changed.
To explain further why this happens: DVC is platform agnostic, so it doesn't know whether your command's executable is `python`, some other script interpreter, or a compiled binary for that matter. E.g. `dvc run -o hello.txt 'echo "Hello!" > hello.txt'` is a valid stage (where the executable is `echo`).
DVC also doesn't know what's going on inside the command's source code. Therefore, any file that your code requires internally should be explicitly specified as a pipeline stage dependency (`dvc run -d` in the CLI, or `deps:` in YAML) for DVC to track it.
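For instance, here's a minimal `dvc.yaml` sketch (the command and output path are assumptions) that lists both the script and the module it imports, so `dvc repro` will rerun the stage when either file changes:

```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
      - module.py       # declared explicitly so edits here trigger a rerun
    outs:
      - data/prepared   # hypothetical output path
```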
If you're not interested in adding modules as explicit dependencies, there are a few other approaches:

- Make your `requirements.txt` file a stage dependency (if the loaded module comes from a package).
- Manually rebuild the pipeline (with `dvc repro --force <stage>.dvc`) when you know an unmarked dependency has changed, although this is prone to human error.
- Keep a version/build number comment in the main script that always gets updated when an unmarked dependency changes; this could be automated.
See here for more information on similar use cases.
We also have an ongoing discussion about this issue on our GitHub repository, and we'd love your input. Please join the conversation if you can!
Q: My DVC pipeline has a lot of dependencies, and I don't want to manually write them all out in my `dvc.yaml` file. Are there any ways to use wildcards (like `*`) or specify directories as dependencies?
Yes, you can set a directory to be a dependency or an output of a DVC pipeline stage. This means you can have tens, hundreds, thousands, or millions of dependency files in one directory, and all you have to declare in the pipeline is the path to that directory.
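As a sketch (the stage, command, and directory names here are hypothetical), declaring the directory path is all that's needed; DVC tracks its contents for you:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/raw      # the whole directory is one dependency
    outs:
      - models        # a directory can be an output, too
```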
Check out all the options here.
CML Questions
Q: I heard there's a new CML feature using Terraform to provision runners. When is this coming out?
You're in luck, because we just shared this feature as part of the CML 0.3.0 pre-release! The pre-release introduced a new command, `cml runner`, which upgraded our previous method of launching cloud instances from a CI workflow using Docker Machine.

With the new `cml runner` command, built on Terraform, you can deploy instances on AWS and Azure with a single command (it used to take about 30 lines of code!). For example, to launch a `t2.micro` instance on AWS from your GitHub Actions or GitLab CI workflow, you'll run:
```bash
cml runner \
    --cloud aws \
    --cloud-region us-west \
    --cloud-type t2.micro \
    --labels cml-runner
```
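To sketch where this fits in a workflow file (the secret names and the runner label here are assumptions; adjust them to your own setup), a GitHub Actions pipeline might provision the instance in one job and then train on it in the next:

```yaml
# .github/workflows/cml.yaml (sketch)
name: train
on: [push]
jobs:
  deploy-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - name: Provision cloud runner
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner \
              --cloud aws \
              --cloud-region us-west \
              --cloud-type t2.micro \
              --labels cml-runner
  train:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    steps:
      - uses: actions/checkout@v2
      - run: python train.py   # your training command goes here
```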
Check out the pre-release notes and our example project repository to get started.
Q: My CI workflow creates a `report.md` document that gets published to my pull request by CML. I want to save the `report.md` file to my repository, too. Is this possible?
By default, files created in a GitHub Actions or GitLab CI workflow only exist on the runner; as soon as the runner shuts down, they vanish. Commands like `cml publish` and `cml send-comment` create persistent links to data visualizations, tables, and other outputs of your workflow so you can view them long after your run ends. However, by design, CML doesn't commit files to your repository (not all users want this!).

What you're likely looking for is an auto-commit: essentially, `git add` and `git commit` the files generated by the workflow to your repository. You can write this code into your workflow file manually, or you can use a GitHub Action like the Auto Commit or Add & Commit Actions.
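For the manual route, the relevant workflow step might look something like this (a sketch; the bot identity and commit message are placeholders, and your workflow's token needs push permission):

```bash
# Commit the generated report back to the repository
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add report.md
git commit -m "Add CML report [skip ci]" || echo "Nothing to commit"
git push origin HEAD
```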
Q: Do you have any suggested caching strategies with CML and DVC? My DVC pipeline runs in a CI workflow, and it depends on ~15 GB of data. I don't want to download this dataset to my runner every time the workflow runs.
Downloading data to a runner on every CI workflow run can be needlessly time-consuming, particularly when the data rarely changes.
While we don't have a CML-specific mechanism in the works for this use case, there are two main approaches we see as viable:
- Attach an EBS volume to the instance that runs your workflow. If you're using DVC, DVC needs to run in that volume (at the very least, your DVC cache must be there). A user recently let us know that this approach is working well for them and prevents unnecessary re-downloads of their DVC cache. They also recommended this article for setup guidelines.
- Use a shared DVC cache. Currently, many DVC users configure their cache on shared NFS. A similar setup that might help here is a single shared development server; check out our docs for a use case. (See the configuration sketch below for how to point DVC at a persistent cache location.)
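For either approach, pointing DVC's cache at the persistent location takes just a couple of commands (the mount point `/mnt/dvc-cache` is a placeholder):

```bash
# Use a persistent volume (e.g. the mounted EBS volume or an NFS share) as the cache
dvc cache dir /mnt/dvc-cache

# Settings commonly used for shared caches
dvc config cache.shared group     # make cache files group-writable
dvc config cache.type symlink     # link from the cache instead of copying
```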
As always, if you have any use case questions or need support, join us in Discord! Or head to the DVC Forum to discuss your ideas and best practices.