Discord gems
Here are some Q&A's from our Discord channel that we think are worth sharing.
Q: How can I view and download files that are being tracked by DVC in a repository?
To list the files that are currently being tracked in a project repository by
DVC and Git, you can use dvc list. This will display the contents of that
repository, including .dvc files. To download the contents corresponding to a
particular .dvc file, use dvc get.
Let's consider an example using both commands. Assume we're working with DVC's data registry example repository. To list the files present, run:
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
...
Note the -R flag, which enables dvc list to display the contents of
directories inside the repository. Now assume you want to download data.xml,
which we can see is being tracked by DVC. To download the dataset to your local
workspace, you would then run:
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml
For more examples and information, see the documentation for dvc list and
dvc get.
Q: I'm setting up cloud remote storage for DVC and I'd like to forbid dvc gc --cloud so users can't accidentally delete files in the remote. Will it be sufficient to restrict deletion in the remote's settings?
You're right to be careful, because dvc gc --cloud can be dangerous in the
wrong hands: it'll remove any unused files in your remote (for more info,
see our docs). To prevent users from having this power, setting your bucket
policy to block object deletions should do the trick. How to do this will
depend on your cloud storage provider; we found some relevant docs for GCP,
S3, and Azure.
For the full list of supported remote storage types, see here.
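As one illustration for S3, a bucket policy along the following lines denies object deletion. This is a sketch rather than a vetted security configuration: the bucket name example-dvc-remote and the file name deny-delete-policy.json are placeholders, and because the statement applies to every principal, a real policy would probably need an exception for whoever administers the bucket.

```shell
# Write a bucket policy that denies object deletion.
# "example-dvc-remote" is a placeholder bucket name, not a real remote.
cat > deny-delete-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyObjectDeletion",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::example-dvc-remote/*"
    }
  ]
}
EOF

# Applying the policy needs the AWS CLI and suitable permissions,
# so the command is left commented out here:
# aws s3api put-bucket-policy --bucket example-dvc-remote --policy file://deny-delete-policy.json
```

With a policy like this in place, we would expect dvc gc --cloud to fail with an access error instead of removing objects, but it is worth confirming that on a test bucket first.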
Q: My team is interested in DVC, and we have all of our data in remote storage. Do we need to install a centralised enterprise version of DVC on a dedicated server? And do we have to also have a GitHub repository?
There's no need for a DVC server. Our remote storage works on top of most kinds of cloud storage by default, including S3, GCP, Azure, Google Drive, and Aliyun, with no additional infrastructure required. As for GitHub (or Bitbucket, or GitLab, etc.), this is only needed if you're interested in sharing your project with others over that channel. We like sharing projects on GitHub, but you don't have to. Any Git repository, even a local one, will do.
So a "minimal" DVC project for you might consist of a local workspace with Git enabled (which you do need), a local Git repository, and your S3 remote storage. Check out our use cases to see some examples of infrastructure and workflow for teams.
Q: Could there be any issues with concurrent dvc push-es to the same remote?
There are a few ways for concurrency to occur: multiple jobs running in parallel on the same machine, or different users on different machines. But in any case, the answer is the same: there's nothing to worry about! When pushing a file to a DVC remote, all operations are non-destructive and atomic.
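To give a feel for why concurrent pushes of this kind are safe, here is a small local simulation. It is a sketch under stated assumptions, not DVC's actual upload code: the push_object helper and the fake-remote directory are invented for illustration, and md5sum is assumed to be installed (as on most Linux systems). Each object is named after the hash of its content, written to a temporary file first, and then renamed into place, so two writers pushing the same data can only ever produce the same final file.

```shell
# Simulate a remote as a local directory (hypothetical name).
remote="$(mktemp -d)/fake-remote"
mkdir -p "$remote"

# push_object FILE: hypothetical helper, not a DVC command.
push_object() {
  src="$1"
  # Name the object after the hash of its content.
  hash="$(md5sum "$src" | cut -d' ' -f1)"
  dest="$remote/$hash"
  # Write to a temporary name first, so readers never see a partial object.
  tmp="$(mktemp "$remote/.tmp.XXXXXX")"
  cp "$src" "$tmp"
  # A rename within one filesystem is atomic; if an identical object
  # already exists, replacing it changes nothing.
  mv -f "$tmp" "$dest"
}

workdir="$(mktemp -d)"
printf 'some training data\n' > "$workdir/data.csv"

# Two "users" push the same file at the same time.
push_object "$workdir/data.csv" &
push_object "$workdir/data.csv" &
wait

# Count the objects that ended up in the simulated remote.
object_count="$(ls "$remote" | wc -l | tr -d ' ')"
echo "objects in remote: $object_count"
```

Both pushes finish and the simulated remote holds a single object, because the two writers agree on both the name and the content of what they are storing.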
Q: How do I only download part of my remote repository? For example, I only need the final output of my pipeline, not the raw data or intermediate steps.
We support granular operations on DVC project repositories! Say your project
contains several .dvc files corresponding to different stages of your
pipeline: 0_process_data.dvc, 1_split_test_train.dvc, and 2_train_model.dvc.
If you're only interested in the files output by the final stage of the
pipeline (2_train_model.dvc), you can run:
$ dvc pull 2_train_model.dvc
You can also use dvc pull at the level of individual files. This might be
needed if your DVC pipeline file creates 10 outputs, for example, and you only
want to pull one (say, model.pkl, your trained model) from remote DVC storage.
You'd simply run:
$ dvc pull model.pkl
Q: How can I remove a .dvc file, but keep the associated files in my workspace?
Sometimes, you realize you don't want to put a file under DVC tracking after
all. That's okay, and easy to fix. Simply remove the .dvc file like any other
file: rm <file>.dvc. DVC will then stop tracking the file, and the associated
target file will still be in your local workspace. Note that the file will
still be in your DVC cache unless you clear it with dvc gc.
Q: I'm trying to move a stage file with dvc move, but I'm getting an error. What's going on?
The dvc move command is used to rename a file or directory and simultaneously
modify its corresponding DVC file. It's handy so you don't rename a file in your
local workspace that's under DVC tracking without updating DVC to reflect the
change (see an example here).
The command doesn't work on "stage files" from DVC pipelines. There's not
currently an easy way to safely move dvc.yaml files, and it's an open issue
we're working on. Until then, you can manually update dvc.yaml, or make a new
one in the desired location.
Q: I just started using DVC and noticed that when I dvc push files to remote cloud storage, the directory in my remote looks like my DVC cache, not my local workspace directory. Is this right?
Yep, that's exactly how it should be! In order to provide deduplication and some
other optimizations, your DVC remote's directory structure will mirror the DVC
cache (which is by default in your local workspace under .dvc/cache).
Effectively, DVC uses your Git repository to store DVC files, which are keys for
cache files on your remote. So looking inside your remote won't be particularly
enlightening if you're looking for human-readable filenames: the file names will
look like hashes (because, well, they are). Luckily, DVC handles all the
conversions between the filenames in your local workspace and these hashes.
To get some more intuition about this, check out some of our docs about how DVC organizes files.
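As a rough illustration of the mapping between workspace file names and hashes, the snippet below derives a cache-style path from a file's MD5 hash: the first two hex characters become a directory and the rest becomes the file name. Treat the layout as an assumption, since the exact directory structure varies between DVC versions. The cache_path helper is hypothetical, not a DVC command, and the snippet assumes md5sum is installed.

```shell
# cache_path FILE: hypothetical helper that prints a content-addressed path.
cache_path() {
  hash="$(md5sum "$1" | cut -d' ' -f1)"
  prefix="$(printf '%s' "$hash" | cut -c1-2)"   # first two hex characters
  rest="$(printf '%s' "$hash" | cut -c3-)"      # remaining characters
  printf '%s/%s\n' "$prefix" "$rest"
}

tmp="$(mktemp -d)"
printf 'hello\n' > "$tmp/a.txt"
printf 'hello\n' > "$tmp/copy_of_a.txt"   # same content, different name
printf 'goodbye\n' > "$tmp/b.txt"

path_a="$(cache_path "$tmp/a.txt")"
path_copy="$(cache_path "$tmp/copy_of_a.txt")"
path_b="$(cache_path "$tmp/b.txt")"

echo "a.txt         -> $path_a"
echo "copy_of_a.txt -> $path_copy"
echo "b.txt         -> $path_b"
```

Files with identical content map to the same path, which is where the deduplication comes from, while the original file names appear nowhere in the path.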