Organize Your Storage with DVC Cloud Versioning
Major cloud providers (AWS, Azure, Google) all have versioning capabilities, but they only version individual files. DVC can help you track the cloud version IDs across many files for all the datasets and models in your project.
If you use cloud storage regularly, you have probably seen it become a mess like this S3 bucket:
Luckily, major cloud storage providers can version files automatically. Still, even with versioning enabled, you may end up with a mess, and, more importantly, forget which version is which.
That's because versioning happens at the file level. There's no way to version a composite dataset or entire machine learning project.
This is where DVC can supplement cloud versioning and finally let you clean up your cloud storage. DVC records the versions of all the files in your dataset, so you have a complete snapshot of each point in time. You can store this record in Git alongside the rest of your project and use it to recover the data from that time, giving you the freedom to keep adding new data in place without fear of losing track of the old data. DVC ensures reproducibility while keeping everything organized between your Git repo and cloud storage, so you can focus on iterating on your machine learning project.
If you already use DVC, you might be familiar with data versioning and want to know what DVC cloud versioning means for you. Read the next section to get more familiar with cloud versioning generally or skip directly to the section for existing DVC users.
How cloud versioning works
With versioning enabled, whenever you save a file to the cloud, it will get a unique version ID. When you overwrite (or even delete) a file, the previous version remains accessible by referencing its version ID.
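For example, on Amazon S3 you could inspect and retrieve older versions of a single file with the AWS CLI (the bucket name and version ID below are only placeholders):
# list all stored versions of model.pt
$ aws s3api list-object-versions --bucket cloud-versioned-bucket --prefix model.pt
# download one specific version to a new local file
$ aws s3api get-object --bucket cloud-versioned-bucket --key model.pt \
    --version-id <version-id> model-previous.pt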
Here's the same data from above organized with cloud versioning:
Overwritten and deleted files may be recovered using their version IDs.
And here it is showing only the current versions:
Enabling versioning can keep your cloud storage organized by collapsing file versions.
Now the model versions are all collapsed under one file name and ordered by time, but what about the predictions folder? Let's assume this project trains a neural machine translation model, and each file in predictions is a predicted translation of a sentence. Each model iteration generates a new set of predictions. How can we reassemble the predictions from an earlier model version?
For a folder of many files, keeping track of versions becomes unrealistic.
How DVC works with cloud versioning
Cloud versioning falls short for tracking and syncing folders and projects, but this is where DVC can help. DVC records the version IDs of all files in your dataset or project. You keep this record in a Git repository so you can maintain snapshots of your cloud-versioned data (the data itself gets stored on the cloud, not in Git).
DVC connects multiple version IDs across a folder or project.
Before you start with DVC, ensure that your cloud storage is configured correctly. Cloud versioning must be enabled at the bucket or storage account level. See the Quickstart section below for instructions if versioning is not already enabled. You also need write access to the cloud storage (more info on how to configure your storage here).
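For example, on an S3 bucket (using the bucket name from the commands below), enabling and verifying versioning with the AWS CLI might look like this; Azure Blob Storage and Google Cloud Storage have equivalent settings:
# turn on object versioning for the bucket
$ aws s3api put-bucket-versioning --bucket cloud-versioned-bucket \
    --versioning-configuration Status=Enabled
# confirm it took effect
$ aws s3api get-bucket-versioning --bucket cloud-versioned-bucket
{
    "Status": "Enabled"
}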
To start using cloud versioning in DVC, install DVC and set up a version_aware remote inside a Git repo. A remote is the cloud storage location where you want to sync the data, and version_aware tells DVC to use cloud versioning.
$ dvc init
$ dvc remote add --default myremote s3://cloud-versioned-bucket/path
$ dvc remote modify myremote version_aware true
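After these commands, the remote configuration in .dvc/config should look roughly like this:
[core]
    remote = myremote
['remote "myremote"']
    url = s3://cloud-versioned-bucket/path
    version_aware = true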
Use dvc add to start tracking your model and predictions and dvc push to sync them to the cloud.
$ dvc add model.pt predictions
$ dvc push
11 files pushed
If you want to start tracking changes to an existing cloud dataset instead of starting from a local copy, see dvc import-url --version-aware.
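As a rough sketch (the source path here is hypothetical):
# track an existing cloud-versioned dataset in place
$ dvc import-url --version-aware s3://cloud-versioned-bucket/raw-data/predictions predictions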
DVC adds model.pt.dvc and predictions.dvc files with the version ID (and other metadata) of each file.
outs:
- path: predictions
files:
- relpath: 0.txt
md5: f163358b0b2b89281d6990e82495d6ca
size: 154
cloud:
myremote:
etag: f163358b0b2b89281d6990e82495d6ca
version_id: UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7
- relpath: 1.txt
md5: ec736fcb3b92886399f3577eac2163bb
size: 154
cloud:
myremote:
etag: ec736fcb3b92886399f3577eac2163bb
version_id: fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz
Next, track model.pt.dvc and predictions.dvc in Git.
$ git add model.pt.dvc predictions.dvc .gitignore
$ git commit -m "added and pushed model and predictions"
DVC will also make Git ignore model.pt and the predictions folder so that Git only tracks the metadata. For more info on the mechanics of how DVC works, see Versioning Data and Models.
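For example, after the dvc add above, the .gitignore in that directory should contain entries along these lines:
$ cat .gitignore
/model.pt
/predictions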
Now there is a versioned record of the model and predictions in Git commits, and we can revert to any of them without having to manually track version IDs. If someone else clones the Git repo, they can pull the exact versions pushed with that commit, even if those have been overwritten in cloud storage.
$ git clone git@github.com:iterative/myrepo
$ cd myrepo
$ dvc pull
A predictions/
A model.pt
2 files added and 11 files fetched
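To go back to an earlier snapshot, you can restore the .dvc files from an older commit and pull again (HEAD~1 here is just a placeholder for whichever commit you want):
# restore the metafiles as they were one commit ago
$ git checkout HEAD~1 -- model.pt.dvc predictions.dvc
# fetch the matching data versions from the cloud
$ dvc pull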
For existing DVC users
If you have versioning enabled on your cloud storage (or can enable it), you may wish to start using version_aware remotes to simplify the structure of your remote (or so you don't have to explain that structure to your colleagues). A version_aware remote is similar to the remotes you already use, except easier to read.
A traditional cache-like DVC remote looks like:
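Roughly speaking, the traditional layout is content-addressed: object names are derived from file hashes (the hashes below are the ones from the metafile example above), so you can't tell what anything is by browsing the bucket:
s3://cloud-versioned-bucket/path/ec/736fcb3b92886399f3577eac2163bb
s3://cloud-versioned-bucket/path/f1/63358b0b2b89281d6990e82495d6ca
...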
A cloud-versioned remote looks like:
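A version_aware remote instead keeps the original file names and directory structure, with the versions of each file collapsed behind a single object name:
s3://cloud-versioned-bucket/path/model.pt
s3://cloud-versioned-bucket/path/predictions/0.txt
s3://cloud-versioned-bucket/path/predictions/1.txt
...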
The other difference is that version IDs get added to the DVC metafiles during dvc push.
outs:
- path: predictions
files:
- relpath: 0.txt
md5: f163358b0b2b89281d6990e82495d6ca
size: 154
cloud:
myremote:
etag: f163358b0b2b89281d6990e82495d6ca
version_id: UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7
- relpath: 1.txt
md5: ec736fcb3b92886399f3577eac2163bb
size: 154
cloud:
myremote:
etag: ec736fcb3b92886399f3577eac2163bb
version_id: fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz
This means you need to be more careful about the order in which you dvc push and git commit. You should first dvc push and then git commit, since pushing will modify the DVC metafiles. This might seem odd, but it means you have a record in Git of what was pushed, so there is no more guessing whether you remembered to push.
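In practice, a typical update cycle with a version_aware remote looks like this (the commit message is just an example):
$ dvc add model.pt predictions
$ dvc push
$ git add model.pt.dvc predictions.dvc
$ git commit -m "Update model and predictions"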
Quickstart
You can start with DVC cloud versioning in 3 steps:
1. Check whether cloud versioning is enabled for your bucket/storage account, and enable it if it's not.
2. Set up DVC to use that bucket/container as cloud-versioned remote storage.
$ dvc init
$ dvc remote add --default myremote s3://cloud-versioned-bucket/path
$ dvc remote modify myremote version_aware true
3. Add and then push data.
$ dvc add model.pt predictions
$ dvc push
Stop messing around with backing up your cloud data! With cloud versioning in DVC, you can iterate on your data as much as you want without losing track of your changes or worrying about your storage growing into an unmanageable mess.
Special thanks to Peter Rowlands for leading the development of this new capability!