Using Azure Blob Storage in DVC
When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to switch between datasets quickly and easily. That’s why we’re going to do a quick walkthrough of how to set up a remote with Azure Blob Storage and handle data versioning with DVC.
We’ll start by creating a new blob storage container in our Azure account, then we’ll show how you can add DVC to your project. We’ll be working with this repo if you want an example to play with.
By the time you finish, you should be able to create this setup for any machine learning project using an Azure remote.
Set up an Azure blob storage container
Make sure that you already have a Microsoft Azure account. When you log in, you should see a page like this.
Search for storage accounts in the search bar and click Storage accounts under Services. Make sure you don't click the "classic" option.

This will bring you to the Storage accounts page, where you'll need to click the Create storage account button.
Now you need to enter a Resource group and a name for the account. You can create a new resource group right here, like we do, and call it BicycleProject. We'll name this storage account bicycleproject. Then you can leave all the default settings in place and click Review + create.

Azure will run validation on the account, and then you'll be able to click Create to generate the storage account.
You'll get redirected to a new page, where you should click the Go to resource button. Now you should see all of the details for your storage account. In the left sidebar, go to Data storage > Containers.

Then click the + Container button at the top of the new page and a right sidebar will open. In the name field, type bikedata and then click Create.
Now we have everything set up for the blob storage to work.
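If you'd rather script these steps than click through the portal, the same resources can be created with the Azure CLI. Here's a minimal sketch using the names from this walkthrough (pick whichever region suits you; as the account owner, the CLI can look up access keys for the container step):

$ az group create --name BicycleProject --location eastus
$ az storage account create --name bicycleproject --resource-group BicycleProject
$ az storage container create --name bikedata --account-name bicycleproject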
Set the right roles for your Azure account
You'll need the right roles on your storage account and your container in order to connect this remote storage to your machine learning project.
On the page for your bicycleproject storage account, go to Access Control (IAM) in the left sidebar.

On this page, click Add role assignment and you'll be directed to the page with all of the roles.
Select the Storage Blob Data Contributor role and click Next. Then you can click + Select members to add this role to your user.
You'll also need to go through this exact flow for your bikedata container, so make sure you do this immediately after your storage account is updated.
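The role assignment can also be done from the CLI. A sketch, with the user email and subscription ID as placeholders for your own values:

$ az role assignment create \
    --assignee "you@example.com" \
    --role "Storage Blob Data Contributor" \
    --scope "/subscriptions/<subscription-id>/resourceGroups/BicycleProject/providers/Microsoft.Storage/storageAccounts/bicycleproject"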
Since our Azure storage account and container have the correct roles now, let's set up the project!
Set up a DVC project
First, add DVC as a requirement to your project with the following installation command:
$ pip install 'dvc[azure]'
Then you can initialize DVC in your own project with the following command:
$ dvc init
This will add all of the DVC internals needed to start versioning your data and tracking experiments.
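If your project is already a Git repository, it's a good idea to commit those internals right away so everyone who clones the repo gets the same DVC configuration:

$ git add .dvc .dvcignore
$ git commit -m "Initialize DVC"

Now we need to set up the remote to connect our project data stored in Azure to the DVC repo.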
Create a default remote
Now we can add a default remote to the project with the following command:
$ dvc remote add -d bikes azure://bikedata
This creates a default remote called bikes that connects to the bikedata container we made earlier, which is where the training data for the model will be stored.
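Under the hood, this command writes the remote to .dvc/config, which should now contain something like this:

[core]
    remote = bikes
['remote "bikes"']
    url = azure://bikedata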
Add Azure credentials
In order for DVC to be able to push and pull data from the remote, you need to have valid Azure credentials.
By default, DVC authenticates using your Azure CLI configuration.
Run the following command to authenticate with Azure:
$ az login
A web browser has been opened at https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize. Please continue the login in the web browser. If no web browser is available or if the web browser fails to open, use device code flow with `az login --use-device-code`.
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "some-id",
    "id": "some-id",
    "isDefault": true,
    "managedByTenants": [],
    "name": "Azure subscription 1",
    "state": "Enabled",
    "tenantId": "some-id",
    "user": {
      "name": "[email protected]",
      "type": "user"
    }
  }
]
This should open a browser window where you can enter your login credentials.
You can find more details on this command in the Azure docs. If you want to use a different authentication method with DVC, check out the DVC docs.
You will also need to manually define the storage account name with the following command:
$ dvc remote modify bikes account_name 'bicycleproject'
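If interactive az login isn't an option, say on a CI runner, DVC can also take explicit credentials such as a storage account key. Use the --local flag so the secret is written to .dvc/config.local, which DVC keeps out of Git (the key below is a placeholder):

$ dvc remote modify --local bikes account_key 'mysecretaccountkey'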
Push and pull data with DVC
Now you can push data from your local machine to the Azure remote! First, add the data you want DVC to track with the following command:
$ dvc add data
This will allow DVC to track the entire data directory, so it will note when any changes are made. Then you can push that data to your Azure remote with this command:
$ dvc push
Here's what the data might look like in your Azure container.
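Note that dvc add also generated a small data.dvc metafile and added the data directory to .gitignore. Commit both to Git so this version of the dataset travels with your code:

$ git add data.dvc .gitignore
$ git commit -m "Track data directory with DVC"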
Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo and running:
$ dvc pull
This will download the data from your remote to your local machine.

Authentication has to be set up locally on any machine you need to push or pull data from. That means running the az login command on any other machine. You don't need to go through the DVC setup again.
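Putting it all together, the flow on a fresh machine looks something like this, with the repo URL as a placeholder:

$ git clone <your-repo-url>
$ cd <your-repo>
$ pip install 'dvc[azure]'
$ az login
$ dvc pull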
That’s it! Now you can connect any DVC project to an Azure blob storage container. If you run into any issues, make sure your credentials are valid, check whether your user has MFA enabled, and confirm that the user has the right level of permissions.