DVC 2.0 Release
Today is DVC 2.0 release day! Watch a video from DVC-team when we explain the new features and read more details in this blog post. New features: 🧪 Lightweight ML experiments 📍 ML model checkpoints versioning 📈 Dvc-live - new open-source library for metrics logging 🔗 ML pipeline templating and iterative foreach-stages 🤖 CML - new way to get GPU/CPU in clouds and GitHub Actions support
DVC 2.0 Release
TL;DR; video
What is new in DVC 2.0?
We have been working on DVC for almost 4 years. In the previous versions, we have built a great foundation on versioning data, code and ML models that helps make your ML projects reproducible.
With the 2.0 release, we are going deeper into machine learning and deep learning scenarios such as experiment management, ML model checkpoints and ML metrics logging. These scenarios are widely adopted by ML practitioners and instrumented with custom tools or external frameworks and SaaS services. Our vision is to make the ML experimentation experience distributed (like Git) and independent of external SaaS platforms, and to introduce proper data and model management to ML experiments.
⚠️ DVC 2.0 is the first release with ML experements, which is still in experimentation mode (yeah, experiments in experimentation mode 😅), so the API might change a bit in the following releases.
ML pipelines parametrization is another big improvement in DVC 2.0. This was the most requested feature during the last year. We are introducing variables in pipelines as well as foreach-stages. This is a significant improvement for users who work on multi-stages ML projects, which is very common for NLP projects.
A better CPU/GPU resource allocation is another important direction for DVC. Together with DVC 2.0 we are releasing new version 0.3 of CML (CI/CD for ML). It aims to hide all complexity of clouds from data scientists and ML engineers. We developed a brand new Iterative Terraform Provider to reach this goal and simplify the end-user experience. In future releases, we expect DVC to use this Terraform provider to access cloud resources directly.
The last but not least important part - we made the new release with minimum breaking changes to our API. That makes migration to DVC 2.0 smooth and low-risk.
Install
The new version is generally available!
Install DVC 2.0 through OS packages or as Python library:
$ pip install --upgrade dvc
CML is pre-installed in the CML docker containers (e.g.
iterativeai/cml:0-dvc2-base1
) and also available as an NPM package:
$ npm i -g @dvcorg/cml
Lightweight ML experiments
DVC uses Git versioning as the basis for ML experiments. This solid foundation makes each experiment reproducible and accessible from the project's history. This Git-based approach works very well for ML projects with mature models when only a few new experiments per day are run.
However, in more active development, when dozens or hundreds of experiments need
to be run in a single day, Git creates overhead — each experiment run requires
additional Git commands git add/commit
, and comparing all experiments is
difficult.
We are introducing lightweight experiments in DVC 2.0! This is how you can auto-track ML experiments without any overhead.
⚠️ Note, our new ML experiment features (dvc exp
) are experimental. This means
that the commands might change a bit in the following minor releases.
dvc exp run
can run an ML experiment with a new hyperparameter from
params.yaml
while dvc exp diff
shows metrics and params difference:
$ dvc exp run --set-param featurize.max_features=3000
Reproduced experiment(s): exp-bb55c
Experiment results have been applied to your workspace.
$ dvc exp diff
Path Metric Value Change
scores.json auc 0.57462 0.0072197
Path Param Value Change
params.yaml featurize.max_features 3000 1500
More experiments:
$ dvc exp run --set-param featurize.max_features=4000
Reproduced experiment(s): exp-9bf22
Experiment results have been applied to your workspace.
$ dvc exp run --set-param featurize.max_features=5000
Reproduced experiment(s): exp-63ee0
Experiment results have been applied to your workspace.
$ dvc exp run --set-param featurize.max_features=5000 \
--set-param featurize.ngrams=3
Reproduced experiment(s): exp-80655
Experiment results have been applied to your workspace.
In the examples above, hyperparameters were changed with the --set-param
option, but you can make these changes by modifying the params file instead. In
fact any code can be changed and dvc exp run
will capture the variations.
See all the runs:
$ dvc exp show --no-pager --no-timestamp \
--include-params featurize.max_features,featurize.ngrams
─────────────────────────────────────────────────────────────────────
Experiment auc featurize.max_features
featurize.ngrams ─────────────────────────────────────────────────────────────────────
workspace 0.56359 5000 3
master 0.5674 1500 2
├── exp-80655 0.56359 5000 3
├── exp-63ee0 0.5515 5000 2
├── exp-9bf22 0.56448 4000 2
└── exp-bb55c 0.57462 3000 2
─────────────────────────────────────────────────────────────────────
Under the hood, DVC uses Git to store the experiments' meta-information. A
straight-forward implementation would create visible branches and auto-commit in
them, but that approach would over-pollute the branch namespace very quickly. To
avoid this issue, we introduced custom Git references exps
, the same way as
GitHub uses custom references pulls
to track pull requests (this is an
interesting technical topic that deserves a separate blog post). Below you can
see how it works.
No artificial branches, only custom references exps
(do not worry if you don't
understand this part - it is an implementation detail):
$ git branch
* master
$ git show-ref
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22
The best experiment can be promoted to the workspace and committed to Git.
$ dvc exp apply exp-bb55c
$ git add .
$ git commit -m 'optimize max feature size'
Alternatively, an experiment can be promoted to a branch (big_fr_size
branch
in this case):
$ dvc exp branch exp-80655 big_fr_size
Git branch 'big_fr_size' has been created from experiment 'exp-c695f'.
To switch to the new branch run:
git checkout big_fr_size
Remove all the experiments that were not used:
$ dvc exp gc --workspace --force
ML model checkpoints versioning
ML model checkpoints are an essential part of deep learning. ML engineers prefer to save the model files (or weights) at checkpoints during a training process and return back when metrics start diverging or learning is not fast enough.
The checkpoints create a different dynamics around ML modeling process and need a special support from the toolset:
- Track and save model checkpoints (DVC outputs) periodically, not only the final result or training epoch.
- Save metrics corresponding to each of the checkpoints.
- Reuse checkpoints - warm-start training with an existing model file, corresponding code, dataset version and metrics.
This new behavior is supported in DVC 2.0. Now, DVC can version all your checkpoints with corresponding code and data. It brings the reproducibility of DL processes to the next level - every checkpoint is reproducible.
This is how you define checkpoints with live-metrics:
$ dvc stage add -n train \
-d users.csv -d train.py \
-p dropout,epochs,lr,process \
--checkpoint model.h5 \
--live logs \
python train.py
Creating 'dvc.yaml'
Adding stage 'train' in 'dvc.yaml'
Note, we use dvc stage add
command instead of dvc run
. Starting from DVC 2.0
we begin extracting all stage specific functionality under dvc stage
umbrella.
dvc run
is still working, but will be deprecated in the following major DVC
version (most likely in 3.0).
Start the training process and interrupt it after 5 epochs:
$ dvc exp run
'users.csv.dvc' didn't change, skipping
Running stage 'train':
> python train.py
...
^CTraceback (most recent call last):
...
KeyboardInterrupt
Navigate in checkpoints:
$ dvc exp show --no-pager --no-timestamp
──────────────────────────────────────────────────────────────────────
Experiment step loss accuracy val_loss … epochs
… ──────────────────────────────────────────────────────────────────────
workspace 4 2.0702 0.30388 2.025 … 5 …
master - - - - … 5 …
│ ╓ exp-e15bc 4 2.0702 0.30388 2.025 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
──────────────────────────────────────────────────────────────────────
Each of the checkpoints above is a separate experiment with all data, code,
paramaters and metrics. You can use the same dvc exp apply
command to extract
any of these.
Another run continues this process. You can see how accuracy metrics are increasing - DVC does not remove the model/checkpoint and training code trains on top of it:
$ dvc exp run
Existing checkpoint experiment 'exp-e15bc' will be resumed
...
^C
KeyboardInterrupt
$ dvc exp show --no-pager --no-timestamp
──────────────────────────────────────────────────────────────────────
Experiment step loss accuracy val_loss … epochs
… ──────────────────────────────────────────────────────────────────────
workspace 9 1.7845 0.58125 1.7381 … 5 …
master - - - - … 5 …
│ ╓ exp-e15bc 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 …
│ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 …
│ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 …
│ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
──────────────────────────────────────────────────────────────────────
After modifying the code, data, or params, the same process can be resumed. DVC
recognizes the change and shows it (see experiment b363267
):
$ vi train.py # modify code
$ vi params.yaml # modify params
$ dvc exp run
Modified checkpoint experiment based on 'exp-e15bc' will be created
...
$ dvc exp show --no-pager --no-timestamp
──────────────────────────────────────────────────────────────────────────────
Experiment step loss accuracy val_loss … epochs
… ──────────────────────────────────────────────────────────────────────────────
workspace 13 1.5841 0.69262 1.5381 … 15 …
master - - - - … 5 …
│ ╓ exp-7ff06 13 1.5841 0.69262 1.5381 … 15 …
│ ╟ 6c62fec 12 1.6325 0.67248 1.5857 … 15 …
│ ╟ 4baca3c 11 1.6817 0.64855 1.6349 … 15 …
│ ╟ b363267 (2b06de7) 10 1.7323 0.61925 1.6857 … 15 …
│ ╓ 2b06de7 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 …
│ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 …
│ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 …
│ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 …
│ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 …
│ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 …
│ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 …
│ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 …
│ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 …
├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 …
──────────────────────────────────────────────────────────────────────────────
Sometimes you might need to train the model from scratch. The reset option
removes the checkpoint file before training: dvc exp run --reset
.
Metrics logging
Continuously logging ML metrics is a very common practice in the ML world. Instead of a simple command-line output with the metrics values, many ML engineers prefer visuals and plots. These plots can be organized in a "database" of ML experiments to keep track of a project. There are many special solutions for metrics collecting and experiment tracking such as sacred, mlflow, weight and biases, neptune.ai, or others.
With DVC 2.0, we are releasing a new open-source library
DVC-Live that provides functionality for
tracking model metrics and organizing metrics in simple text files in a way that
DVC can visualize the metrics with navigation in Git history. So, DVC can show
you a metrics difference between the current model and a model in master
or
any other branch.
This approach is similar to the other metrics tracking tools with the difference that Git becomes a "database" or of ML experiments.
Generate metrics file
Install the library:
$ pip install dvclive
Instrument your code:
import dvclive
from dvclive.keras import DvcLiveCallback
dvclive.init("logs") #, summarize=True)
...
model.fit(...
# Set up DVC-Live callback:
callbacks=[ DvcLiveCallback() ]
)
During the training you will see the metrics files that are continuously populated each epochs:
$ ls logs/
accuracy.tsv loss.tsv val_accuracy.tsv val_loss.tsv
$ head logs/accuracy.tsv
timestamp step accuracy
1613645582716 0 0.7360000014305115
1613645585478 1 0.8349999785423279
1613645587322 2 0.8830000162124634
1613645589125 3 0.9049999713897705
1613645590891 4 0.9070000052452087
1613645592681 5 0.9279999732971191
1613645594490 6 0.9430000185966492
1613645596232 7 0.9369999766349792
1613645598034 8 0.9430000185966492
In addition to the continuous metrics files, you will see the summary metrics file and HTML file with the same file prefix. The summary file contains the result of the latest epoch:
$ cat logs.json | python -m json.tool
{
"step": 41,
"loss": 0.015958430245518684,
"accuracy": 0.9950000047683716,
"val_loss": 13.705962181091309,
"val_accuracy": 0.5149999856948853
}
The HTML file contains all the visuals for continuous metrics as well as the summary metrics on a single page:
Note, the HTML and the summary metrics files are generating automatically for each. So, you can monitor model performance in realtime.
Git-navigation with the metrics file
DVC repository is NOT required to use the live metrics functionality from the above. It works independently from DVC.
DVC repository becomes useful when the metrics and plots are committed in your Git repository, and you need navigation around the metrics.
Metrics difference between workspace and the last Git commit:
$ git status -s
M logs.json
M logs/accuracy.tsv
M logs/loss.tsv
M logs/val_accuracy.tsv
M logs/val_loss.tsv
M train.py
?? model.h5
$ dvc metrics diff --target logs.json
Path Metric Old New Change
logs.json accuracy 0.995 0.99 -0.005
logs.json loss 0.01596 0.03036 0.0144
logs.json step 41 36 -5
logs.json val_accuracy 0.515 0.5175 0.0025
logs.json val_loss 13.70596 3.29033 -10.41563
The difference between a particular commit/branch/tag or between two commits:
$ dvc metrics diff --target logs.json HEAD^ 47b85c
Path Metric Old New Change
logs.json accuracy 0.995 0.998 0.003
logs.json loss 0.01596 0.01951 0.00355
logs.json step 41 82 41
logs.json val_accuracy 0.515 0.51 -0.005
logs.json val_loss 13.70596 5.83056 -7.8754
The same Git-navigation works with the plots:
$ dvc plots diff --target logs
file:///Users/dmitry/src/exp-dc/plots.html
Another nice thing about the live metrics - they work across ML experiments and
checkpoints, if properly set up in dvc stages. To set up live metrics, you need
to specify the metrics directory in the live
section of a stage:
stages:
train:
cmd: python train.py
live:
logs:
cache: false
summary: true
report: true
deps:
- data
ML pipelines parameterization and foreach stages
After introducing the multi-stage pipeline file dvc.yaml
, it was quickly
adopted among our users. The DVC team got tons of positive feedback from them,
as well as feature requests.
Pipeline parameters from vars
The most requested feature was the ability to use parameters in dvc.yaml
. For
example. So, you can pass the same seed value or filename to multiple stages in
the pipeline.
vars:
- train_matrix: train.pkl
- test_matrix: test.pkl
- seed: 20210215
...
stages:
process:
cmd: python process.py \
--seed ${seed} \
--train ${train_matrix} \
--test ${test_matrix}
outs:
- ${test_matrix}
- ${train_matrix}
...
train:
cmd: python train.py ${train_matrix} --seed ${seed}
deps:
- ${train_matrix}
Also, it gives an ability to localize all the important parameters in a single
vars
block and play with them. This is a natural thing to do for scenarios
like NLP or when hyperparameter optimization is happening not only in the model
training code but in the data processing as well.
Pipeline parameters from params files
It is quite common to define pipeline parameters in a config file or a
parameters file (like params.yaml
) instead of in the pipeline file dvc.yaml
itself. These parameters defined in params.yaml
can also be used in
dvc.yaml
.
# params.yaml
models:
us:
thresh: 10
filename: 'model-us.hdf5'
# dvc.yaml
stages:
build-us:
cmd: >-
python script.py
--out ${models.us.filename}
--thresh ${models.us.thresh}
outs:
- ${models.us.filename}
DVC properly tracks params dependencies for each stage starting from the
previous DVC version 1.0. See the
--params
option
of dvc run
for more details.
Iterating over params with foreach stages
Iterating over params was a frequently requested feature. Now users can define multiple similar stages with a templatized command.
stages:
build:
foreach:
gb:
thresh: 15
filename: 'model-gb.hdf5'
us:
thresh: 10
filename: 'model-us.hdf5'
do:
cmd: >-
python script.py --out ${item.filename} --thresh ${item.thresh}
outs:
- ${item.filename}
New method to provision cloud compute in new CML release
We are releasing new CML release 0.3 together with DVC 2.0. We developed a brand
new CML command cml runner
that hides much of the complexity of configuring
and provisioning an instance, keeping your workflows free of bash scripting
clutter.
The new approach uses our new Iterative Terraform Provider under the hood instead of Docker Machine, as in the first version of CML.
This example workflow to launch an EC2 instance from a GitHub Action workflow and then train a model. We hope you'll agree it's shorter, sweeter, and more powerful than ever!
name: 'Train in the cloud'
on: [push]
jobs:
deploy-runner:
runs-on: [ubuntu-latest]
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- name: deploy
shell: bash
env:
repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
run: |
cml runner \
--cloud aws \
--cloud-region us-west \
--cloud-type=t2.micro \
--labels=cml-runner
train-model:
needs: deploy-runner
runs-on: [self-hosted, cml-runner]
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: 'Train my model'
run: |
pip install -r requirements.txt
python train.py
You'll get a pull request that looks something like this:
All the code to replicate this example is up on a brand new demo repository.
Please find more details in the CML 0.3 pre-release blog post or in the CML website.
GitHub Actions in new CML release
One more thing: you might've noticed in our example workflow above that there's a new CML GitHub Action! The new Action helps you setup CML, giving you one more way to mix and match the CML suite of functions with your preferred environment.
The new Action is designed to be a straightforward, all-in-one install that
gives you immediate use of functions like cml publish
and cml runner
. You'll
add this step to your workflow:
steps:
- uses: actions/checkout@v2
- uses: iterative/setup-cml@v1
The same way you can reference DVC as a GitHub Action:
steps:
- uses: actions/checkout@v2
- uses: iterative/dvc-action@v1
Breaking changes
We put a lot of efforts to make this release with very minimum amount of breaking changes to simplify migration to the new version for the users:
- Dropped support for external outputs in Google Cloud Storage and changed the default checksum from md5 to etag.
- Dropped support for login with p12 files on service authentication for Google Drive.
- Stages without dependencies will not always run as if changed. Instead, use
--always-changed
. - Environment variables inside the cmd of a stage using
${VAR}
syntax must be escaped as\${VAR}
in 2.0 due to the use of${}
syntax for templating.
Thank you!
Thank you to all DVC users and community members for the help. Please try out the new DVC and CML releases and do not get lost in your ML experiments!