Training and saving models with CML on a self-hosted AWS EC2 runner (part 1)
We can use CML to automatically retrain models whenever data, model code, or parameters change. In this guide we show how to create a pipeline that provisions an AWS EC2 instance to retrain a model and save the output on a regular basis. This way we can prevent drift by ensuring that our model always uses the latest input data.
When you first develop a machine learning model, you will probably do so on your local machine. You can easily change algorithms, parameters, and input data right in your text editor, notebook, or terminal. Now imagine you have a long-running model for which you want to detect possible drift. In that case it would be beneficial to retrain your model automatically on a regular basis.
In this guide, we will show how you can use CML (Continuous Machine Learning) to do just that. CML is an open-source library for implementing continuous integration and delivery (CI/CD) in machine learning projects. This way we can define a pipeline to train a model and keep track of various versions. Although we could do so directly in our CI/CD pipeline (e.g. GitHub Actions Workflows), the runners used for this generally don’t have a lot of processing power. Therefore it makes more sense to provision a dedicated runner that is tailored to our computing needs.
At the end of this guide we will have set up a CML workflow that does the following on a daily basis:
- Provision an Amazon Web Services (AWS) EC2 instance
- Train the model
- Save the model and its metrics to a GitHub repository
- Create a pull request with the new outputs
- Terminate the AWS EC2 instance
In a follow-up post we will expand upon this by using DVC to designate a remote storage for our resulting models. But let's focus on CML first!
All files needed for this guide can be found in this repository.
This guide can be followed on its own, but also as an extension to this example in the docs.
We will be using GitHub for our CI/CD and AWS for our computing resources. With slight modifications, however, you can use GitLab CI/CD, Google Cloud, or Microsoft Azure.
Prerequisites
Before we begin, make sure you have the following things set up:
- You have created an AWS account (free tier suffices)
- You have created a `PERSONAL_ACCESS_TOKEN` on GitHub with the `repo` scope
- You have created an `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` on AWS
- You have added the `PERSONAL_ACCESS_TOKEN`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` as GitHub secrets
It also helps to clone the template repository for this tutorial.
Training a model and saving it
To kick off, we will adapt `train.py` from the CML getting started guide. Here we create a simple `RandomForestClassifier()` based on some generated data. We then use the model to make some predictions and plot those predictions in a confusion matrix.
While running the script the model is kept in memory, meaning it is discarded as soon as the script finishes. In order to save the model for later, we need to dump it as a binary file. We do so with `joblib.dump()`. Later we can read the model using `joblib.load()` when we need to.
You can also use `pickle.dump()` if you prefer.
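To illustrate the round trip, here is a minimal sketch of loading the saved model back in and using it for predictions. The `model/random_forest.joblib` path matches the output of `train.py` below; reusing the test features here is just a placeholder for whatever new data you want to score:

```python
import joblib
import numpy as np

# Load the model that train.py saved earlier
clf = joblib.load("model/random_forest.joblib")

# Placeholder input: any data with the same number of features as the training set
X_new = np.genfromtxt("data/test_features.csv")
print(clf.predict(X_new)[:10])
```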
The outputs of `train.py` are:
- `metrics.txt`: a file containing metrics on model performance (in this case accuracy)
- `confusion_matrix.png`: a plot showing the classification results of our model
- `random_forest.joblib`: the binary output of the trained model
All of these files are saved to the `model` directory.
import json
import os
import joblib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import plot_confusion_matrix
# Read in data
X_train = np.genfromtxt("data/train_features.csv")
y_train = np.genfromtxt("data/train_labels.csv")
X_test = np.genfromtxt("data/test_features.csv")
y_test = np.genfromtxt("data/test_labels.csv")
# Fit a model
depth = 5
clf = RandomForestClassifier(max_depth=depth)
clf.fit(X_train, y_train)
# Calculate accuracy
acc = clf.score(X_test, y_test)
print(acc)
# Create model folder if it does not yet exist
if not os.path.exists("model"):
    os.makedirs("model")
# Write metrics to file
with open("model/metrics.txt", "w+") as outfile:
    outfile.write("Accuracy: " + str(acc) + "\n")
# Plot confusion matrix
disp = plot_confusion_matrix(clf, X_test, y_test, normalize="true", cmap=plt.cm.Blues)
plt.savefig("model/confusion_matrix.png")
# Save the model
joblib.dump(clf, "model/random_forest.joblib")
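Note that `train.py` expects the CSV files in the `data/` directory to exist already. In our workflow (shown below) they are created by a separate `get_data.py` script; its exact contents live in the template repository. A hypothetical minimal version that generates a synthetic classification dataset could look like this:

```python
import os

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset as a stand-in for real input data
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Write the splits to the data/ directory so train.py can read them
os.makedirs("data", exist_ok=True)
np.savetxt("data/train_features.csv", X_train)
np.savetxt("data/train_labels.csv", y_train)
np.savetxt("data/test_features.csv", X_test)
np.savetxt("data/test_labels.csv", y_test)
```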
Train the model on a daily basis
Now that we have a script to train our model and save it as a file, let’s set up our CI/CD to provision a runner and run the script. We define our workflow in `cml.yaml` and save it in the `.github/workflows` directory. This way GitHub will automatically run the workflow whenever it is triggered. In this case the triggers are a manual request as well as a daily (automatic) schedule.
The name of the workflow doesn’t matter, as long as it’s a `.yaml` file located in the `.github/workflows` directory. You can have multiple workflows in there as well. You can learn more in the documentation here.
name: CML
on: # Here we use two triggers: manually and daily at 08:00
  workflow_dispatch:
  schedule:
    - cron: '0 8 * * *'
jobs:
  deploy-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner \
            --cloud=aws \
            --cloud-region=eu-west \
            --cloud-type=t2.micro \
            --labels=cml-runner \
            --single
  train-model:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    timeout-minutes: 120 # 2h
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: iterative/setup-cml@v1
      - name: Train model
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          cml ci
          pip install -r requirements.txt
          python get_data.py
          python train.py
In this example we are using a `t2.micro` AWS EC2 instance. At the time of writing, this is included in the AWS free tier. Make sure that you qualify for this free usage to prevent unexpected spending. When you specify a bulkier `cloud-type`, your expenses will rise.
The workflow we defined first provisions a runner on AWS, and then uses that runner to train the model. After completing the training job, CML automatically terminates the runner to prevent you from incurring further costs. Once the runner is terminated, however, the model is lost along with it. Let's see how we can save our model in the next step!
Export the model to our Git repository
CML allows us to export the model from our runner to our Git repository. Let's extend the training stage of our workflow by pushing `random_forest.joblib` to a new experiment branch and creating a pull request.
The `cml pr` command creates a pull request containing the specified files. The commands after that generate a report in the pull request that displays the confusion matrix and the calculated metrics.
  train-model:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    timeout-minutes: 120 # 2h
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: iterative/setup-cml@v1
      - name: Train model
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          cml ci
          pip install -r requirements.txt
          python get_data.py
          python train.py

          # Create pull request
          cml pr model/random_forest.joblib

          # Create CML report
          cat model/metrics.txt > report.md
          cml publish model/confusion_matrix.png --md >> report.md
          cml send-comment --pr --update report.md
Et voilà! We are now running a daily model training on an AWS EC2 instance and saving the resulting model to our GitHub repository.
There is still some room for improvement, though. This approach works well when our resulting model is small (less than 100MB), but we wouldn't want to store large models in our Git repository. In a follow-up post we will describe how we can use DVC, another Iterative open-source tool, for storage when we're dealing with larger files.
Conclusions
There are many cases in which it's a good idea to retrain models periodically. For example, you could be using the latest data available to you in order to prevent model drift. CML allows you to automate this process.
In this guide, we explored how to set up CML for a daily training job using a self-hosted runner. We automatically provisioned this runner on AWS, exported the resulting files to our Git repository, and terminated the runner to prevent racking up our AWS bill.
In a follow-up post we will explore how to use DVC when the resulting model is too large to store directly in our Git repository.
Another great extension of our CI/CD would be a `deploy` step to bring the latest version of our model into production. This step might be conditional on the performance of the model; we could decide to only start using it in production if it performs better than previous iterations. All of this warrants a guide of its own, however, so look out for that in the future! 😉