Tech Blog

Data Version Control in Machine Learning Models

Sep 22, 2022

By Jake Gellatly, Junior Machine Learning Engineer

Introduction: The core cycle of model development

[Figure: Machine Learning Lifecycle Diagram]

Developing an ML model is a continuous feedback loop aimed at improving its performance. It starts with a wave of data gathering, preparation, wrangling, and analysis to ensure there is enough data to represent every class and/or object of interest. After version one of the model has been trained and tested, it can be deployed to address its particular use case. However, this is just the first round of the model life cycle.

Problems: Integration of new data into the model

[Figure: Human Automation and AI Output]

Once the model has been released into the wild, it will often encounter new data that was not adequately covered in the training dataset. To address this, a human can be kept in the loop of the model development process. They evaluate the model's performance on real-world data, correct the annotations on misclassified images, and then send these images back into the model as training data. This process could, in theory, be a never-ending cycle. Continually adding new training data enables the model to perform better on edge cases that may not have been present in the initial training dataset. Managing these datasets as they continue to grow is no small task, and if they are poorly structured, the result is messy data and scripts that cannot easily be reproduced in future analysis.

Solutions: Data Version Control

[Figure: Version Control Diagram]

Data Version Control (DVC) (https://dvc.org/) is a tool for managing both the scripts and the data used in model development. Git works best with small, text-based files and handles large binary files poorly; DVC lets you apply the same version control principles to large datasets that are impractical to store in Git itself. It does this by keeping the large files in its own storage while Git tracks small pointer files that describe them. This allows you to take “snapshots” of the data, features, configuration, code, and the model itself, and commit them together.

[Figure: Version Control Diagram Two]

For our problem of integrating new data into a model, this is a great fit! It lets you continuously improve a model, track exactly which data went into each round of training, and retrain. Should something go wrong, you can quickly revert the model and its data to a previous commit. It also makes sharing straightforward: a collaborator clones the GitHub repo that tracks the model and runs ‘dvc pull’ to fetch the data and models.
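As a rough illustration of the rollback described above, the sketch below restores an earlier version of a dataset. It assumes the dataset is tracked through a pointer file named data.dvc, as in the walkthrough that follows. The helper names run and restore_data and the DRY_RUN switch are hypothetical, introduced only for this example.

```shell
# Run a command, or only print it when DRY_RUN=1 (handy for previewing).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restore the data described by a pointer file as of an earlier Git revision.
#   $1: Git revision (commit hash, tag, or branch)
#   $2: DVC pointer file (assumed default: data.dvc)
# First the old pointer file is taken from Git history, then DVC makes the
# working copy of the data match that pointer.
restore_data() {
    rev=$1
    pointer=${2:-data.dvc}
    run git checkout "$rev" -- "$pointer" &&
    run dvc checkout "$pointer"
}
```

Calling it as DRY_RUN=1 restore_data HEAD~1 prints the two commands without running them. If the older data is no longer in the local DVC cache, a ‘dvc pull’ is needed to fetch it from remote storage.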

Below we will show a small working example of how to get started with DVC.

Implementation

Part 1: GitHub initialization

To get started, you must install DVC. If you wish to use remote file storage (as in the example below), it must also be properly configured; in our case, this was done beforehand with the AWS CLI. Then follow the steps below to set up the GitHub repo.
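Before running the steps, it can help to confirm that the required tools are available. The snippet below is a minimal sketch: check_tools is a hypothetical helper name, and the installation commands in the trailing comments assume a pip-based Python environment, which the original setup does not specify.

```shell
# Report whether each named tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Typical installation (assumption: pip is available):
#   pip install "dvc[s3]"   # DVC together with its S3 support
#   aws configure           # store the AWS credentials used for the S3 remote
```

For this walkthrough, the call would be check_tools git dvc aws.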

  1. git clone https://github.com/FoxyAI/dvc_tst/
  2. cd dvc_tst/
  3. echo "dvc_test_commit" >> README.md
  4. git add README.md 
  5. git commit
  6. git push
(FoxyPY) DevBox@Ec2:~/Blog$ git clone https://github.com/FoxyAI/dvc_tst/
Cloning into 'dvc_tst'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 11 (delta 0), reused 7 (delta 0), pack-reused 0
Unpacking objects: 100% (11/11), done.
(FoxyPY) DevBox@Ec2:~/Blog$ cd dvc_tst/
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ echo "dvc_test_commit" >> README.md
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git add README.md 
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git commit
[main eab93e5] commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git push
Counting objects: 3, done.
Writing objects: 100% (3/3), 249 bytes | 249.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/FoxyAI/dvc_tst/
   a2bc499..eab93e5  main -> main

 

 

Part 2: DVC initialization

Now that the GitHub repo is properly set up, you can initialize DVC to track larger files (training data, model weights, etc.) in a folder contained within the repo.

  1. dvc init
  2. mkdir data
  3. dvc add data
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc init
Initialized DVC repository.
 
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ mkdir data
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc add data
100% Adding...|███████████████████████████████████████|1/1 [00:00, 27.86file/s]
                                                                                                         
 

Now DVC will track any files contained within the data folder, while Git will track all scripts and smaller files outside of this folder. To illustrate this, we can add a few test files and see how it works. First, we use ‘git status’ to see which of the files created by DVC still need to be committed, then add, commit, and push them so that everything is synchronized with GitHub.

  1. git status
  2. git add .gitignore data.dvc
  3. git commit
  4. git push
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git status
On branch main
Your branch is up to date with 'origin/main'.
 
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
 
    new file:   .dvc/.gitignore
    new file:   .dvc/config
    new file:   .dvcignore
 
Untracked files:
  (use "git add <file>..." to include in what will be committed)
 
    .gitignore
    data.dvc
 
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git add .gitignore data.dvc 
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git commit
[main 2da4203] dvc start
 5 files changed, 12 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
 create mode 100644 .gitignore
 create mode 100644 data.dvc
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git push
Counting objects: 8, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (8/8), 695 bytes | 695.00 KiB/s, done.
Total 8 (delta 0), reused 0 (delta 0)
To https://github.com/FoxyAI/dvc_tst/
   770a9aa..2da4203  main -> main
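The split between the two tools can be checked from the shell. The sketch below relies on two things DVC sets up when ‘dvc add data’ runs: a small pointer file (data.dvc) that Git tracks, and a .gitignore entry that hides the data folder itself from Git. The helper name describe_tracking is hypothetical and exists only for this illustration.

```shell
# Describe how a path is split between DVC and Git.
#   $1: path relative to the repo root, without a trailing slash (e.g. data)
describe_tracking() {
    path=$1
    # DVC stores a pointer file next to the tracked path: <path>.dvc
    if [ -f "$path.dvc" ]; then
        echo "$path: has DVC pointer file $path.dvc"
    fi
    # git check-ignore exits with 0 when the path matches a .gitignore rule
    if git check-ignore -q "$path"; then
        echo "$path: ignored by Git"
    else
        echo "$path: visible to Git"
    fi
}
```

Run from the repo root, describe_tracking data should report both a pointer file and a Git ignore rule, while a regular script or text file is reported as visible to Git.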

 

Part 3: DVC file tracking

Now we can add a couple of test files to illustrate how these components come together!

  1. touch github_tracked_file.txt # adding a file in the main repo
  2. cd data/
  3. touch dvc_tracked_file.txt # adding a file in the dvc tracked folder
  4. cd ..
  5. tree
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ touch github_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ cd data/
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst/data$ touch dvc_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst/data$ cd ..
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ tree
.
├── README.md
├── data
│   └── dvc_tracked_file.txt
├── data.dvc
└── github_tracked_file.txt
 
1 directory, 4 files

 

Part 4: Remote file hosting

We can now link this repo to a remote file storage system. This makes it very easy for collaborators to access your data directly from a model repository!

Once this is done, we rerun ‘dvc add data’ to update the status of the data folder (which now contains a new file), commit and push the result to GitHub, and finally run ‘dvc push’ to upload the data itself to S3.

  1. dvc remote add -d dvc-tst s3://dvc-tst
  2. dvc add data
  3. git status
  4. git add .dvc/config data.dvc github_tracked_file.txt
  5. git commit
  6. git push
  7. dvc push
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc remote add -d dvc-tst s3://dvc-tst
Setting 'dvc-tst' as a default remote.
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc add data
100% Adding...|████████████████████████████████████████|1/1 [00:00, 40.69file/s]
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git status
On branch main
Your branch is up to date with 'origin/main'.
 
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
 
    modified:   .dvc/config
    modified:   data.dvc
 
Untracked files:
  (use "git add <file>..." to include in what will be committed)
 
    github_tracked_file.txt
 
no changes added to commit (use "git add" and/or "git commit -a")
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git add .dvc/config  data.dvc github_tracked_file.txt 
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git commit
[main a0f776c] remote track
 3 files changed, 6 insertions(+), 2 deletions(-)
 create mode 100644 github_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git push
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 622 bytes | 622.00 KiB/s, done.
Total 5 (delta 0), reused 0 (delta 0)
To https://github.com/FoxyAI/dvc_tst/
   2da4203..a0f776c  main -> main
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc push
2 files pushed

Part 5: Sharing and synchronization

Now everything is fully set up! Any change within the data folder can be recorded with ‘dvc add data’, followed by a git commit and push of the updated data.dvc file. Running ‘dvc push’ then uploads the data itself, so the data folder is both version controlled and synchronized with the S3 remote storage.
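The update cycle can be wrapped in a small helper. This is only a sketch under the assumptions of this walkthrough (a folder tracked with ‘dvc add’, a default DVC remote, and a Git remote that is already configured); the names run and sync_data and the DRY_RUN switch are hypothetical.

```shell
# Run a command, or only print it when DRY_RUN=1 (handy for previewing).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Record the current state of a DVC-tracked folder and publish it.
#   $1: tracked folder without a trailing slash (e.g. data)
#   $2: Git commit message
# The chain stops at the first failing step.
sync_data() {
    dir=$1
    msg=$2
    run dvc add "$dir" &&
    run git add "$dir.dvc" &&
    run git commit -m "$msg" &&
    run git push &&
    run dvc push
}
```

For example, DRY_RUN=1 sync_data data "add new images" prints the five commands in order without executing them; dropping DRY_RUN=1 runs them for real.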

To illustrate this, we can test how the repo behaves for a “new” user, whether that is a collaborator you want to share scripts and data with, or your future self updating a model with new data! First, we move to a new directory, clone the repo, and run ‘tree’ to see the files it contains.

  1. mkdir new_collab
  2. cd new_collab/
  3. git clone https://github.com/FoxyAI/dvc_tst/
  4. cd dvc_tst/
  5. tree
(FoxyPY) DevBox@Ec2:~/Blog$ mkdir new_collab
(FoxyPY) DevBox@Ec2:~/Blog$ cd new_collab/
(FoxyPY) DevBox@Ec2:~/Blog/new_collab$ git clone https://github.com/FoxyAI/dvc_tst/
Cloning into 'dvc_tst'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 35 (delta 4), reused 20 (delta 0), pack-reused 0
Unpacking objects: 100% (35/35), done.
(FoxyPY) DevBox@Ec2:~/Blog/new_collab$ cd dvc_tst/
(FoxyPY) DevBox@Ec2:~/Blog/new_collab/dvc_tst$ tree
.
├── README.md
├── data.dvc
└── github_tracked_file.txt
 
0 directories, 3 files

 

As you can see, cloning the GitHub repo only pulled the files tracked by Git! Now we can run ‘dvc pull’ to fetch the current state of the data directory from the remote S3 storage. Note that the new user needs read access to the S3 bucket for this to work.

  1. dvc pull
  2. tree
 
(FoxyPY) DevBox@Ec2:~/Blog/new_collab/dvc_tst$ dvc pull
A       data/                                                                        
1 file added and 1 file fetched                                                      
(FoxyPY) DevBox@Ec2:~/Blog/new_collab/dvc_tst$ tree
.
├── README.md
├── data
│   └── dvc_tracked_file.txt
├── data.dvc
└── github_tracked_file.txt
 
1 directory, 4 files

 

Conclusion

The model lifecycle involves continually updating scripts, parameters, and training data to improve model performance. This can become a messy process without proper version control. While Git, hosted on GitHub, is great at version controlling the scripts, it is poorly suited to larger files such as training data and model weights. Enter DVC! It lets you keep using your favorite version control system, tie it into remote file hosting (such as S3), and continually track the current state of your model along with all the files used to produce it. This can be a lifesaver for reproducibility, stability, and easier sharing with collaborators!