Data Version Control in Machine Learning Models
Sep 22, 2022, by Jake Gellatly, Junior Machine Learning Engineer
Introduction: The core cycle of Model development
When developing ML models, there is a continuous feedback loop for improving their performance. It starts with a wave of data gathering, preparation, wrangling, and analysis to ensure there is enough data to represent every class and/or object. After version one of the model has been trained and tested, it can be deployed to address its particular use case. However, this is just the first round of the model life cycle.
Problems: Integration of new data into the model
Once the model has been released into the wild, it will often encounter new data that was not adequately covered in the training dataset. To address this, there can be a human in the loop for the model development process. They will evaluate the model's performance on real-world data, correct annotations on misclassified images, and then send these images back into the model as training data. This process, in theory, could be a never-ending cycle. Continually adding new training data will enable the model to perform better on edge cases that may not have been present in the initial training dataset. Managing these datasets as they continue to grow is no small task, and if they are poorly structured, the result is messy data and scripts that cannot easily be reproduced in future analysis.
Solutions: Data Version Control
Data Version Control (DVC) (https://dvc.org/) is an open-source tool for managing the data used in model development alongside the scripts that use it. Git is designed for small, mostly text-based files and handles large binary files poorly; DVC lets you apply the same version control principles to large datasets that are impractical to store in Git itself. It allows you to take “snapshots” of the data, features, configuration, code, and the model itself: the large files go into DVC's storage, while small pointer files that describe them are committed to Git together with the code.
For our problem of integrating new data into a model, this is a perfect solution! It allows you to continuously improve a model, track the data that went into training, and retrain. Should something go wrong, it is quick and easy to revert the model to a previous commit. Further, it allows you to quickly and easily share the data and models by simply cloning the GitHub repo that tracks the model and running ‘dvc pull’.
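As a rough sketch of what reverting can look like in practice: restore the older pointer file with Git, then ask DVC to make the data folder match it. The names restore_data_version, run, and DRY_RUN are hypothetical helpers invented for this example, and the script only prints the commands by default, so treat it as an outline rather than a definitive recipe.

```shell
#!/usr/bin/env bash
# Sketch: restore the data that matches an earlier Git commit.
# restore_data_version, run, and DRY_RUN are hypothetical names for this example.

DRY_RUN=1   # 1 = only print the commands; set to 0 to execute them

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: Git revision (commit hash, tag, or branch)
# $2: DVC pointer file (defaults to data.dvc)
restore_data_version() {
  local rev="$1"
  local pointer="${2:-data.dvc}"
  # Bring back the pointer file as it was at that revision...
  run git checkout "$rev" -- "$pointer" &&
  # ...then make the files in the workspace match the restored pointer.
  run dvc checkout "$pointer"
}

# Preview the commands for an example revision.
restore_data_version 2da4203
```

After a real (non dry-run) restore, the restored data.dvc would still need to be committed if the older data is meant to become the current version.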
Below we will show a small working example of how to get started with DVC.
Implementation
Part 1: GitHub initialization
In order to get started, you must install DVC. If you wish to use a remote file storage system (example below), you must also have it properly configured. In our case, this is done beforehand with the AWS CLI. Then follow the steps below in order to set up the GitHub repo.
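Before starting, it can help to confirm that the required command-line tools are available. The sketch below assumes the tools used in this walkthrough are git, dvc, and aws; missing_tools is a hypothetical helper name, not part of any of those tools.

```shell
#!/usr/bin/env bash
# Sketch: report which command-line tools are missing before starting.
# The tool list (git, dvc, aws) reflects this walkthrough's setup; adjust as needed.

# Print the name of every argument that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools git dvc aws)"
if [ -n "$missing" ]; then
  echo "Install these before continuing:" $missing
else
  echo "All prerequisites found."
fi
```

If dvc is reported missing, one common way to install it with S3 support is pip install "dvc[s3]"; other installation methods are described on the DVC website.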
- git clone https://github.com/FoxyAI/dvc_tst/
- cd dvc_tst/
- echo "dvc_test_commit" >> README.md
- git add README.md
- git commit
- git push
(FoxyPY) DevBox@Ec2:~/Blog$ git clone https://github.com/FoxyAI/dvc_tst/
Cloning into 'dvc_tst'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 11 (delta 0), reused 7 (delta 0), pack-reused 0
Unpacking objects: 100% (11/11), done.
(FoxyPY) DevBox@Ec2:~/Blog$ cd dvc_tst/
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ echo "dvc_test_commit" >> README.md
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git add README.md
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git commit
[main eab93e5] commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git push
Counting objects: 3, done.
Writing objects: 100% (3/3), 249 bytes | 249.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/FoxyAI/dvc_tst/
   a2bc499..eab93e5  main -> main
Part 2: DVC initialization
Now that the GitHub repo is properly set up, you can initialize DVC in order to track larger files (training data, model weights, etc.) in a folder contained within the repo.
- dvc init
- mkdir data
- dvc add data
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc init
Initialized DVC repository.
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ mkdir data
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc add data
100% Adding...|███████████████████████████████████████|1/1 [00:00, 27.86file/s]
Now DVC tracks the data folder as a whole, while Git tracks the scripts and smaller files outside of it. Running ‘dvc add data’ created a small data.dvc pointer file and added the data folder to .gitignore, so Git stores only the pointer. Note that DVC records the folder's contents at the time ‘dvc add’ is run, so the command must be rerun after the contents change. To illustrate this, we can add a few test files to see how this works. First, we can use git status to see which files need to be committed, and then commit and push them to GitHub.
- git status
- git add .gitignore data.dvc
- git commit
- git push
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
        new file:   .dvc/.gitignore
        new file:   .dvc/config
        new file:   .dvcignore
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore
        data.dvc
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git add .gitignore data.dvc
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git commit
[main 2da4203] dvc start
 5 files changed, 12 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
 create mode 100644 .gitignore
 create mode 100644 data.dvc
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git push
Counting objects: 8, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (8/8), 695 bytes | 695.00 KiB/s, done.
Total 8 (delta 0), reused 0 (delta 0)
To https://github.com/FoxyAI/dvc_tst/
   770a9aa..2da4203  main -> main
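The data.dvc file committed above is a small text pointer rather than the data itself. The sketch below illustrates the general idea behind such a pointer, a content hash stored in a tiny text file, using plain shell tools. It is an illustration only and does not reproduce DVC's actual file format or cache layout; the file names, the content_hash helper, and the scratch directory are assumptions made for the example, and md5sum is assumed to be available (it ships with GNU coreutils).

```shell
#!/usr/bin/env bash
# Illustration only: mimics the idea of a content-hash pointer file.
# It does not reproduce DVC's real file format or cache layout.

scratch="$(mktemp -d)"   # hypothetical scratch directory
printf 'label,image\ncat,img_001.jpg\n' > "$scratch/train.csv"

# Print the MD5 hash of a file's contents.
content_hash() {
  md5sum "$1" | cut -d' ' -f1
}

hash_v1="$(content_hash "$scratch/train.csv")"

# A few bytes of text that Git could store in place of the data itself.
printf 'md5: %s\npath: train.csv\n' "$hash_v1" > "$scratch/train.csv.pointer"

# Appending a row changes the content, so the hash (and the pointer) changes.
printf 'dog,img_002.jpg\n' >> "$scratch/train.csv"
hash_v2="$(content_hash "$scratch/train.csv")"

echo "version 1: $hash_v1"
echo "version 2: $hash_v2"
```

Because the pointer changes whenever the data changes, committing the pointer to Git is enough to record which version of the data belongs to which version of the code.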
Part 3: DVC file tracking
Now we can add a couple of test files to illustrate how these components come together!
- touch github_tracked_file.txt # adding a file in the main repo
- cd data/
- touch dvc_tracked_file.txt # adding a file in the dvc tracked folder
- cd ..
- tree
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ touch github_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ cd data/
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst/data$ touch dvc_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst/data$ cd ..
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ tree
.
├── README.md
├── data
│   └── dvc_tracked_file.txt
├── data.dvc
└── github_tracked_file.txt
1 directory, 4 files
Part 4: Remote file hosting
We can now link this repo to a remote file storage system. This makes it very easy for collaborators to access your data directly from a model repository!
Once this is done, we rerun ‘dvc add data’ in order to update the status of the data folder (it now contains a new file), commit and push the changes to GitHub, and then run ‘dvc push’ to upload the data itself to the remote storage.
- dvc remote add -d dvc-tst s3://dvc-tst
- dvc add data
- git status
- git add .dvc/config data.dvc github_tracked_file.txt
- git commit
- git push
- dvc push
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc remote add -d dvc-tst s3://dvc-tst
Setting 'dvc-tst' as a default remote.
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc add data
100% Adding...|████████████████████████████████████████|1/1 [00:00, 40.69file/s]
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
        modified:   .dvc/config
        modified:   data.dvc
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        github_tracked_file.txt
no changes added to commit (use "git add" and/or "git commit -a")
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git add .dvc/config data.dvc github_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git commit
[main a0f776c] remote track
 3 files changed, 6 insertions(+), 2 deletions(-)
 create mode 100644 github_tracked_file.txt
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ git push
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 622 bytes | 622.00 KiB/s, done.
Total 5 (delta 0), reused 0 (delta 0)
To https://github.com/FoxyAI/dvc_tst/
   2da4203..a0f776c  main -> main
(FoxyPY) DevBox@Ec2:~/Blog/dvc_tst$ dvc push
2 files pushed
Part 5: Sharing and synchronization
Now everything is fully set up! Any changes within the data folder can be recorded with ‘dvc add data’, followed by a git commit and push of the updated data.dvc file and a ‘dvc push’ to upload the data itself. These changes to your data folder are then version controlled and synchronized to the S3 remote storage.
In order to illustrate this concept, we can test how this repo behaves for a “new” user, whether it be a collaborator with whom you want to share scripts and data, or your future self updating a model with new data! First, we can move to a new directory, clone the repo, and run ‘tree’ to see the files it contains.
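The update cycle can be bundled into a small helper so that no step is forgotten. The sketch below reuses the commands from Part 4; snapshot_data, run, and DRY_RUN are hypothetical names chosen for this example, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
#!/usr/bin/env bash
# Sketch: one data-update cycle, using the commands from this walkthrough.
# snapshot_data, run, and DRY_RUN are hypothetical names for this example.

DRY_RUN=1   # 1 = only print the commands; set to 0 to execute them

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: commit message describing the data change
snapshot_data() {
  local message="$1"
  # Re-hash the data folder and update the data.dvc pointer file.
  run dvc add data &&
  # Record the new pointer in Git and share it.
  run git add data.dvc &&
  run git commit -m "$message" &&
  run git push &&
  # Upload the data itself to the default DVC remote.
  run dvc push
}

snapshot_data "add corrected annotations"
```

In dry-run mode the commit message is printed without its quotes, which is fine for a preview. Chaining the steps with && stops the cycle at the first failure, so a failed commit is never followed by a push.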
- mkdir new_collab
- cd new_collab/
- git clone https://github.com/FoxyAI/dvc_tst/
- cd dvc_tst/
- tree
(FoxyPY) DevBox@Ec2:~/Blog$ mkdir new_collab
(FoxyPY) DevBox@Ec2:~/Blog$ cd new_collab/
(FoxyPY) DevBox@Ec2:~/Blog/new_collab$ git clone https://github.com/FoxyAI/dvc_tst/
Cloning into 'dvc_tst'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 35 (delta 4), reused 20 (delta 0), pack-reused 0
Unpacking objects: 100% (35/35), done.
(FoxyPY) DevBox@Ec2:~/Blog/new_collab$ cd dvc_tst/
(FoxyPY) DevBox@Ec2:~/Blog/new_collab/dvc_tst$ tree
.
├── README.md
├── data.dvc
└── github_tracked_file.txt
0 directories, 3 files
As you can see, cloning the GitHub repo only pulled the files tracked by GitHub! Now we can run ‘dvc pull’ to pull the current state of the data directory from the remote S3 file storage system.
- dvc pull
- tree
(FoxyPY) DevBox@Ec2:~/Blog/new_collab/dvc_tst$ dvc pull
A       data/
1 file added and 1 file fetched
(FoxyPY) DevBox@Ec2:~/Blog/new_collab/dvc_tst$ tree
.
├── README.md
├── data
│   └── dvc_tracked_file.txt
├── data.dvc
└── github_tracked_file.txt
1 directory, 4 files
Conclusion
The model lifecycle involves continually updating scripts, parameters, and training data in order to improve model performance. This can become a messy process without proper version control. While Git (hosted here on GitHub) is great at providing version control for the scripts, it is poorly suited to larger files such as training data and model weights. Enter DVC! This system enables you to use your favorite version control system, tie it into remote file hosting systems (such as S3), and continually track the current state of your model, as well as all the files used to produce it. This can be a lifesaver for reproducibility, stability, and easier sharing with collaborators!