December 8, 2023

Our New, Git-Centric ML Versioning Framework

Srini Kadamati

The Messy ML Workflow

Machine learning practitioners are used to gluing together multiple tools for experiment tracking, hosting datasets, versioning datasets, hosting models, versioning models, reviewing changes, and monitoring models. Entire products exist to solve each of these problems individually, and stitching them together causes:

  • model reproducibility to be very difficult

  • challenges onboarding new team members or interfacing with other teams

  • friction for teams looking to move fast in ML


We believe that the root of these problems is the limitations of Git.

While Git and hosted Git solutions like GitHub and GitLab have become extremely popular in the world of software engineering, they struggle to store and version even 100 MB datasets and models. ML teams are left to use other tools like DVC, S3, and HuggingFace in addition to Git and GitHub.

At XetHub, we asked ourselves a simple question. What if we could just overcome Git's limitations? Could we scale Git to handle terabytes of large files?

Scaling Git would enable ML practitioners to version datasets, code, and models in the same repo. After we solved these problems, wrote about how we solved them, and launched the XetHub platform, people started to ask us for advice on:

  • When to make Git commits?

  • How should we separate, name, and categorize our branches?

  • How do we know what model is in production?

In this post, we’ll provide an overview of our opinionated ML versioning framework and a few resources to dive deeper.

An Approachable Overview (Talk)

Recently, our team member Yonatan Alexander gave a fantastic 30-minute talk at PyGrunn that is approachable and a good starting point.

The Framework in Detail

Next, we recommend reading Yonatan’s blog post on Towards Data Science, which goes into much more detail and provides a clear playbook you can take with you to use in any tool.

Branches are at the heart of our ML versioning framework. In our view, using branches deliberately helps you and your team experiment freely and confidently, and separates model discovery from delivery work.

We believe that you should maintain a few different types of branches as you work on an ML project:

  • data: mainly contains datasets and documentation

  • analysis: for running analyses, A/B tests, etc.

  • stable: active branches for training & inference

  • coding: meant for code development and active data exploration

  • monitoring: contains the data, commit tag, and model predictions from production, which is useful for detecting data drift
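The branch types above could be laid out as in the following minimal sketch, which builds a throwaway repo in a temporary directory. Apart from dev-pca-lr and monitoring, which appear in our example repo, the branch and file names here are hypothetical, and the sketch uses plain Git commands only, without the XetHub extension.

```shell
set -eu

# Throwaway sandbox so the sketch never touches a real project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example User"

# First commit: a toy dataset plus its documentation.
mkdir -p datasets
printf 'x,y\n1,2\n3,4\n' > datasets/train.csv
echo "Toy dataset used to illustrate the branch layout." > README.md
git add .
git commit -q -m "Add toy dataset and docs"

# Rename whatever the default branch is called to the data branch.
git branch -m data

# One branch per remaining type, all starting from the data branch.
# analysis-ab-test and stable-lr are hypothetical names.
for b in analysis-ab-test stable-lr dev-pca-lr monitoring; do
  git branch "$b" data
done

# List the resulting branches.
git branch --list
```

Starting every branch from the data branch is just one possible choice for this sketch; how you name and root your branches is up to your team.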

To dive deeper, we recommend reading the full post here.

An Example Repo

Once you’re ready to get your hands dirty, we recommend cloning and playing with our example repo hosted on XetHub.

Deduplication

Immediately, you’ll notice that our deduplication technology reduced the total repo size from 120 megabytes to 54 megabytes. This is part of our not-so-secret sauce for scaling Git to handle large files.

Because of our efficient deduplication, remixing datasets or creating branches with training & test sets from full datasets often doesn’t consume more storage space. It also makes switching branches and uploading changes back to a central repository significantly faster.
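To build intuition for why a train split carved out of a full dataset can cost little extra storage, here is a toy sketch of chunk-level deduplication. It is not our actual implementation: it assumes simple fixed-size 512-byte chunks, uses `git hash-object` only as a convenient content hash, and all file names are hypothetical. Production systems often pick chunk boundaries based on content rather than fixed offsets, so they can also match data that has shifted position.

```shell
set -eu

work="$(mktemp -d)"

# A toy "full dataset" and a "train split" made of its first 800 lines.
seq 1 1000 > "$work/full.txt"
head -n 800 "$work/full.txt" > "$work/train.txt"

# Print one content hash per fixed-size chunk of a file.
chunk_hashes() {
  dir="$(mktemp -d)"
  split -b 512 "$1" "$dir/chunk."
  for c in "$dir"/chunk.*; do
    git hash-object "$c"
  done
}

chunk_hashes "$work/full.txt" | sort -u > "$work/full.hashes"
chunk_hashes "$work/train.txt" | sort -u > "$work/train.hashes"

# Chunks of the train split that the full dataset already contains.
shared="$(comm -12 "$work/full.hashes" "$work/train.hashes" | wc -l | tr -d ' ')"
total="$(wc -l < "$work/train.hashes" | tr -d ' ')"
echo "train split: $shared of $total chunks already stored"
```

Any chunk whose hash is already known does not need to be stored or uploaded again, which is the general idea behind the storage and transfer savings described above.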

Not a XetHub user? We also built a GitHub integration that scales your GitHub repos to handle large models and datasets next to your code, and our versioning framework shines there as well.

Exploring the Repo Locally

First, sign up for a free account and then head to our quickstart page. You’ll need to install our Git extension and authenticate using a personal access token.

Then, run the following command:

git


Then, list all of the branches using:

git branch -r

From here, you can use all of the familiar Git commands to explore the repo.

# Check out the PCA model in development
git checkout origin/dev-pca-lr
# View the train & test datasets (listed from the repo root)
ls -al data/

# Check out the current model in production
git checkout origin/monitoring
# View model files (listed from the repo root)
ls -al models/
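If you want to see how a development branch differs from what is in production, a small helper like the one below may be handy. This is only a sketch: `compare_branches` is a hypothetical function name, and it relies on nothing beyond the standard `git log` and `git diff --stat` commands.

```shell
# Summarise how one branch differs from another.
# Usage: compare_branches <base-ref> <other-ref>
compare_branches() {
  base="$1"
  other="$2"
  echo "Commits on $other that are not on $base:"
  git log --oneline "$base..$other"
  echo "Files that differ between $base and $other:"
  git diff --stat "$base" "$other"
}

# Example, using the branch names from the walkthrough above:
# compare_branches origin/monitoring origin/dev-pca-lr
```

Run it from inside the cloned repo; the first argument is treated as the baseline, so passing the monitoring branch first shows what the development branch would change.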

Join our Community

If you’ve tried this framework for yourself, we’d love to hear from you! Join our Slack community here.
