November 16, 2023

XetData: Scale your GitHub Repos to 100 Terabytes

Srini Kadamati

Announcing our XetData integration for GitHub!

We’re excited to announce that XetData is now in beta. XetData turns GitHub into a single source of truth for machine learning projects by enabling you to version code, datasets, models, and other large assets in the same repo.

With XetData, after you configure our GitHub app and install our Git extension, the tooling fades into the background and you can continue using your familiar Git commands.

Versioning your machine learning development lifecycle in a single system lets us build unique features and enable unique workflows.

Dataset and model diffs

We can compare the same model file across 2 different Git branches and show useful model architecture visualizations powered by Lutz Roeder’s Netron library.

Automatic reproducibility

Most ML frameworks promise reproducibility only if you make major workflow changes, learn new commands, and lock into their system. And even then, there’s often drift between source code (managed by GitHub) and their MLOps system (for versioning datasets and models).

With our XetData integration, you can include all of the code, datasets, and models in your commits and pull requests. Reviewers then have all of the context they need to give useful feedback and help you confidently merge your work into the repo. When you’re trying to identify issues with your model in production 3 months later, you can go back to the specific commit and recreate the work again.

Over time, you can implement a lightweight ML versioning philosophy using commits and branches. You can read more about our opinionated, branch-centric ML versioning framework here.

Tap into the Git tooling ecosystem

Because our Git-Xet extension uses Git under the hood, we work with most Git tools out of the box! For example, you can carry out most common Git operations from inside VS Code or PyCharm using their existing no-code Git features. You can learn more in our docs.

How it Works

Staging changes and making commits

Every time you stage changes (git add .) or make a commit (git commit -m "commit message"), the Git-Xet filter runs behind the scenes to detect duplicate blocks of data in your files. This speeds up uploads, reduces storage space, and encourages practitioners to version more liberally because incremental changes are very cheap.

Pushing up your changes

When you push your changes to GitHub (git push origin), the Git-Xet client:

  • pushes raw binary files (videos, images, ML models, etc.) and large files (big CSVs) to XetData

  • pushes raw source code and just file hashes for the large & binary files above to GitHub
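
To make the split above concrete, here is a minimal Python sketch. It is not the Git-Xet client: the function names, the size threshold, the extension list, and the pointer format are all hypothetical, chosen only to illustrate sending large and binary content to a separate store while Git receives source code plus a hash per large file.

```python
import hashlib
import os
from typing import Dict, Tuple

# Hypothetical values for illustration only; the real client's rules may differ.
LARGE_FILE_BYTES = 10 * 1024 * 1024
BINARY_EXTENSIONS = {".mp4", ".png", ".jpg", ".pt", ".onnx"}


def is_large_or_binary(name: str, data: bytes) -> bool:
    """Decide whether a file should go to the large-file store."""
    ext = os.path.splitext(name)[1].lower()
    return ext in BINARY_EXTENSIONS or len(data) >= LARGE_FILE_BYTES


def split_push(files: Dict[str, bytes]) -> Tuple[Dict[str, bytes], Dict[str, bytes]]:
    """Return (git_payload, blob_payload) for a set of files.

    git_payload maps file names to what Git would store: the raw content
    for source code, or a small hash pointer for large and binary files.
    blob_payload maps content hashes to the raw bytes kept outside Git.
    """
    git_payload: Dict[str, bytes] = {}
    blob_payload: Dict[str, bytes] = {}
    for name, data in files.items():
        if is_large_or_binary(name, data):
            digest = hashlib.sha256(data).hexdigest()
            blob_payload[digest] = data
            # Made-up pointer format; only the hash travels to Git.
            git_payload[name] = f"pointer sha256:{digest}".encode()
        else:
            git_payload[name] = data
    return git_payload, blob_payload


files = {"train.py": b"print('hi')", "model.onnx": b"\x00\x01\x02"}
git_payload, blob_payload = split_push(files)
```

In this sketch, train.py reaches Git unchanged, while model.onnx is represented in Git by a short pointer and its bytes live in the separate store.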

Because we’re built on top of Git, we don’t require you to learn a new set of commands that you need to run in addition to your Git commands. Compare the XetData approach to the DVC approach:

DVC workflow image credit to Evgenii Munin

Block-based Deduplication

If the changes in the commit(s) you’re pushing overlap heavily with what’s in your main branch, then the upload can be quite fast because of the deduplication detection step that happened earlier. We invented our own deduplication technique and even published a paper at CIDR'23 on our approach.
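
To see why overlapping content makes a push cheap, consider the simplified Python sketch below. It uses fixed-size blocks and SHA-256 hashes purely for illustration; it is not the technique from our paper, and the function names and the tiny block size are hypothetical.

```python
import hashlib
from typing import Dict, List, Set

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_upload(
    new_data: bytes, known_hashes: Set[str], block_size: int = BLOCK_SIZE
) -> Dict[str, bytes]:
    """Return only the blocks whose hashes the remote has not seen yet."""
    missing: Dict[str, bytes] = {}
    for i in range(0, len(new_data), block_size):
        block = new_data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:
            missing[digest] = block
    return missing


main_version = b"aaaabbbbcccc"    # content already stored for the main branch
branch_version = b"aaaabbbbdddd"  # same file after a small edit on a branch

known = set(block_hashes(main_version))
missing = blocks_to_upload(branch_version, known)
```

Here only the last block differs, so only that block would need to be uploaded. Fixed-size blocks stop matching as soon as bytes are inserted or deleted earlier in a file, which is one reason production deduplication systems tend to use more sophisticated chunking.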

On average, we find that we’re 5-8x faster than DVC, S3, and LakeFS when uploading large files because of our deduplication.

XetHub vs XetData

Our founding team built Apple’s internal ML data and compute platform from the ground up. They witnessed firsthand the challenges machine learning engineers and data scientists faced when it came to versioning machine learning datasets and models. The teams there had tried Git, Git LFS, custom S3 versioning schemes, and DVC but none of them were widely used and users were unhappy with all of their approaches.

The early team set out to sea by creating a blob store with version control built in, as if Git and S3 had a baby. We built a complete platform with a lot of batteries included and called it XetHub.

  • Git repo management (VCS)

  • Website with a UI for collaboration (viewing commits, PRs, etc)

  • Compute platform for deploying Streamlit, Gradio, and other Python apps

To really tap into the power of versioning data, code, and models in one repo, our users had to migrate everything over to XetHub. Many teams were happy to do so for the platform benefits.

But we asked ourselves — could we bring the power of XetHub to GitHub? Could we upgrade the GitHub experience to continue managing code but have XetHub take over large files like datasets and models? XetData is what emerged out of that exploration and we’re excited to have you try it out. 🎉

Git Started

While we’re in beta, XetData is completely free of charge. Store and version large files in your GitHub repos to your heart’s content! To get started, follow the Quick Start on our XetData GitHub app page.

In summary, follow these 3 steps to try it for yourself:

  • Add our GitHub app to one of your GitHub repos

  • Install our tiny Git-Xet extension

  • Run through the Git lifecycle (add, commit, and push) 🎉

Get Help

If you have questions or run into issues, reach out to us in our Slack community! You can also file bugs in our public GitHub issue tracker.
