Version Your S3 Datasets & Models with Git

Sam Horradarn

Joe Godlewski

Your models deserve better than S3

An important piece of ML development is the ability to look back at past versions of models and the datasets used to create them. Versioning is about change management, and as humans we contextualize it to our work in some way. While object storage services like AWS S3 have become the default platform for storing model and dataset artifacts, their versioning capability is focused on tracking changes at the individual artifact level. Turning on S3 versioning can also be prohibitively expensive, because every change stores a complete new copy of the object, even if only a single line has been modified.

For ML, however, knowing the set of files related to a particular version, and what has changed between versions, is very important. Users want context and documentation for the version and the changes contained within — for both the artifacts and code used to generate them — in order to understand why a model’s behavior has changed.

XetHub as a versioned object store

XetHub provides versioning in a way that alleviates many of the issues described earlier. Xet repositories are backed by Git and scaled to support over 10TB per repo to match the needs of modern ML workflows. Like Git, XetHub uses commits to track versions across all files in a repository, allowing users to view snapshots of their data at any particular iteration. Users can even use XetHub to track code and artifacts together, removing the need for extra tooling to coordinate the versions of code and data across multiple systems.
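To make the idea of repository-wide snapshots concrete, here is a minimal Python sketch. It is a simplified illustration and not how Git or XetHub work internally: each commit is modeled as a mapping from file path to content hash, and the helper names `snapshot` and `diff_snapshots` as well as the file names are hypothetical.

```python
import hashlib


def snapshot(files):
    """Model a commit as a mapping from file path to a hash of its content."""
    return {path: hashlib.sha256(content).hexdigest()
            for path, content in files.items()}


def diff_snapshots(old, new):
    """Report which paths were added, removed, or modified between two commits."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in shared if old[p] != new[p]),
    }


# Two hypothetical commits: the dataset grows and an evaluation file appears.
commit_1 = snapshot({"train.csv": b"a,b\n1,2\n", "model.bin": b"\x00\x01"})
commit_2 = snapshot({
    "train.csv": b"a,b\n1,2\n3,4\n",
    "model.bin": b"\x00\x01",
    "eval.csv": b"x\n",
})

changes = diff_snapshots(commit_1, commit_2)
print(changes)
```

Because the snapshot covers every file at once, a single comparison answers both "which files belong to this version" and "what changed since the last one", which per-object versioning cannot do on its own.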

With XetHub’s built-in block-level deduplication, additional versions of files only require storing changed blocks of data, so iterations on data and models are stored much more efficiently.
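The storage savings from block-level deduplication can be sketched in a few lines of Python. This is a toy model under stated assumptions, not XetHub's implementation: it uses fixed-size blocks and SHA-256 hashes, and the `BlockStore` class and `BLOCK_SIZE` value are hypothetical names chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, for illustration only


class BlockStore:
    """Toy content-addressed store: each unique block is kept exactly once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block bytes

    def put_file(self, data):
        """Store a file; return (list of block digests, count of newly stored blocks)."""
        recipe = []
        new_blocks = 0
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)
        return recipe, new_blocks

    def get_file(self, recipe):
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in recipe)


store = BlockStore()
v1 = b"aaaabbbbccccdddd"
v2 = b"aaaabbbbXXXXdddd"  # same file with one block changed

recipe_v1, new_v1 = store.put_file(v1)
recipe_v2, new_v2 = store.put_file(v2)

print(new_v1, new_v2, len(store.blocks))
```

Storing the second version adds only the one changed block, while both versions remain fully reconstructible from their digest lists.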

If you already have your datasets and artifacts in S3, it can be a heavy lift to manually download all the data to a local machine and move it into XetHub just to try it out. To make testing XetHub versioning on your files easier than ever, we have just released an S3 import function that periodically syncs an S3 bucket to a XetHub repository.

Flex your new versioning muscles

Follow our instructions to import and sync your S3 bucket with a new XetHub repository. Each sync will move files from S3 to XetHub and commit them with a message that shows where the file was copied from, allowing you to try out XetHub risk-free.

Once your files are in XetHub, install our Xet CLI and try these Xet access patterns to get the most out of your newly versioned assets:

Read the latest version of your files without needing to download anything

Access a copy of your files from a week ago

xet ls

Grab the output file associated with a certain commit

xet cp

And easily explore your files through our UI.

🎉 Making the switch

Loving your versioned XetHub view and functionality? Easily move away from S3 by running a one-time S3 import (with no sync) and start writing your files to XetHub instead of S3. This will allow you to fully manage your large ML assets using Git semantics, Python, or the Xet CLI. For example, writing to a Xet repository is as easy as:

git add 
git commit -m "Adding file"
git

Python

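import pyxet

fs = pyxet.XetFS()
# Writes made inside the transaction are committed together with this message.
with fs.transaction as tr:
    tr.set_commit_message("Write file to repository")
    # "///" is a placeholder; substitute the path to a file in your own repository.
    fs.open("///", 'w').write("Hello world!")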

Xet CLI

xet cp

XetHub is free for all public use and private repositories under 20GB — try it today!

Next Steps

If you're using S3 to version your large models and datasets, we'd love to hear from you! You're welcome to reach out to us over email or join our Slack.
