Introducing XetHub

Ann Huang

Data is hard!

Everyone just expects data storage to work. They want to put something somewhere, access it, and never worry about it again.

We first realized this when designing Apple's internal ML storage platform, and our naive reaction was: Easy! Let's provide a blobstore and let them come to us. And so they did. Adoption was widespread and traffic skyrocketed, and let's just say that we have battle scars from our success.

With the good came the bad. The more we looked at our system, the more we saw the inefficiencies and reproducibility issues related to playing loose and disorganized with data. So we built a second internal product to help track structured data and datasets, with rigid structures to ensure reproducibility and audit trails. But this solution wasn't flexible enough for teams who were used to their own data structures, tools, and flows.

Our rigid solution solved for provenance and review, but was too heavyweight for quick iteration. Our loose solution solved for ease of use, but led to overwhelming storage bloat (one usage audit showed a shocking number of duplicate datasets in the system, each hundreds of gigabytes large) and other untracked versioning issues. Somewhere between these two ends of the spectrum was the ideal solution: one that would allow for quick exploration and iterative experiments while also being storage-efficient and versioned.

The quest to create this solution for everyone led us to the formation of XetHub. We didn't want to be overly biased by our experiences, so we spoke to data scientists, ML engineers, managers, and other data-adjacent potential users across various fields to hear about their pains. The more people we talked to, the clearer the requirements became:

Flexible file support
Speed at scale
Seamless collaboration

Introducing XetHub

We're excited to announce the public beta of XetHub, a collaborative storage platform for data management. XetHub aims to address each of the above requirements head-on towards our end goal: to make working with data as fast and collaborative as working with code.

Flexible file support

Files are the lowest common denominator for working with data, but what types of files should be supported? For instance, ML data can be tensors, text, audio, images, and more. Even "simple" tabular data comes in various formats (e.g., CSV, Parquet, HDF5, SQLite), each with its own advantages and disadvantages. There is no one-size-fits-all file type.

Yet how many times have you followed a link to a promising product, only to be stopped cold when you realize that it doesn't support the file types that you need? This makes perfect sense from a prioritization and scoping lens, but adds pain for users, who must then scramble to reformat their data to fit the tools they want to use.

To truly support our users, we must support all tools and workflows, without adding friction. We do so naturally by leveraging an existing interface that everyone is already using for code: Git — with magic scaling to support huge repositories.

With XetHub, users can run the same flows and commands they already use for code (e.g., commits, pull requests, history, and audits) with repositories of up to 1 TB. Our Git-backed protocol allows easy integration with existing workflows, with no need to reformat files or adopt heavyweight data ecosystems, and also allows for incremental difference tracking on compatible data types.

# Clone large repositories
git xet clone <XetHub Git URL>

# Normal Git commands work as usual
git checkout
git push
git pull

One more thing

That's not all. We knew that we didn't want to repeat our storage bloat issues, so we developed a fast, powerful, and efficient data deduplication method that supports arbitrary file formats.

While our default deduplication algorithm works quite well, it can also be customized for specific file types to improve efficiency. For instance, our optimized CSV-specific chunker allows CSV file subsamples and permutations to be created with minimal additional storage. Curious how much deduplication is saving you? Check out the Materialized and Stored sizes listed on each repository page to see the statistics.
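To make the idea concrete, here is a minimal Python sketch of chunk-level deduplication. It is not XetHub's actual algorithm: the `ChunkStore` class, the 64-byte fixed-size baseline, and the one-chunk-per-row `csv_row_chunks` function are all hypothetical, chosen only to show why a format-aware chunker lets a shuffled copy and a subsample of a CSV file add very little to what is stored.

```python
import hashlib
import random


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept once, keyed by SHA-256."""

    def __init__(self):
        self.chunks = {}

    def add(self, chunks):
        """Store the chunks and return the list of keys that rebuilds the file."""
        recipe = []
        for chunk in chunks:
            key = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(key, chunk)  # duplicates are stored only once
            recipe.append(key)
        return recipe

    def read(self, recipe):
        """Rebuild a file from its list of chunk keys."""
        return b"".join(self.chunks[key] for key in recipe)

    def stored_size(self):
        """Total bytes of unique chunks actually kept."""
        return sum(len(chunk) for chunk in self.chunks.values())


def fixed_chunks(data, size=64):
    """Format-agnostic baseline (assumption): cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def csv_row_chunks(data):
    """Hypothetical CSV-aware chunker: cut at row boundaries."""
    return data.splitlines(keepends=True)


# Three files: an original table, a shuffled copy, and a 10% subsample.
header = b"id,value\n"
rows = [f"{i},{i * i}\n".encode() for i in range(1000)]
shuffled_rows = rows[:]
random.Random(0).shuffle(shuffled_rows)

original = header + b"".join(rows)
shuffled = header + b"".join(shuffled_rows)
subsample = header + b"".join(rows[::10])
files = (original, shuffled, subsample)


def stored_after(chunker):
    """Add all three files with the given chunker and return the bytes kept."""
    store = ChunkStore()
    recipes = [store.add(chunker(data)) for data in files]
    # Every file must still be reconstructible from its chunk keys.
    assert [store.read(recipe) for recipe in recipes] == list(files)
    return store.stored_size()


materialized = sum(len(data) for data in files)
fixed_stored = stored_after(fixed_chunks)
row_stored = stored_after(csv_row_chunks)
print(f"materialized={materialized} fixed={fixed_stored} row_aligned={row_stored}")
```

In this toy setup the row-aligned chunker keeps the original file once and the two derived files add no new chunks, while the fixed-size baseline has to store new chunks for them. A real chunker would use much larger chunks than a single row to keep per-chunk bookkeeping small.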

Speed at scale

"I love spending hours troubleshooting how to access my data!", said no one ever. Working with small data is easy. You can download it to your desktop and work directly with it, then put it on a file share or send it via email or Slack 🙈. If it's small enough, you can even put your data files in GitHub or GitLab for automatic tracking. At scale, however, things start to break down. Questions about where best to store data, how to access it, and how each access environment is limited have become a tax that each user has to pay.

Data scientists, ML engineers, and other data users spend an unpredictable number of hours in exploratory data analysis, featurization, and experimentation to test hypotheses. In a world where both human time and machine time are extremely costly, the ability to get your data easily, quickly, and consistently from any environment, without paying the troubleshooting tax, is essential.

XetHub lets users track and access all of their data in one place, with the ability to access it from anywhere at the speed of localhost. Our Xet mount feature provides a read-only view of any repository at any commit that loads in seconds, ready for use with any tool you have at hand.

When we say anywhere, we mean it. File system mount is also the easiest and most efficient way to access large data from GPU clusters. Stop spending compute time on data transfer and forget about manifests; just mount everything and read what you need. Run the same code to access data locally and in the cloud, making it easy to transition from development to production. Want even more speed? Dedicated high-performance distributed caches can be deployed next to your machines for wire-speed levels of throughput.

Sound too good to be true? Try it for yourself on our clone of the LAION-400M dataset.

Seamless collaboration

Few ML engineers work in isolation, but being on a team doesn't always translate to effective communication, especially when it comes to shared development. Our experience and user research both show that collaborating on shared data projects is a universal pain point because there is so much to keep in sync. What's the most recent version of the data? What has changed since the last revision? Why did it change?

This isn't a new problem, and there are great solutions... for code. The "hub" view of a code repository, as popularized by GitHub and GitLab, is perfect for collaborating on small repositories of up to a few gigabytes. Existing solutions handle big data either by linking to opaque storage buckets or by using pointer files, leaving it to the user to figure out their own data management story. The interface requirements for large-scale repositories are different, especially when it comes to browsability and understandability.

While our Git-backed internals ensure that every repository change is tracked, XetHub's web interface improves usability for large repositories, adding a file browsing pane for convenient exploration of complex directory structures with all types of files as well as built-in views for compatible large file types.

In addition to the typical Git-backed interface flows we're used to for code (e.g., issues, pull requests), we've also added the following features to improve team collaboration:

Proposing edits or adding new data? Add custom visualizations to your repository for rendered charts that update and display on comparison and pull request views.
Have huge CSV files? Our CSV file view shows column-wise summary statistics to help you understand the contents of a huge table at a glance. These statistics also show up on comparison views, like in this sample pull request.
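For a feel of what column-wise summary statistics look like, here is a rough Python sketch using only the standard library. It is an illustration, not XetHub's implementation: the `summarize_csv` function, the statistics it reports, and the sample table are all assumptions made for this example.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Return per-column summary statistics for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            # Short rows yield None for missing fields; treat them as empty.
            columns[name].append(row.get(name) or "")

    summary = {}
    for name, values in columns.items():
        present = [value for value in values if value != ""]
        stats = {
            "count": len(values),
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        try:
            numbers = [float(value) for value in present]
        except ValueError:
            numbers = []  # at least one value is not numeric: skip numeric stats
        if numbers:
            stats["min"] = min(numbers)
            stats["max"] = max(numbers)
            stats["mean"] = statistics.fmean(numbers)
        summary[name] = stats
    return summary


# Made-up sample table with one missing temperature.
sample = "city,temp_c\nOslo,4.5\nCairo,30\nLima,\n"
summary = summarize_csv(sample)
for column, stats in summary.items():
    print(column, stats)
```

Running the same function on two versions of a file and comparing the results gives a crude approximation of what a comparison view might surface, such as a jump in missing values or a shifted mean.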

Most importantly, by moving past pointer files and versioning data in the same way that we version code, we guarantee experimental reproducibility at every commit. Discover a dataset issue? Fix it in the repository and start a pull request to record and share the context of the update. Troubleshooting a performance regression? Mount or clone multiple versions of your XetHub repository to run comparisons.

While we've started small with updates for CSV files and custom visualizations, our goal is to make our platform flexible and extensible. Everyone should have the power to customize XetHub to fit the needs of their team and workflow.

What next?

Our work has just begun! We're excited about what we've built and can't wait to hear your feedback. Sign up for XetHub today to leverage the power of Git at 100x the scale — and let our data storage work for you.

Psst... like what we're doing? Join us on our quest to improve data collaboration!
