December 18, 2023
Train ML Models Faster in Kubernetes by Mounting Training Datasets
Assaf Vayner
Srini Kadamati
Waiting on Data
Training large models can take a long time, and expensive resources like GPUs can easily spend most of their time just loading data. We’re willing to bet that “my models are still training” is the new “my code’s compiling”, only more costly.
While building up the ML data and compute infrastructure at Apple, several of our team members witnessed this problem at scale. Internal teams would download the same huge datasets over and over again on each node of their distributed training jobs, sometimes spending more time on the download than on actual training. To improve training speed and GPU utilization, we tried a variety of approaches like updating data loading logic and chunking datasets. These helped, but seemed heavy-handed. A more natural way would be to simply provide file system access to the containers themselves.
Our Solution: Mount
What if you could just mount your large datasets and stream whatever you need to access just-in-time as your code references specific chunks of data? We’ve built two plugins to support the most popular container types around:
Docker Plugin
It’s usually difficult to mount on Docker because mounting requires elevated CAP_SYS_ADMIN privileges. To get around this, we created a Docker volume plugin that connects to XetHub, enabling read-only mounts of XetHub repos from inside your Docker container without any extra permissions.
Kubernetes CSI Plugin
Naturally, once we released our Docker plugin, we were asked about supporting mount on Kubernetes! Kubernetes is being used more and more frequently for efficiently orchestrating ML workflows.
We built a simple node CSI plugin to mount your XetHub repo and fetch files on the fly so your primary applications can use the data without up-front downloads. The plugin sets up a read-only ephemeral volume that uses our git-xet mount process to access your repository.
Our plugin is entirely open source and you can find it here on GitHub.
Getting Started
Our installation process is very simple and currently uses only kubectl rather than Helm charts (open an issue if you want Helm charts!). You can find up-to-date documentation here.
The simplest way to install our plugin is to download and run our install script. You can do this with the following one-liner, or by following the local install steps documented in our README:
Once you have the plugin installed, you can create volumes and use them from within your pods! To set up a volume, add a volumes section to your pod's configuration file:
Then, in your pod's containers section, add a volume mount referencing the volume name created above: xet-flickr-30.
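As a rough sketch, the two configuration steps above can be assembled into a single pod manifest. The Python below builds that manifest as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The driver name, repo URL, attribute keys (repo, commit), pod name, container image, and mount path are placeholders or assumptions rather than values confirmed by this post, so take the real ones from the README. Only the volume name xet-flickr-30 comes from the text above, and the csi fields (driver, readOnly, volumeAttributes) are the standard Kubernetes fields for CSI ephemeral volumes.

```python
import json

# Placeholders and assumptions, NOT taken from this post: replace each one
# with the value documented in the plugin's README.
CSI_DRIVER = "<xet-csi-driver-name>"
REPO_URL = "<your-xethub-repo-url>"
MOUNT_PATH = "/data"            # hypothetical path inside the container
VOLUME_NAME = "xet-flickr-30"   # volume name used in the post


def build_pod() -> dict:
    """Return a Pod manifest with one read-only CSI ephemeral volume."""
    volume = {
        "name": VOLUME_NAME,
        "csi": {
            "driver": CSI_DRIVER,
            "readOnly": True,
            # Attribute keys are hypothetical; each CSI driver defines its own.
            "volumeAttributes": {"repo": REPO_URL, "commit": "main"},
        },
    }
    container = {
        "name": "trainer",
        "image": "python:3.11-slim",
        "command": ["ls", MOUNT_PATH],
        # The mount must reference the volume by exactly the same name.
        "volumeMounts": [
            {"name": VOLUME_NAME, "mountPath": MOUNT_PATH, "readOnly": True}
        ],
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "xet-mount-demo"},
        "spec": {"containers": [container], "volumes": [volume]},
    }


def unresolved_mounts(pod: dict) -> list:
    """List volumeMount names that do not match any declared volume."""
    declared = {v["name"] for v in pod["spec"].get("volumes", [])}
    return [
        mount["name"]
        for container in pod["spec"]["containers"]
        for mount in container.get("volumeMounts", [])
        if mount["name"] not in declared
    ]


if __name__ == "__main__":
    pod = build_pod()
    missing = unresolved_mounts(pod)
    if missing:
        raise SystemExit(f"volumeMounts without a matching volume: {missing}")
    print(json.dumps(pod, indent=2))
```

Saved under a hypothetical name such as make_pod.py, the output could be piped into kubectl apply -f - once the placeholders are filled in. The name check is there because a mismatch between the volume name and the mount name is an easy mistake to make when editing the two sections separately.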
Once you apply these changes, your container will have access to your XetHub repo under the mount path.
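To illustrate what that access might look like from training code, here is a minimal sketch that walks the mount path and reads files one at a time. It assumes the mount behaves like an ordinary read-only directory. The /data default, the XET_MOUNT_PATH environment variable, and the iter_files helper are all hypothetical names introduced for this example.

```python
import os
from itertools import islice
from pathlib import Path
from typing import Iterator, Tuple

# Hypothetical default; use whatever mountPath your pod spec declares.
DEFAULT_MOUNT_PATH = "/data"


def iter_files(root: Path, suffix: str = "") -> Iterator[Tuple[Path, bytes]]:
    """Lazily yield (path, contents) for files under root ending in suffix.

    The directory tree is listed up front (sorted for a deterministic order),
    but each file's contents are read only when the consumer asks for the
    next item.
    """
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.name.endswith(suffix):
            yield path, path.read_bytes()


if __name__ == "__main__":
    root = Path(os.environ.get("XET_MOUNT_PATH", DEFAULT_MOUNT_PATH))
    if root.is_dir():
        # Peek at the first few files without reading the rest.
        for path, data in islice(iter_files(root), 3):
            print(f"{path.relative_to(root)}: {len(data)} bytes")
    else:
        print(f"{root} is not mounted; nothing to read")
```

Because the generator reads a file only when the loop reaches it, a job that touches a small subset of a large dataset never opens the remaining files.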
To set up a volume with a private repository, you will need to create a secret in Kubernetes. Please follow the documentation in our README for how to do this.
Why Rust? 🦀
What can we say — we just love Rust! We wrote an NFS server implementation (nfsserve) and our Docker plugin (docker-volume-xetfs) both in Rust.
Most Kubernetes CSI drivers out there in the ether are written in Golang, as are most of the backing components of Kubernetes. The CSI spec is in essence a well laid-out gRPC spec, so we decided to lean into our love for and expertise in writing Rust.
Contributing
See room for improvement? Please contribute to help us improve! You're also welcome to join our Slack community, where you can interact with our team.