November 28, 2023

Streamline distributed ML training with Docker filesystem mounts

Ann Huang

Sam Horradarn

Joe Godlewski

Distributed training at scale is expensive. In past lives, we’ve seen GPUs sit idle while waiting for workers to become available, stay idle as each worker slowly downloads its data, and then sometimes watch the job fail altogether with an out-of-memory error halfway through. At XetHub, we want to stop the waste and make it easy for ML engineers to access their data without having to babysit it.

Typical large-scale distributed ML training jobs use a simple formula: run a Docker container with some compute that loads data from some object store. But if you’ve ever had to implement such a job, you know that’s only the tip of the iceberg. You have to start by writing code to partition your large data and coordinate worker activity so that each node picks up the right chunk of data at the right time. You also have to make sure your logic handles every possibility, or you’ll have wasted expensive hours of compute time on an incomplete and irrecoverable job. No pressure!
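To give a sense of the bookkeeping involved, here’s a minimal sketch of just the partitioning step: splitting a dataset into contiguous index ranges, one per worker. The function name `chunk_bounds` and its parameters are hypothetical, and a real job would still need download, retry, and coordination logic on top of this.

```python
from __future__ import annotations


def chunk_bounds(num_items: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the half-open [start, end) index range owned by worker `rank`."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    base, extra = divmod(num_items, world_size)
    # The first `extra` workers each take one additional item, so the
    # remainder is spread out instead of landing on the last worker.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


if __name__ == "__main__":
    # Split 10 items across 3 workers and show each worker's range.
    for rank in range(3):
        print(rank, chunk_bounds(10, 3, rank))
```

Every worker has to agree on `num_items` and `world_size` for the ranges to line up, which is exactly the kind of coordination that tends to go wrong in practice.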

What if you could just mount your large data on your Docker container and stream whatever you need to access? No data hacking, no out-of-memory errors, no pre-partitioning, no complicated logic, no slow download. Just mount and train. We’ve implemented a xethub/xetfs Docker volume plugin that can be used to mount XetHub repositories as Docker volumes for use in your containers.

What’s XetHub?

XetHub scales Git to work with large files and repositories, with advanced access patterns via Python, the Xet CLI, and now a Docker plugin to streamline ML workflows. Create a repository, install the extension, and simply add, commit, and push your data to make the repository available for access. We also provide a XetData integration that allows you to add and access large data from within your GitHub repository.

Using xethub/xetfs

The xethub/xetfs plugin works like other Docker volume drivers. Let’s walk through mounting the XetHub/Flickr30k repository into a container.

On the Docker host, local or remote:

Now create the volume. Using the -o options, we can specify the repository and the commit/branch to be mounted. If you’re using our GitHub integration, simply use the GitHub repo path instead.

docker volume create --driver xethub/xetfs \
   -o repo= \
   -o commit
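If you need to create the same volume on many hosts, it can help to script the command. Here’s a minimal sketch that only builds the argument list. The helper name `volume_create_cmd` is hypothetical, and we’re assuming both options take Docker’s usual `key=value` form and that the volume name goes last. The branch name `main` in the usage example is also an assumption.

```python
from __future__ import annotations

import shlex


def volume_create_cmd(volume: str, repo: str, commit: str) -> list[str]:
    """Build the argument list for `docker volume create` with the xethub/xetfs driver."""
    return [
        "docker", "volume", "create",
        "--driver", "xethub/xetfs",
        "-o", f"repo={repo}",      # repository to mount
        "-o", f"commit={commit}",  # commit or branch to mount (key=value form assumed)
        volume,                    # name of the Docker volume to create
    ]


if __name__ == "__main__":
    # "main" is an assumed branch name, used here only for illustration.
    cmd = volume_create_cmd("flickr30k", "XetHub/Flickr30k", "main")
    print(shlex.join(cmd))
```

The sketch prints the command rather than running it. To execute it on a host with Docker installed, pass the list to `subprocess.run(cmd, check=True)`.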

Once the volume is created, we can attach it to any container. This command mounts the flickr30k volume under the /app directory on an Ubuntu container and lists the contents:

docker run --rm -it -v flickr30k:/app ubuntu:latest ls -lR /app

Now that your volume is mounted, you can access files in the repository within seconds without downloading the whole 4.2 GB repository. To show the Flickr30k README, for instance:


Why a Docker plugin?

Our mount feature provides a filesystem-like interface to repositories that lazily fetches files as needed so there’s no need for full data downloads. While you may still have to wait for your workers to become available, it speeds up data access and takes disk space and memory limitations out of the equation so that workers can just focus on getting their job done. And since mount syntax is the same for local and remote machines, there’s no need to rewrite your training code for production workloads, saving an extra debug step.
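From the training code’s point of view, the mounted repository is just a directory. Here’s a minimal sketch of a worker reading only its own share of the files, assuming the mount serves directory listings without fetching file contents. The function name `iter_worker_files` is hypothetical, and the strided assignment is one simple choice among many.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterator


def iter_worker_files(
    root: str | Path, rank: int, world_size: int
) -> Iterator[tuple[Path, bytes]]:
    """Lazily yield (path, contents) for the files assigned to this worker."""
    # Sorting gives every worker the same ordering, so the strides do not overlap.
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    # Strided assignment: worker `rank` takes every `world_size`-th file.
    for path in files[rank::world_size]:
        # Contents are read only when the consumer asks for the next item.
        yield path, path.read_bytes()
```

Because the function only takes a directory path, the same code can point at `/app` inside a container or at a local mount on a laptop. How each worker learns its `rank` and `world_size` depends on your launcher and is not covered here.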

XetHub’s mount implementation relies on the mount system call to create the mounted filesystem. Unfortunately, in a containerized environment like Docker, access to the mount system call requires the CAP_SYS_ADMIN capability for the container. This capability acts as a catch-all, providing access to much more than just mount/umount. In general, we’ve found that infra teams managing large containerized clusters do not recommend granting CAP_SYS_ADMIN due to the privileges it provides and the associated security risks. However, Docker provides a way to set up a remote mount via Docker volumes. Volumes are typically used to mount a directory on the host machine to a path in a container, but various volume plugins exist to connect to remote storage systems.
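If you’re curious whether a given container actually holds this capability, here’s a minimal sketch that inspects the effective capability mask Linux exposes in `/proc/self/status`. CAP_SYS_ADMIN is bit 21 in `linux/capability.h`. The function names are hypothetical, and the check is only meaningful on Linux.

```python
from __future__ import annotations

CAP_SYS_ADMIN = 21  # bit index defined in <linux/capability.h>


def has_cap_sys_admin(cap_eff_hex: str) -> bool:
    """Return True if the CapEff bitmask (a hex string) includes CAP_SYS_ADMIN."""
    return bool(int(cap_eff_hex, 16) & (1 << CAP_SYS_ADMIN))


def current_cap_eff(status_path: str = "/proc/self/status") -> str | None:
    """Read the CapEff field for this process, or return None if unavailable."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        # No procfs (for example, on a non-Linux host) or the file is unreadable.
        return None
    return None


if __name__ == "__main__":
    mask = current_cap_eff()
    if mask is None:
        print("CapEff not available on this platform")
    else:
        print("CAP_SYS_ADMIN granted:", has_cap_sys_admin(mask))
```

Run inside a container, this reports whether mount-style system calls would be permitted at all, which is a quick way to confirm what your cluster’s security policy allows.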

By implementing a Docker plugin, our users no longer need to grant CAP_SYS_ADMIN to their containers to enable mount. Now you can manage your data using Git and easily stream it from your training jobs without having to manage the I/O.

In Rust we trust

The existing implementation provided by Docker is written in Go. At XetHub, we love Rust. It enables us to write safe, high-performance systems code at high velocity.

We would love the Docker and Rust communities to benefit, so we are open-sourcing our Rust implementation of the plugin: xetdata/docker-volume-xetfs. We’ve also open-sourced a helper crate for building Docker volume plugins using Rust: xetdata/docker-volume-rs, building on previous work done for nfsserve. We are also working on a CSI plugin to enable similar functionality for Kubernetes.

Give it a shot

Are you doing distributed ML training with huge data? If you’re iterating on data and models, and realizing that reproducibility is challenging with your source information spread across multiple storage silos, XetHub may be the versioning tool you’ve been looking for. Check out our website or add our XetData integration for GitHub to create a single source of truth for your ML workflows.

Simplify your ML development today

Stop juggling multiple tools and streamline your workflow with XetHub.
