January 24, 2024
XetHub > Google Drive for LLM Fine-tuning in Colab
Srini Kadamati
Google Colab is a cloud notebook with attached compute that has become a very popular way to load, explore, and fine-tune large language models (LLMs). Colab gives free users access to CPU and GPU compute units, with the option to upgrade to Colab Pro for more compute and fewer restrictions.
We love Colab for prototyping and quick exploration, but we believe its storage and versioning options fall very short.
Google Drive… for data and model storage?
Google Colab steers you toward storing datasets and models in Google Drive, which has very poor ergonomics for data professionals.
Browsing & building context
Google Drive is fine for photos and text documents, but does a poor job rendering JSON, tabular datasets, model files, and more. This makes datasets and models very hard to discover.
Access Patterns
While you can mount entire folders from Google Drive in Google Colab, there is no convenient way to materialize only specific files or to access previous versions of a file.
Lack of Version Control
Colab users have turned to alternatives like Hugging Face to host their ML model files, blob stores (like S3) to host raw datasets, and GitHub for code & notebook versioning. To version and keep track of your long-running ML experiments, you need to version in 3 different places and somehow manage all of this overhead yourself. This problem is further magnified in a team environment.
In addition, it's extremely easy for you to overwrite files and very clunky to revert to an older version!
Poor collaboration features
While Google Docs, Sheets, etc. are associated with live collaboration, Google Drive itself offers few collaboration features, especially for collaborators who don't have Google accounts.
XetHub as a Google Drive Alternative
XetHub is a new kind of version control repo that can scale to handle large file types (up to 100 terabytes), provides useful context for most file types, has rich collaborative features (issues, pull requests, etc), and supports multiple access patterns.
This means that:
your commits can contain ALL of the context from a specific ML experiment
you can reproduce any past ML work by rewinding to the specific dataset, model, and code that produced it
you only need to share access to the repo, not 3 different repos in 3 different tools
To showcase the workflow, we’ll fine-tune Meta’s CodeLlama model in Google Colab from a XetHub repo.
Getting Started
Start by creating a free XetHub account and forking our XetHub repo.
Save a copy of the Colab notebook to your own Google Drive so you can edit it.
Run the first 3 cells in the notebook. In the 3rd cell, you’ll be asked to fill out:
Your XetHub username
Your XetHub email
Run the rest of the cells in the Colab notebook. Here’s a breakdown of the steps you’re executing in this notebook:
Install libraries for fine-tuning.
Install Git-Xet so you can access large files from XetHub.
Use Git-Xet to lazy clone our repo and then materialize (or download) the Code Llama 7B model. Note that this may take a few minutes, as gigabytes of files are downloaded to Colab’s local filesystem and the model is then loaded into memory.
We then establish baseline performance by asking the model to generate some code that uses our PyXet library. The generated code is highly erroneous.
Then, we fetch the source code for PyXet, tokenize it, and fine-tune the model using LoRA.
We then load the new model checkpoint and ask it to perform the same code generation task we did in the baseline.
We end by loading the new weights back into the original model, creating a Git commit, and pushing the changes back to XetHub.
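For intuition about the LoRA and merge steps above, here is a minimal pure-Python sketch of the low-rank update that LoRA trains: the original weight matrix W stays frozen, and a small product B·A, scaled by alpha/r, is added to it when the weights are merged. The matrix sizes, values, and helper names (`matmul`, `lora_merge`, `trainable_params`) are illustrative assumptions, not the notebook's code, which relies on a fine-tuning library rather than hand-written math.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_merge(w, b, a, alpha):
    """Return W + (alpha / r) * (B @ A).

    w is the frozen d x k weight, b is d x r, a is r x k, and r is the rank.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Return (full fine-tuning count, LoRA count) for one d x k weight."""
    return d * k, r * (d + k)


# Toy example (made-up numbers): d=2, k=3, rank r=1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, -0.5]]
merged = lora_merge(W, B, A, alpha=2.0)
print(merged)
print(trainable_params(4096, 4096, 8))
```

For a single 4096 x 4096 weight with rank 8, the adapter trains 65,536 values instead of 16,777,216, which is why this kind of fine-tuning fits within Colab's limited compute.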
Every change in XetHub is a Git commit and you can get helpful context on what changed.
In addition, XetHub natively supports rendering of common data formats and model files. Check out this model visualization in our file browser.
Experiment Further
Fine-tune Code Llama with your own Source Code Repo
We used our very own PyXet library in our example, but with just a small change you can run this Colab notebook to fine-tune Code Llama to generate better code in the context of your own database, library, or other software project!
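As a rough illustration of what pointing the notebook at your own project involves, the sketch below gathers Python source files from a local checkout and splits them into fixed-size text chunks ready for tokenization. The function names, the `.py` filter, and the chunk size are hypothetical choices for this example; the notebook's own data-preparation cell may work differently.

```python
from pathlib import Path


def collect_source_files(repo_dir, suffix=".py"):
    """Return sorted source file paths under repo_dir, skipping hidden dirs such as .git."""
    root = Path(repo_dir)
    files = []
    for path in root.rglob(f"*{suffix}"):
        rel_parts = path.relative_to(root).parts
        if path.is_file() and not any(part.startswith(".") for part in rel_parts):
            files.append(path)
    return sorted(files)


def chunk_text(text, chunk_chars=2000):
    """Split text into consecutive chunks of at most chunk_chars characters."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def build_samples(repo_dir, chunk_chars=2000):
    """Read every collected file and return a flat list of text chunks."""
    samples = []
    for path in collect_source_files(repo_dir):
        text = path.read_text(encoding="utf-8", errors="ignore")
        samples.extend(chunk_text(text, chunk_chars))
    return samples
```

Each returned chunk would then be passed to the model's tokenizer; splitting by characters is only a simple stand-in for splitting by tokens.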
We recently created the XetCache library for improving the reproducibility & rerun experience in Jupyter Notebook. Let’s see if we can fine-tune Code Llama to generate valid code for us using this library.
Add a new cell early in the notebook to create and check out a new branch (finetune/xetcache in this example).
Let’s establish a baseline by asking the Code Llama model to generate some relevant code for us.
Change the repo we want to fine-tune on in this cell.
Run the rest of the cells to run LoRA (which will take a while) and re-evaluate the same prompt. The baseline response was highly inaccurate (XetCache is a Python package, not an npm one).
The fine-tuned response is nearly a perfect match.
Use Git to commit the fine-tuned model on our finetune/xetcache branch and push it back to XetHub.
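The Git side of these steps (creating the branch early on, then committing and pushing the fine-tuned model) can be sketched in Python with plain `git` commands. The branch name comes from this post; the remote name `origin`, the commit message, and the helper names are assumptions for illustration, and a notebook may just as well run the same commands from shell cells.

```python
import subprocess

BRANCH = "finetune/xetcache"  # branch name used in this post


def branch_cmd(branch=BRANCH):
    """Command that creates a new branch and switches to it."""
    return ["git", "checkout", "-b", branch]


def commit_and_push_cmds(message, branch=BRANCH, remote="origin"):
    """Commands that stage everything, commit, and push the branch.

    The remote name "origin" is an assumption; adjust it to your setup.
    """
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", remote, branch],
    ]


def run_all(cmds, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Here `run_all([branch_cmd()], repo_dir)` would run before training and `run_all(commit_and_push_cmds("Fine-tune on XetCache"), repo_dir)` after it, where `repo_dir` is the path of your cloned repo.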
Next Steps
We hope you give this workflow a try! If you have feedback or run into issues, you can join us in our Xet Community Slack.
If you want to learn more about XetHub, check out our homepage.